
CETR - Content extraction via tag ratios

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We present Content Extraction via Tag Ratios (CETR) - a method to extract content text from diverse webpages by using the HTML document's tag ratios. We describe how to compute tag ratios on a line-by-line basis and then cluster the resulting histogram into content and non-content areas. Initially, we find that the tag ratio histogram is not easily clustered because of its one-dimensionality; therefore we extend the original approach in order to model the data in two dimensions. Next, we present a tailored clustering technique which operates on the two-dimensional model, and then evaluate our approach against a large set of alternative methods using standard accuracy, precision and recall metrics on a large and varied Web corpus. Finally, we show that, in most cases, CETR achieves better content extraction performance than existing methods, especially across varying web domains, languages and styles.
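The line-by-line tag ratio computation mentioned in the abstract can be illustrated with a minimal sketch. It assumes the ratio for a line is the number of non-tag characters divided by the number of HTML tags on that line, with a tag count of zero treated as one; the abstract itself does not spell out this formula. The names `tag_ratios` and `content_line_indices` are hypothetical, the regex is a simplification rather than a real HTML parser, and the mean-based cutoff is only a naive stand-in for the clustering step, not the tailored two-dimensional technique the abstract describes.

```python
import re

# Simplified tag matcher (assumption): anything between '<' and '>' is a tag.
TAG_RE = re.compile(r"<[^>]*>")


def tag_ratios(html: str) -> list[float]:
    """Return one ratio per line: non-tag characters / number of tags."""
    ratios = []
    for line in html.splitlines():
        tag_count = len(TAG_RE.findall(line))
        text_length = len(TAG_RE.sub("", line).strip())
        # Assumption: a line with no tags is treated as having one tag,
        # which avoids division by zero.
        ratios.append(text_length / max(tag_count, 1))
    return ratios


def content_line_indices(html: str) -> list[int]:
    """Naive stand-in for clustering: keep lines whose ratio exceeds the mean."""
    ratios = tag_ratios(html)
    if not ratios:
        return []
    cutoff = sum(ratios) / len(ratios)
    return [i for i, ratio in enumerate(ratios) if ratio > cutoff]


sample = "\n".join([
    '<div><a href="/">Home</a><a href="/news">News</a></div>',
    "<p>Tag ratios compare the amount of text on a line to the number of tags.</p>",
    "<div><span>Share</span></div>",
])

for index, ratio in enumerate(tag_ratios(sample)):
    print(index, round(ratio, 2))
print(content_line_indices(sample))
```

In this sample the navigation and widget lines carry many tags and little text, so their ratios stay low, while the paragraph line scores high. A single cutoff of this kind is exactly the one-dimensional view that the abstract reports as hard to cluster, which is the motivation it gives for moving to a two-dimensional model.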

Original language: English (US)
Title of host publication: Proceedings of the 19th International Conference on World Wide Web, WWW '10
Pages: 971-980
Number of pages: 10
State: Published - 2010
Event: 19th International World Wide Web Conference, WWW2010 - Raleigh, NC, United States
Duration: Apr 26 2010 to Apr 30 2010

Publication series

Name: Proceedings of the 19th International Conference on World Wide Web, WWW '10

Other

Other: 19th International World Wide Web Conference, WWW2010
Country/Territory: United States
City: Raleigh, NC
Period: 4/26/10 to 4/30/10

Keywords

  • content extraction
  • tag ratio
  • world wide web

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
