TY - CONF
T1 - CETR - Content extraction via tag ratios
AU - Weninger, Tim
AU - Hsu, William H.
AU - Han, Jiawei
PY - 2010
Y1 - 2010
N2 - We present Content Extraction via Tag Ratios (CETR) - a method to extract content text from diverse webpages by using the HTML document's tag ratios. We describe how to compute tag ratios on a line-by-line basis and then cluster the resulting histogram into content and non-content areas. Initially, we find that the tag ratio histogram is not easily clustered because of its one-dimensionality; therefore we extend the original approach in order to model the data in two dimensions. Next, we present a tailored clustering technique which operates on the two-dimensional model, and then evaluate our approach against a large set of alternative methods using standard accuracy, precision and recall metrics on a large and varied Web corpus. Finally, we show that, in most cases, CETR achieves better content extraction performance than existing methods, especially across varying web domains, languages and styles.
KW - content extraction
KW - tag ratio
KW - world wide web
UR - https://www.scopus.com/pages/publications/77954569037
UR - https://www.scopus.com/pages/publications/77954569037#tab=citedBy
U2 - 10.1145/1772690.1772789
DO - 10.1145/1772690.1772789
M3 - Conference contribution
AN - SCOPUS:77954569037
SN - 9781605587998
T3 - Proceedings of the 19th International Conference on World Wide Web, WWW '10
SP - 971
EP - 980
BT - Proceedings of the 19th International Conference on World Wide Web, WWW '10
T2 - 19th International World Wide Web Conference, WWW2010
Y2 - 26 April 2010 through 30 April 2010
ER -