Hierarchical web-page clustering via in-page and cross-page link structures

Cindy Xide Lin, Yintao Yu, Jiawei Han, Bing Liu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Despite the wide diversity of web pages, those residing within a particular organization are, in most cases, organized in semantically hierarchical structures. For example, the website of a computer science department contains pages about its people, courses, and research; the pages about people are categorized into faculty, staff, and students, and the pages about research branch into different areas. Uncovering such hierarchical structures could give users a convenient way to navigate a site comprehensively and could accelerate other web mining tasks. In this study, we extract a similarity matrix among pages via in-page and cross-page link structures and, based on it, develop a density-based clustering algorithm that hierarchically groups densely linked web pages into semantic clusters. Our experiments show that this method is efficient and effective, and that it sheds light on mining and exploring web structures.
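
The abstract does not spell out how the similarity matrix or the clustering is computed, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a toy link graph (the `links` dictionary and all page names are made up), uses Jaccard overlap of link neighborhoods as a stand-in similarity, and replaces the density-based algorithm with plain thresholded connected components. Running it at several thresholds gives nested clusterings, which is one simple way to obtain a hierarchy.

```python
from itertools import combinations


def link_similarity(links):
    """Jaccard similarity between the link neighborhoods of every page pair.

    `links` maps a page to the pages it links to. A page's neighborhood is
    the set of pages it links to or is linked from (an assumed measure).
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    neigh = {p: set() for p in pages}
    for src, targets in links.items():
        for dst in targets:
            neigh[src].add(dst)
            neigh[dst].add(src)
    sim = {}
    for a, b in combinations(pages, 2):
        union = neigh[a] | neigh[b]
        sim[(a, b)] = len(neigh[a] & neigh[b]) / len(union) if union else 0.0
    return pages, sim


def clusters_at(pages, sim, threshold):
    """Connected components of the graph that keeps pairs with sim >= threshold."""
    parent = {p: p for p in pages}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), s in sim.items():
        if s >= threshold:
            parent[find(a)] = find(b)
    groups = {}
    for p in pages:
        groups.setdefault(find(p), []).append(p)
    return sorted(sorted(g) for g in groups.values())


def hierarchy(links, thresholds=(0.3, 0.8)):
    """Cluster at each threshold; lower thresholds yield coarser clusters."""
    pages, sim = link_similarity(links)
    return {t: clusters_at(pages, sim, t) for t in thresholds}


# Hypothetical department site, loosely following the example in the abstract.
links = {
    "home": ["people", "research"],
    "people": ["faculty", "students"],
    "faculty": ["alice", "bob"],
    "students": ["carol", "dave"],
    "research": ["ai", "db"],
}
levels = hierarchy(links)
for t, groups in sorted(levels.items(), reverse=True):
    print(t, groups)
```

Because every pair kept at a high threshold is also kept at a lower one, each fine-grained cluster is contained in exactly one coarser cluster, so the levels nest. A real density-based method would additionally require a minimum neighborhood size before treating a page as part of a dense region.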

Original language: English (US)
Title of host publication: Advances in Knowledge Discovery and Data Mining - 14th Pacific-Asia Conference, PAKDD 2010, Proceedings
Pages: 222-229
Number of pages: 8
Edition: PART 2
State: Published - 2010
Event: 14th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2010 - Hyderabad, India
Duration: Jun 21 2010 → Jun 24 2010

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Number: PART 2
Volume: 6119 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 14th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2010
Country/Territory: India
City: Hyderabad
Period: 6/21/10 → 6/24/10

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)
