Comparative document analysis for large text corpora

Xiang Ren, Yuanhua Lv, Kuansan Wang, Jiawei Han

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

This paper presents a novel research problem, Comparative Document Analysis (CDA), that is, joint discovery of commonalities and differences between two individual documents (or two sets of documents) in a large text corpus. Given any pair of documents from a (background) document collection, CDA aims to automatically identify sets of quality phrases to summarize the commonalities of both documents and highlight the distinctions of each with respect to the other informatively and concisely. Our solution uses a general graph-based framework to derive novel measures on phrase semantic commonality and pairwise distinction, where the background corpus is used for computing phrase-document semantic relevance. We use the measures to guide the selection of sets of phrases by solving two joint optimization problems. A scalable iterative algorithm is developed to integrate the maximization of phrase commonality or distinction measure with the learning of phrase-document semantic relevance. Experiments on large text corpora from two different domains (scientific papers and news) demonstrate the effectiveness and robustness of the proposed framework on comparing documents. Analysis on a 10GB+ text corpus demonstrates the scalability of our method, whose computation time grows linearly as the corpus size increases. Our case study on comparing news articles published at different dates shows the power of the proposed method on comparing sets of documents.
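The abstract does not give the commonality and distinction measures themselves, so the following is only a toy sketch of the CDA task setup, not the authors' method. It assumes that phrases are already extracted, that phrase-document relevance can be approximated by a TF-IDF style score over the background corpus, and that phrase sets are picked by simple independent top-k ranking instead of the joint optimization described above. All function and variable names are hypothetical.

```python
import math
from collections import Counter


def relevance(phrases, doc_phrases, background):
    """Toy phrase-document relevance (an assumption, not the paper's measure):
    phrase frequency in the document times a smoothed IDF over the background."""
    tf = Counter(doc_phrases)
    n = len(background)
    scores = {}
    for p in phrases:
        df = sum(1 for d in background if p in d)  # background docs containing p
        idf = math.log((1 + n) / (1 + df)) + 1.0
        scores[p] = tf[p] * idf
    return scores


def compare(doc_a, doc_b, background, k=2):
    """Return (common, distinct_a, distinct_b) phrase lists, each of size <= k."""
    vocab = set(doc_a) | set(doc_b)
    ra = relevance(vocab, doc_a, background)
    rb = relevance(vocab, doc_b, background)

    def top(score):
        # Keep phrases with a positive score; break ties alphabetically.
        kept = [p for p in vocab if score(p) > 0]
        return sorted(kept, key=lambda p: (-score(p), p))[:k]

    common = top(lambda p: min(ra[p], rb[p]))   # relevant to both documents
    distinct_a = top(lambda p: ra[p] - rb[p])   # more relevant to doc_a
    distinct_b = top(lambda p: rb[p] - ra[p])   # more relevant to doc_b
    return common, distinct_a, distinct_b


# Hypothetical toy data: each document is a list of pre-extracted phrases.
doc_a = ["phrase mining", "graph model", "graph model", "scalability"]
doc_b = ["topic model", "graph model", "news"]
background = [set(doc_a), set(doc_b), {"news", "topic model"}]

common, distinct_a, distinct_b = compare(doc_a, doc_b, background)
print("common:", common)
print("distinct to A:", distinct_a)
print("distinct to B:", distinct_b)
```

Ranking the three sets independently keeps the sketch short, but it lets a phrase score well on both commonality and distinction. Avoiding that kind of overlap is presumably one reason the paper selects the phrase sets jointly.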

Original language: English (US)
Title of host publication: WSDM 2017 - Proceedings of the 10th ACM International Conference on Web Search and Data Mining
Publisher: Association for Computing Machinery, Inc
Pages: 325-334
Number of pages: 10
ISBN (Electronic): 9781450346757
State: Published - Feb 2 2017
Event: 10th ACM International Conference on Web Search and Data Mining, WSDM 2017 - Cambridge, United Kingdom
Duration: Feb 6 2017 to Feb 10 2017

Publication series

Name: WSDM 2017 - Proceedings of the 10th ACM International Conference on Web Search and Data Mining

Other

Other: 10th ACM International Conference on Web Search and Data Mining, WSDM 2017
Country: United Kingdom
City: Cambridge
Period: 2/6/17 to 2/10/17

ASJC Scopus subject areas

  • Computer Science Applications
  • Information Systems
  • Computer Networks and Communications
  • Software

