TY - GEN
T1 - Comparative document analysis for large text corpora
AU - Ren, Xiang
AU - Lv, Yuanhua
AU - Wang, Kuansan
AU - Han, Jiawei
N1 - Publisher Copyright: © 2017 ACM.
PY - 2017/2/2
Y1 - 2017/2/2
N2 - This paper presents a novel research problem, Comparative Document Analysis (CDA), that is, joint discovery of commonalities and differences between two individual documents (or two sets of documents) in a large text corpus. Given any pair of documents from a (background) document collection, CDA aims to automatically identify sets of quality phrases to summarize the commonalities of both documents and highlight the distinctions of each with respect to the other informatively and concisely. Our solution uses a general graph-based framework to derive novel measures on phrase semantic commonality and pairwise distinction, where the background corpus is used for computing phrase-document semantic relevance. We use the measures to guide the selection of sets of phrases by solving two joint optimization problems. A scalable iterative algorithm is developed to integrate the maximization of phrase commonality or distinction measure with the learning of phrase-document semantic relevance. Experiments on large text corpora from two different domains (scientific papers and news) demonstrate the effectiveness and robustness of the proposed framework on comparing documents. Analysis on a 10GB+ text corpus demonstrates the scalability of our method, whose computation time grows linearly as the corpus size increases. Our case study on comparing news articles published at different dates shows the power of the proposed method on comparing sets of documents.
AB - This paper presents a novel research problem, Comparative Document Analysis (CDA), that is, joint discovery of commonalities and differences between two individual documents (or two sets of documents) in a large text corpus. Given any pair of documents from a (background) document collection, CDA aims to automatically identify sets of quality phrases to summarize the commonalities of both documents and highlight the distinctions of each with respect to the other informatively and concisely. Our solution uses a general graph-based framework to derive novel measures on phrase semantic commonality and pairwise distinction, where the background corpus is used for computing phrase-document semantic relevance. We use the measures to guide the selection of sets of phrases by solving two joint optimization problems. A scalable iterative algorithm is developed to integrate the maximization of phrase commonality or distinction measure with the learning of phrase-document semantic relevance. Experiments on large text corpora from two different domains (scientific papers and news) demonstrate the effectiveness and robustness of the proposed framework on comparing documents. Analysis on a 10GB+ text corpus demonstrates the scalability of our method, whose computation time grows linearly as the corpus size increases. Our case study on comparing news articles published at different dates shows the power of the proposed method on comparing sets of documents.
UR - http://www.scopus.com/inward/record.url?scp=85015271490&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85015271490&partnerID=8YFLogxK
U2 - 10.1145/3018661.3018690
DO - 10.1145/3018661.3018690
M3 - Conference contribution
AN - SCOPUS:85015271490
T3 - WSDM 2017 - Proceedings of the 10th ACM International Conference on Web Search and Data Mining
SP - 325
EP - 334
BT - WSDM 2017 - Proceedings of the 10th ACM International Conference on Web Search and Data Mining
PB - Association for Computing Machinery
T2 - 10th ACM International Conference on Web Search and Data Mining, WSDM 2017
Y2 - 6 February 2017 through 10 February 2017
ER -