TY - GEN
T1 - Unsupervised Story Discovery from Continuous News Streams via Scalable Thematic Embedding
AU - Yoon, Susik
AU - Lee, Dongha
AU - Zhang, Yunyi
AU - Han, Jiawei
N1 - The first author was supported by the National Research Foundation of Korea (Basic Science Research Program: 2021R1A6A3A14043765). The research was supported in part by US DARPA (FA8750-19-2-1004 and HR001121C0165), National Science Foundation (IIS-19-56151, IIS-17-41317, IIS 17-04532, 2019897, and 2118329), and the Institute of Information and Communications Technology Planning and Evaluation (IITP) in Korea (2020-0-01361).
PY - 2023/7/19
Y1 - 2023/7/19
AB - Unsupervised discovery of stories with correlated news articles in real time helps people digest massive news streams without expensive human annotations. A common approach of the existing studies for unsupervised online story discovery is to represent news articles with symbolic- or graph-based embedding and incrementally cluster them into stories. Recent large language models are expected to improve the embedding further, but a straightforward adoption of the models by indiscriminately encoding all information in articles is ineffective for dealing with text-rich and evolving news streams. In this work, we propose a novel thematic embedding with an off-the-shelf pretrained sentence encoder to dynamically represent articles and stories by considering their shared temporal themes. To realize the idea for unsupervised online story discovery, a scalable framework USTORY is introduced with two main techniques, theme- and time-aware dynamic embedding and novelty-aware adaptive clustering, fueled by lightweight story summaries. A thorough evaluation with real news data sets demonstrates that USTORY achieves higher story discovery performance than baselines while being robust and scalable to various streaming settings.
KW - Document Embedding
KW - News Story Discovery
KW - News Stream Mining
UR - http://www.scopus.com/inward/record.url?scp=85168656299&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85168656299&partnerID=8YFLogxK
U2 - 10.1145/3539618.3591782
DO - 10.1145/3539618.3591782
M3 - Conference contribution
AN - SCOPUS:85168656299
T3 - SIGIR 2023 - Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
SP - 802
EP - 811
BT - SIGIR 2023 - Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
PB - Association for Computing Machinery
T2 - 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2023
Y2 - 23 July 2023 through 27 July 2023
ER -