TY - GEN
T1 - Improving Scientific Document Retrieval with Concept Coverage-based Query Set Generation
AU - Kang, Seong Ku
AU - Jin, Bowen
AU - Kweon, Wonbin
AU - Zhang, Yu
AU - Lee, Dongha
AU - Han, Jiawei
AU - Yu, Hwanjo
N1 - This work was supported by IITP grants funded by MSIT (No.2018-0-00584, No.2019-0-01906) and NRF grants funded by the MSIT (No.RS-2023-00217286, No.2020R1A2B5B03097210). It was also supported in part by US DARPA INCAS Program No. HR0011-21-C0165 and BRIES Program No. HR0011-24-3-0325, National Science Foundation IIS-19-56151, the Molecule Maker Lab Institute: An AI Research Institutes program supported by NSF under Award No. 2019897, and the Institute for Geospatial Understanding through an Integrative Discovery Environment (I-GUIDE) by NSF under Award No. 2118329.
PY - 2025/3/10
Y1 - 2025/3/10
N2 - In specialized fields like the scientific domain, constructing large-scale human-annotated datasets poses a significant challenge due to the need for domain expertise. Recent methods have employed large language models to generate synthetic queries, which serve as proxies for actual user queries. However, they lack control over the content generated, often resulting in incomplete coverage of academic concepts in documents. We introduce the Concept Coverage-based Query set Generation (CCQGen) framework, designed to generate a set of queries with comprehensive coverage of the document's concepts. A key distinction of CCQGen is that it adaptively adjusts the generation process based on the previously generated queries. We identify concepts not sufficiently covered by previous queries and leverage them as conditions for subsequent query generation. This approach guides each new query to complement the previous ones, aiding in a thorough understanding of the document. Extensive experiments demonstrate that CCQGen significantly enhances query quality and retrieval performance.
AB - In specialized fields like the scientific domain, constructing large-scale human-annotated datasets poses a significant challenge due to the need for domain expertise. Recent methods have employed large language models to generate synthetic queries, which serve as proxies for actual user queries. However, they lack control over the content generated, often resulting in incomplete coverage of academic concepts in documents. We introduce the Concept Coverage-based Query set Generation (CCQGen) framework, designed to generate a set of queries with comprehensive coverage of the document's concepts. A key distinction of CCQGen is that it adaptively adjusts the generation process based on the previously generated queries. We identify concepts not sufficiently covered by previous queries and leverage them as conditions for subsequent query generation. This approach guides each new query to complement the previous ones, aiding in a thorough understanding of the document. Extensive experiments demonstrate that CCQGen significantly enhances query quality and retrieval performance.
KW - Information retrieval
KW - Query generation
KW - Scientific document search
UR - http://www.scopus.com/inward/record.url?scp=105001672896&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105001672896&partnerID=8YFLogxK
U2 - 10.1145/3701551.3703544
DO - 10.1145/3701551.3703544
M3 - Conference contribution
AN - SCOPUS:105001672896
T3 - WSDM 2025 - Proceedings of the 18th ACM International Conference on Web Search and Data Mining
SP - 895
EP - 904
BT - WSDM 2025 - Proceedings of the 18th ACM International Conference on Web Search and Data Mining
PB - Association for Computing Machinery
T2 - 18th ACM International Conference on Web Search and Data Mining, WSDM 2025
Y2 - 10 March 2025 through 14 March 2025
ER -