TY - GEN
T1 - GenRES: Rethinking Evaluation for Generative Relation Extraction in the Era of Large Language Models
T2 - 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2024
AU - Jiang, Pengcheng
AU - Lin, Jiacheng
AU - Wang, Zifeng
AU - Sun, Jimeng
AU - Han, Jiawei
N1 - This work was supported in part by US DARPA KAIROS Program No. FA8750-19-2-1004 and INCAS Program No. HR001121C0165; National Science Foundation Grant IIS-19-56151; the Molecule Maker Lab Institute, an AI Research Institutes program supported by NSF under Award No. 2019897; and the Institute for Geospatial Understanding through an Integrative Discovery Environment (I-GUIDE), supported by NSF under Award No. 2118329. The work was also supported by NSF awards SCH-2205289, SCH-2014438, and IIS-2034479.
PY - 2024
Y1 - 2024
AB - The field of relation extraction (RE) is experiencing a notable shift towards generative relation extraction (GRE), leveraging the capabilities of large language models (LLMs). However, we discovered that traditional RE metrics like precision and recall fall short in evaluating GRE methods. This shortfall arises because these metrics rely on exact matching with human-annotated reference relations, whereas GRE methods often produce diverse and semantically accurate relations that differ from the references. To fill this gap, we introduce GenRES for a multidimensional assessment of GRE results in terms of topic similarity, uniqueness, granularity, factualness, and completeness. With GenRES, we empirically identified that (1) precision/recall fails to justify the performance of GRE methods; (2) human-annotated referential relations can be incomplete; (3) prompting LLMs with a fixed set of relations or entities can cause hallucinations. Next, we conducted a human evaluation of GRE methods showing that GenRES is consistent with human preferences for RE quality. Last, we conducted a comprehensive evaluation of fourteen leading LLMs using GenRES across document-, bag-, and sentence-level RE datasets to set the benchmark for future research in GRE.
UR - http://www.scopus.com/inward/record.url?scp=85200254454&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85200254454&partnerID=8YFLogxK
U2 - 10.18653/v1/2024.naacl-long.155
DO - 10.18653/v1/2024.naacl-long.155
M3 - Conference contribution
AN - SCOPUS:85200254454
T3 - Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2024
SP - 2820
EP - 2837
BT - Long Papers
A2 - Duh, Kevin
A2 - Gomez, Helena
A2 - Bethard, Steven
PB - Association for Computational Linguistics (ACL)
Y2 - 16 June 2024 through 21 June 2024
ER -