TY - GEN
T1 - GLEN
T2 - 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023
AU - Zhan, Qiusi
AU - Li, Sha
AU - Conger, Kathryn
AU - Palmer, Martha
AU - Ji, Heng
AU - Han, Jiawei
N1 - Publisher Copyright:
©2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
N2 - The progress of event extraction research has been hindered by the absence of wide-coverage, large-scale datasets. To make event extraction systems more accessible, we build a general-purpose event detection dataset GLEN which covers 205K event mentions with 3,465 different types, making it more than 20x larger in ontology than today's largest event dataset. GLEN is created by utilizing the DWD Overlay, which provides a mapping between Wikidata Qnodes and PropBank rolesets. This enables us to use the abundant existing annotation for PropBank as distant supervision. In addition, we also propose a new multi-stage event detection model CEDAR specifically designed to handle the large ontology size in GLEN. We show that our model exhibits superior performance compared to a range of baselines including InstructGPT. Finally, we perform error analysis and show that label noise is still the largest challenge for improving performance for this new dataset.
AB - The progress of event extraction research has been hindered by the absence of wide-coverage, large-scale datasets. To make event extraction systems more accessible, we build a general-purpose event detection dataset GLEN which covers 205K event mentions with 3,465 different types, making it more than 20x larger in ontology than today's largest event dataset. GLEN is created by utilizing the DWD Overlay, which provides a mapping between Wikidata Qnodes and PropBank rolesets. This enables us to use the abundant existing annotation for PropBank as distant supervision. In addition, we also propose a new multi-stage event detection model CEDAR specifically designed to handle the large ontology size in GLEN. We show that our model exhibits superior performance compared to a range of baselines including InstructGPT. Finally, we perform error analysis and show that label noise is still the largest challenge for improving performance for this new dataset.
UR - http://www.scopus.com/inward/record.url?scp=85178394964&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85178394964&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85178394964
T3 - EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings
SP - 2823
EP - 2838
BT - EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings
A2 - Bouamor, Houda
A2 - Pino, Juan
A2 - Bali, Kalika
PB - Association for Computational Linguistics (ACL)
Y2 - 6 December 2023 through 10 December 2023
ER -