TY - GEN
T1 - Facet-aware evaluation for extractive summarization
AU - Mao, Yuning
AU - Liu, Liyuan
AU - Zhu, Qi
AU - Ren, Xiang
AU - Han, Jiawei
N1 - We thank Woojeong Jin and Jiaming Shen for the valuable feedback on the paper draft. We thank anonymous reviewers for the constructive comments. Research was sponsored in part by US DARPA KAIROS Program No. FA8750-19-2-1004 and SocialSim Program No. W911NF-17-C-0099, National Science Foundation IIS 16-18481, IIS 17-04532, and IIS-17-41317, and DTRA HD-TRA11810026. Any opinions, findings, and conclusions or recommendations expressed herein are those of the authors and should not be interpreted as necessarily representing the views, either expressed or implied, of DARPA or the U.S. Government.
PY - 2020
Y1 - 2020
N2 - Commonly adopted metrics for extractive summarization focus on lexical overlap at the token level. In this paper, we present a facet-aware evaluation setup for better assessment of the information coverage in extracted summaries. Specifically, we treat each sentence in the reference summary as a facet, identify the sentences in the document that express the semantics of each facet as support sentences of the facet, and automatically evaluate extractive summarization methods by comparing the indices of extracted sentences and support sentences of all the facets in the reference summary. To facilitate this new evaluation setup, we construct an extractive version of the CNN/Daily Mail dataset and perform a thorough quantitative investigation, through which we demonstrate that facet-aware evaluation manifests better correlation with human judgment than ROUGE, enables fine-grained evaluation as well as comparative analysis, and reveals valuable insights of state-of-the-art summarization methods.
AB - Commonly adopted metrics for extractive summarization focus on lexical overlap at the token level. In this paper, we present a facet-aware evaluation setup for better assessment of the information coverage in extracted summaries. Specifically, we treat each sentence in the reference summary as a facet, identify the sentences in the document that express the semantics of each facet as support sentences of the facet, and automatically evaluate extractive summarization methods by comparing the indices of extracted sentences and support sentences of all the facets in the reference summary. To facilitate this new evaluation setup, we construct an extractive version of the CNN/Daily Mail dataset and perform a thorough quantitative investigation, through which we demonstrate that facet-aware evaluation manifests better correlation with human judgment than ROUGE, enables fine-grained evaluation as well as comparative analysis, and reveals valuable insights of state-of-the-art summarization methods.
UR - https://www.scopus.com/pages/publications/85098421497
UR - https://www.scopus.com/pages/publications/85098421497#tab=citedBy
U2 - 10.18653/v1/2020.acl-main.445
DO - 10.18653/v1/2020.acl-main.445
M3 - Conference contribution
AN - SCOPUS:85098421497
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 4941
EP - 4957
BT - ACL 2020 - 58th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference
PB - Association for Computational Linguistics (ACL)
T2 - 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020
Y2 - 5 July 2020 through 10 July 2020
ER -