TY - GEN
T1 - REO-Relevance, Extraness, Omission: A Fine-grained Evaluation for Image Captioning
T2 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019
AU - Jiang, Ming
AU - Hu, Junjie
AU - Huang, Qiuyuan
AU - Zhang, Lei
AU - Diesner, Jana
AU - Gao, Jianfeng
N1 - We thank the anonymous reviewers for their constructive comments and insightful suggestions. This work was partly performed when Ming Jiang was interning at Microsoft Research. The authors would like to thank Pengchuan Zhang for his help with pre-training the grounding model.
PY - 2019
Y1 - 2019
AB - Popular metrics used for evaluating image captioning systems, such as BLEU and CIDEr, provide a single score to gauge the system's overall effectiveness. This score is often not informative enough to indicate what specific errors are made by a given system. In this study, we present a fine-grained evaluation method REO for automatically measuring the performance of image captioning systems. REO assesses the quality of captions from three perspectives: 1) Relevance to the ground truth, 2) Extraness of the content that is irrelevant to the ground truth, and 3) Omission of the elements in the images and human references. Experiments on three benchmark datasets demonstrate that our method achieves a higher consistency with human judgments and provides more intuitive evaluation results than alternative metrics.
UR - http://www.scopus.com/inward/record.url?scp=85084288070&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85084288070&partnerID=8YFLogxK
U2 - 10.18653/v1/D19-1156
DO - 10.18653/v1/D19-1156
M3 - Conference contribution
AN - SCOPUS:85084288070
T3 - EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference
SP - 1475
EP - 1480
BT - EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference
PB - Association for Computational Linguistics
Y2 - 3 November 2019 through 7 November 2019
ER -