TY - GEN
T1 - Language models for image captioning: The quirks and what works
T2 - 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL-IJCNLP 2015
AU - Devlin, Jacob
AU - Cheng, Hao
AU - Fang, Hao
AU - Gupta, Saurabh
AU - Deng, Li
AU - He, Xiaodong
AU - Zweig, Geoffrey
AU - Mitchell, Margaret
N1 - Publisher Copyright:
© 2015 Association for Computational Linguistics.
PY - 2015
Y1 - 2015
N2 - Two recent approaches have achieved state-of-the-art results in image captioning. The first uses a pipelined process where a set of candidate words is generated by a convolutional neural network (CNN) trained on images, and then a maximum entropy (ME) language model is used to arrange these words into a coherent sentence. The second uses the penultimate activation layer of the CNN as input to a recurrent neural network (RNN) that then generates the caption sequence. In this paper, we compare the merits of these different language modeling approaches for the first time by using the same state-of-the-art CNN as input. We examine issues in the different approaches, including linguistic irregularities, caption repetition, and data set overlap. By combining key aspects of the ME and RNN methods, we achieve a new record performance over previously published results on the benchmark COCO dataset. However, the gains we see in BLEU do not translate to human judgments.
AB - Two recent approaches have achieved state-of-the-art results in image captioning. The first uses a pipelined process where a set of candidate words is generated by a convolutional neural network (CNN) trained on images, and then a maximum entropy (ME) language model is used to arrange these words into a coherent sentence. The second uses the penultimate activation layer of the CNN as input to a recurrent neural network (RNN) that then generates the caption sequence. In this paper, we compare the merits of these different language modeling approaches for the first time by using the same state-of-the-art CNN as input. We examine issues in the different approaches, including linguistic irregularities, caption repetition, and data set overlap. By combining key aspects of the ME and RNN methods, we achieve a new record performance over previously published results on the benchmark COCO dataset. However, the gains we see in BLEU do not translate to human judgments.
UR - http://www.scopus.com/inward/record.url?scp=84944096380&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84944096380&partnerID=8YFLogxK
U2 - 10.3115/v1/p15-2017
DO - 10.3115/v1/p15-2017
M3 - Conference contribution
AN - SCOPUS:84944096380
T3 - ACL-IJCNLP 2015 - 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Proceedings of the Conference
SP - 100
EP - 105
BT - ACL-IJCNLP 2015 - 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Proceedings of the Conference
PB - Association for Computational Linguistics (ACL)
Y2 - 26 July 2015 through 31 July 2015
ER -