TY - GEN
T1 - Fast, diverse and accurate image captioning guided by part-of-speech
AU - Deshpande, Aditya
AU - Aneja, Jyoti
AU - Wang, Liwei
AU - Schwing, Alexander G.
AU - Forsyth, David
N1 - Funding Information:
Supported by NSF Grant No. 1718221 and ONR MURI Award N00014-16-1-2007.
Publisher Copyright:
© 2019 IEEE.
PY - 2019/6
Y1 - 2019/6
AB - Image captioning is an ambiguous problem, with many suitable captions for an image. To address ambiguity, beam search is the de facto method for sampling multiple captions. However, beam search is computationally expensive and known to produce generic captions. To address this concern, some variational auto-encoder (VAE) and generative adversarial net (GAN) based methods have been proposed. Though diverse, GAN and VAE are less accurate. In this paper, we first predict a meaningful summary of the image, then generate the caption based on that summary. We use part-of-speech as summaries, since our summary should drive caption generation. We achieve the trifecta: (1) High accuracy for the diverse captions as evaluated by standard captioning metrics and user studies; (2) Faster computation of diverse captions compared to beam search and diverse beam search; and (3) High diversity as evaluated by counting novel sentences, distinct n-grams and mutual overlap (i.e., mBleu-4) scores.
KW - Big Data
KW - Deep Learning
KW - Large Scale Methods
UR - http://www.scopus.com/inward/record.url?scp=85078808201&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85078808201&partnerID=8YFLogxK
U2 - 10.1109/CVPR.2019.01095
DO - 10.1109/CVPR.2019.01095
M3 - Conference contribution
AN - SCOPUS:85078808201
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 10687
EP - 10696
BT - Proceedings - 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019
PB - IEEE Computer Society
T2 - 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019
Y2 - 16 June 2019 through 20 June 2019
ER -