TY - GEN
T1 - Conditional image-text embedding networks
AU - Plummer, Bryan A.
AU - Kordas, Paige
AU - Kiapour, M. Hadi
AU - Zheng, Shuai
AU - Piramuthu, Robinson
AU - Lazebnik, Svetlana
N1 - Publisher Copyright: © Springer Nature Switzerland AG 2018.
PY - 2018
Y1 - 2018
N2 - This paper presents an approach for grounding phrases in images which jointly learns multiple text-conditioned embeddings in a single end-to-end model. In order to differentiate text phrases into semantically distinct subspaces, we propose a concept weight branch that automatically assigns phrases to embeddings, whereas prior works predefine such assignments. Our proposed solution simplifies the representation requirements for individual embeddings and allows the underrepresented concepts to take advantage of the shared representations before feeding them into concept-specific layers. Comprehensive experiments verify the effectiveness of our approach across three phrase grounding datasets, Flickr30K Entities, ReferIt Game, and Visual Genome, where we obtain improvements of 4%, 3%, and 4%, respectively, in grounding performance over a strong region-phrase embedding baseline (Code: https://github.com/BryanPlummer/cite).
KW - Conditional models
KW - Embedding methods
KW - Natural language grounding
KW - Phrase localization
UR - http://www.scopus.com/inward/record.url?scp=85055107100&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85055107100&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-01258-8_16
DO - 10.1007/978-3-030-01258-8_16
M3 - Conference contribution
AN - SCOPUS:85055107100
SN - 9783030012571
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 258
EP - 274
BT - Computer Vision – ECCV 2018 - 15th European Conference, 2018, Proceedings
A2 - Hebert, Martial
A2 - Ferrari, Vittorio
A2 - Sminchisescu, Cristian
A2 - Weiss, Yair
PB - Springer
T2 - 15th European Conference on Computer Vision, ECCV 2018
Y2 - 8 September 2018 through 14 September 2018
ER -