TY - JOUR
T1 - Learning Two-Branch Neural Networks for Image-Text Matching Tasks
AU - Wang, Liwei
AU - Li, Yin
AU - Huang, Jing
AU - Lazebnik, Svetlana
N1 - Funding Information:
This material is based upon work supported by the National Science Foundation under Grants CIF-1302438 and IIS-1563727, Xerox UAC, and the Sloan Foundation. We would like to thank Bryan Plummer for providing features for region-phrase experiments, and Kevin Shih for thoughtful discussions on the similarity network and help with building the region-phrase experimental environment.
Publisher Copyright:
© 2019 IEEE.
PY - 2019/2/1
Y1 - 2019/2/1
N2 - Image-language matching tasks have recently attracted a lot of attention in the computer vision field. These tasks include image-sentence matching, i.e., given an image query, retrieving relevant sentences and vice versa, and region-phrase matching or visual grounding, i.e., matching a phrase to relevant regions. This paper investigates two-branch neural networks for learning the similarity between these two data modalities. We propose two network structures that produce different output representations. The first one, referred to as an embedding network, learns an explicit shared latent embedding space with a maximum-margin ranking loss and novel neighborhood constraints. Compared to standard triplet sampling, we perform improved neighborhood sampling that takes neighborhood information into consideration while constructing mini-batches. The second network structure, referred to as a similarity network, fuses the two branches via element-wise product and is trained with regression loss to directly predict a similarity score. Extensive experiments show that our networks achieve high accuracies for phrase localization on the Flickr30K Entities dataset and for bi-directional image-sentence retrieval on Flickr30K and MSCOCO datasets.
KW - Deep learning
KW - cross-modal retrieval
KW - image-sentence retrieval
KW - phrase localization
KW - visual grounding
UR - http://www.scopus.com/inward/record.url?scp=85040982943&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85040982943&partnerID=8YFLogxK
U2 - 10.1109/TPAMI.2018.2797921
DO - 10.1109/TPAMI.2018.2797921
M3 - Article
C2 - 29994350
AN - SCOPUS:85040982943
SN - 0162-8828
VL - 41
SP - 394
EP - 407
JO - IEEE Transactions on Pattern Analysis and Machine Intelligence
JF - IEEE Transactions on Pattern Analysis and Machine Intelligence
IS - 2
M1 - 8268651
ER -