TY - JOUR
T1 - A multi-view embedding space for modeling internet images, tags, and their semantics
AU - Gong, Yunchao
AU - Ke, Qifa
AU - Isard, Michael
AU - Lazebnik, Svetlana
N1 - Funding Information:
The authors thank Mariyam Khalid for helping with the manual evaluation of the auto-tagging experiments. Gong and Lazebnik were supported by NSF grant IIS-1228082, DARPA Computer Science Study Group grant D12AP00305, and a Microsoft Research Faculty Fellowship.
PY - 2014/1
Y1 - 2014/1
N2 - This paper investigates the problem of modeling Internet images and associated text or tags for tasks such as image-to-image search, tag-to-image search, and image-to-tag search (image annotation). We start with canonical correlation analysis (CCA), a popular and successful approach for mapping visual and textual features to the same latent space, and incorporate a third view capturing high-level image semantics, represented either by a single category or multiple non-mutually-exclusive concepts. We present two ways to train the three-view embedding: supervised, with the third view coming from ground-truth labels or search keywords; and unsupervised, with semantic themes automatically obtained by clustering the tags. To ensure high accuracy for retrieval tasks while keeping the learning process scalable, we combine multiple strong visual features and use explicit nonlinear kernel mappings to efficiently approximate kernel CCA. To perform retrieval, we use a specially designed similarity function in the embedded space, which substantially outperforms the Euclidean distance. The resulting system produces compelling qualitative results and outperforms a number of two-view baselines on retrieval tasks on three large-scale Internet image datasets.
KW - Canonical correlation
KW - Image search
KW - Internet images
KW - Tags
UR - http://www.scopus.com/inward/record.url?scp=84894905366&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84894905366&partnerID=8YFLogxK
U2 - 10.1007/s11263-013-0658-4
DO - 10.1007/s11263-013-0658-4
M3 - Article
AN - SCOPUS:84894905366
SN - 0920-5691
VL - 106
SP - 210
EP - 233
JO - International Journal of Computer Vision
JF - International Journal of Computer Vision
IS - 2
ER -