Unsupervised Textual Grounding: Linking Words to Image Concepts

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Textual grounding, i.e., linking words to objects in images, is a challenging but important task for robotics and human-computer interaction. Existing techniques benefit from recent progress in deep learning and generally formulate the task as a supervised learning problem, selecting a bounding box from a set of possible options. Training these deep-net-based approaches requires access to a large-scale dataset, but constructing such a dataset is time-consuming and expensive. Therefore, we develop a completely unsupervised mechanism for textual grounding, using hypothesis testing to link words to detected image concepts. We demonstrate our approach on the ReferIt Game dataset and the Flickr30k data, outperforming baselines by 7.98% and 6.96%, respectively.
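
For readers skimming the abstract, the Python sketch below illustrates the co-occurrence idea behind linking a caption word to a detector output. It is not the authors' implementation: the toy corpus, the function name, and the use of a one-sided Fisher's exact test are all hypothetical stand-ins for the paper's actual hypothesis-testing formulation.

# Minimal sketch (not the authors' implementation) of linking a caption
# word to a detected image concept via a co-occurrence hypothesis test.
# The toy corpus and the one-sided Fisher's exact test are illustrative
# stand-ins for the paper's formulation.
from scipy.stats import fisher_exact

# Hypothetical data: for each image, the words of its caption and the
# concept labels fired by off-the-shelf detectors.
corpus = [
    ({"man", "riding", "bike"},   {"person", "bicycle"}),
    ({"woman", "with", "dog"},    {"person", "dog"}),
    ({"dog", "on", "grass"},      {"dog", "grass"}),
    ({"man", "walking"},          {"person"}),
    ({"small", "dog", "running"}, {"dog"}),
    ({"brown", "dog"},            {"dog", "grass"}),
    ({"two", "men", "talking"},   {"person"}),
    ({"red", "car", "parked"},    {"car"}),
]

def link_word_to_concept(word, concept, corpus, alpha=0.05):
    """Return whether `word` co-occurs with detector `concept` more
    often than chance, plus the p-value of the one-sided test."""
    a = sum(word in w and concept in c for w, c in corpus)       # both
    b = sum(word in w and concept not in c for w, c in corpus)   # word only
    c_ = sum(word not in w and concept in c for w, c in corpus)  # concept only
    d = len(corpus) - a - b - c_                                 # neither
    # One-sided Fisher's exact test: is the co-occurrence enriched?
    _, p = fisher_exact([[a, b], [c_, d]], alternative="greater")
    return p < alpha, p

linked, p = link_word_to_concept("dog", "dog", corpus)
print(f"link 'dog' -> detector concept 'dog': {linked} (p = {p:.3f})")
# Once a word is linked to a concept, grounding an image reduces to
# returning that detector's bounding box, as the abstract describes.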

Original language: English (US)
Title of host publication: Proceedings - 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018
Publisher: IEEE Computer Society
Pages: 6125-6134
Number of pages: 10
ISBN (Electronic): 9781538664209
DOI: 10.1109/CVPR.2018.00641
State: Published - Dec 14 2018
Event: 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018 - Salt Lake City, United States
Duration: Jun 18 2018 – Jun 22 2018

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
ISSN (Print): 1063-6919

Conference

Conference: 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018
Country: United States
City: Salt Lake City
Period: 6/18/18 – 6/22/18

Fingerprint

Electric grounding
Supervised learning
Human computer interaction
Robotics
Testing
Deep learning

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition

Cite this

Yeh, R. A., Do, M. N., & Schwing, A. G. (2018). Unsupervised Textual Grounding: Linking Words to Image Concepts. In Proceedings - 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018 (pp. 6125-6134). [8578739] (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition). IEEE Computer Society. https://doi.org/10.1109/CVPR.2018.00641

Unsupervised Textual Grounding: Linking Words to Image Concepts. / Yeh, Raymond A.; Do, Minh N.; Schwing, Alexander Gerhard.

Proceedings - 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018. IEEE Computer Society, 2018. p. 6125-6134 8578739 (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition).

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Yeh, RA, Do, MN & Schwing, AG 2018, Unsupervised Textual Grounding: Linking Words to Image Concepts. in Proceedings - 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018., 8578739, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, pp. 6125-6134, 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, United States, 6/18/18. https://doi.org/10.1109/CVPR.2018.00641
Yeh RA, Do MN, Schwing AG. Unsupervised Textual Grounding: Linking Words to Image Concepts. In Proceedings - 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018. IEEE Computer Society. 2018. p. 6125-6134. 8578739. (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition). https://doi.org/10.1109/CVPR.2018.00641
Yeh, Raymond A.; Do, Minh N.; Schwing, Alexander Gerhard. / Unsupervised Textual Grounding: Linking Words to Image Concepts. Proceedings - 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018. IEEE Computer Society, 2018. pp. 6125-6134 (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition).
@inproceedings{dd2a0081824541bab56810b250dc71e7,
title = "Unsupervised Textual Grounding: Linking Words to Image Concepts",
abstract = "Textual grounding, i.e., linking words to objects in images, is a challenging but important task for robotics and human-computer interaction. Existing techniques benefit from recent progress in deep learning and generally formulate the task as a supervised learning problem, selecting a bounding box from a set of possible options. To train these deep net based approaches, access to a large-scale datasets is required, however, constructing such a dataset is time-consuming and expensive. Therefore, we develop a completely unsupervised mechanism for textual grounding using hypothesis testing as a mechanism to link words to detected image concepts. We demonstrate our approach on the ReferIt Game dataset and the Flickr30k data, outperforming baselines by 7.98{\%} and 6.96{\%} respectively.",
author = "Yeh, {Raymond A.} and Do, {Minh N} and Schwing, {Alexander Gerhard}",
year = "2018",
month = "12",
day = "14",
doi = "10.1109/CVPR.2018.00641",
language = "English (US)",
series = "Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition",
publisher = "IEEE Computer Society",
pages = "6125--6134",
booktitle = "Proceedings - 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018",

}

TY - GEN

T1 - Unsupervised Textual Grounding

T2 - Linking Words to Image Concepts

AU - Yeh, Raymond A.

AU - Do, Minh N

AU - Schwing, Alexander Gerhard

PY - 2018/12/14

Y1 - 2018/12/14

N2 - Textual grounding, i.e., linking words to objects in images, is a challenging but important task for robotics and human-computer interaction. Existing techniques benefit from recent progress in deep learning and generally formulate the task as a supervised learning problem, selecting a bounding box from a set of possible options. Training these deep-net-based approaches requires access to a large-scale dataset, but constructing such a dataset is time-consuming and expensive. Therefore, we develop a completely unsupervised mechanism for textual grounding, using hypothesis testing to link words to detected image concepts. We demonstrate our approach on the ReferIt Game dataset and the Flickr30k data, outperforming baselines by 7.98% and 6.96%, respectively.

AB - Textual grounding, i.e., linking words to objects in images, is a challenging but important task for robotics and human-computer interaction. Existing techniques benefit from recent progress in deep learning and generally formulate the task as a supervised learning problem, selecting a bounding box from a set of possible options. Training these deep-net-based approaches requires access to a large-scale dataset, but constructing such a dataset is time-consuming and expensive. Therefore, we develop a completely unsupervised mechanism for textual grounding, using hypothesis testing to link words to detected image concepts. We demonstrate our approach on the ReferIt Game dataset and the Flickr30k data, outperforming baselines by 7.98% and 6.96%, respectively.

UR - http://www.scopus.com/inward/record.url?scp=85057880538&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85057880538&partnerID=8YFLogxK

U2 - 10.1109/CVPR.2018.00641

DO - 10.1109/CVPR.2018.00641

M3 - Conference contribution

AN - SCOPUS:85057880538

T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

SP - 6125

EP - 6134

BT - Proceedings - 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018

PB - IEEE Computer Society

ER -