TY - JOUR
T1 - Evaluating automatically generated phoneme captions for images
AU - van der Hout, Justin
AU - D'Haese, Zoltán
AU - Hasegawa-Johnson, Mark
AU - Scharenborg, Odette
N1 - Funding Information:
The authors thank Markus Müller for creating the phonetic captions of the Flickr8k corpus, and the workers from Amazon Mechanical Turk for evaluating our captions.
Publisher Copyright:
© 2020 ISCA
PY - 2020
Y1 - 2020
AB - Image2Speech is the relatively new task of generating a spoken description of an image. This paper presents an investigation into the evaluation of this task. For this, first an Image2Speech system was implemented which generates image captions consisting of phoneme sequences. This system outperformed the original Image2Speech system on the Flickr8k corpus. Subsequently, these phoneme captions were converted into sentences of words. The captions were rated by human evaluators for their goodness of describing the image. Finally, several objective metric scores of the results were correlated with these human ratings. Although BLEU4 does not perfectly correlate with human ratings, it obtained the highest correlation among the investigated metrics, and is the best currently existing metric for the Image2Speech task. Current metrics are limited by the fact that they assume their input to be words. A more appropriate metric for the Image2Speech task should assume its input to be parts of words, i.e. phonemes, instead.
KW - Image captioning
KW - Speech
KW - Unwritten languages
UR - http://www.scopus.com/inward/record.url?scp=85098111607&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85098111607&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2020-2870
DO - 10.21437/Interspeech.2020-2870
M3 - Conference article
AN - SCOPUS:85098111607
VL - 2020-October
SP - 2317
EP - 2321
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SN - 2308-457X
T2 - 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Y2 - 25 October 2020 through 29 October 2020
ER -