TY - JOUR
T1 - Synthesizing Spoken Descriptions of Images
AU - Wang, Xinsheng
AU - van der Hout, Justin
AU - Zhu, Jihua
AU - Hasegawa-Johnson, Mark
AU - Scharenborg, Odette
N1 - Funding Information:
This work was supported in part by the National Key R&D Program of China under Grant 2018AAA0102504 and in part by the Key Research and Development Program of Shaanxi under Grant 2021GY-025. The work of Xinsheng Wang was supported by the China Scholarship Council (CSC).
Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
AB - Image captioning technology has great potential in many scenarios. However, current text-based image captioning methods cannot be applied to approximately half of the world's languages because these languages lack a written form. To address this problem, the image-to-speech task was recently proposed, which generates spoken descriptions of images, bypassing any text, via an intermediate representation consisting of phonemes (image-to-phoneme). Here, we present a comprehensive study on the image-to-speech task in which 1) several representative image-to-text generation methods are implemented for the image-to-phoneme task, 2) objective metrics are sought to evaluate the image-to-phoneme task, and 3) an end-to-end image-to-speech model that is able to synthesize spoken descriptions of images bypassing both text and phonemes is proposed. Extensive experiments are conducted on the public benchmark database Flickr8k. Results of our experiments demonstrate that 1) state-of-the-art image-to-text models can perform well on the image-to-phoneme task, and 2) several evaluation metrics, including BLEU3, BLEU4, BLEU5, and ROUGE-L, can be used to evaluate image-to-phoneme performance. Finally, 3) end-to-end image-to-speech bypassing text and phonemes is feasible.
KW - Image-to-speech generation
KW - cross-modal captioning
KW - multimodal modelling
KW - speech synthesis
UR - http://www.scopus.com/inward/record.url?scp=85118250966&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85118250966&partnerID=8YFLogxK
DO - 10.1109/TASLP.2021.3120644
M3 - Article
AN - SCOPUS:85118250966
SN - 2329-9290
VL - 29
SP - 3242
EP - 3254
JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing
JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing
ER -