TY - GEN
T1 - A simple baseline for audio-visual scene-aware dialog
AU - Schwartz, Idan
AU - Schwing, Alexander G.
AU - Hazan, Tamir
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/6
Y1 - 2019/6
N2 - The recently proposed audio-visual scene-aware dialog task paves the way to a more data-driven way of learning virtual assistants, smart speakers and car navigation systems. However, very little is known to date about how to effectively extract meaningful information from a plethora of sensors that pound the computational engine of those devices. Therefore, in this paper, we provide and carefully analyze a simple baseline for audio-visual scene-aware dialog which is trained end-to-end. Our method differentiates in a data-driven manner useful signals from distracting ones using an attention mechanism. We evaluate the proposed approach on the recently introduced and challenging audio-visual scene-aware dataset, and demonstrate the key features that permit to outperform the current state-of-the-art by more than 20% on CIDEr.
AB - The recently proposed audio-visual scene-aware dialog task paves the way to a more data-driven way of learning virtual assistants, smart speakers and car navigation systems. However, very little is known to date about how to effectively extract meaningful information from a plethora of sensors that pound the computational engine of those devices. Therefore, in this paper, we provide and carefully analyze a simple baseline for audio-visual scene-aware dialog which is trained end-to-end. Our method differentiates in a data-driven manner useful signals from distracting ones using an attention mechanism. We evaluate the proposed approach on the recently introduced and challenging audio-visual scene-aware dataset, and demonstrate the key features that permit to outperform the current state-of-the-art by more than 20% on CIDEr.
KW - Vision + Language
UR - http://www.scopus.com/inward/record.url?scp=85078726008&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85078726008&partnerID=8YFLogxK
U2 - 10.1109/CVPR.2019.01283
DO - 10.1109/CVPR.2019.01283
M3 - Conference contribution
AN - SCOPUS:85078726008
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 12540
EP - 12550
BT - Proceedings - 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019
PB - IEEE Computer Society
T2 - 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019
Y2 - 16 June 2019 through 20 June 2019
ER -