TY - GEN
T1 - Eye gaze for spoken language understanding in multi-modal conversational interactions
AU - Hakkani-Tür, Dilek
AU - Slaney, Malcolm
AU - Celikyilmaz, Asli
AU - Heck, Larry
N1 - Publisher Copyright:
Copyright 2014 ACM.
PY - 2014/11/12
Y1 - 2014/11/12
AB - When humans converse with each other, they naturally amalgamate information from multiple modalities (i.e., speech, gestures, speech prosody, facial expressions, and eye gaze). This paper focuses on eye gaze and its combination with speech. We develop a model that resolves references to visual (screen) elements in a conversational web browsing system. The system detects eye gaze, recognizes speech, and then interprets the user's browsing intent (e.g., click on a specific element) through a combination of spoken language understanding and eye gaze tracking. We experiment with multi-turn interactions collected in a Wizard-of-Oz scenario where users are asked to perform several web-browsing tasks. We compare several gaze features and evaluate their effectiveness when combined with speech-based lexical features. The resulting multi-modal system not only increases user intent (turn) accuracy by 17%, but also resolves the referring expression ambiguity commonly observed in dialog systems with a 10% increase in F-measure.
KW - Eye gaze
KW - Reference resolution
KW - Spoken language understanding
UR - http://www.scopus.com/inward/record.url?scp=84947215502&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84947215502&partnerID=8YFLogxK
U2 - 10.1145/2663204.2663277
DO - 10.1145/2663204.2663277
M3 - Conference contribution
AN - SCOPUS:84947215502
T3 - ICMI 2014 - Proceedings of the 2014 International Conference on Multimodal Interaction
SP - 263
EP - 266
BT - ICMI 2014 - Proceedings of the 2014 International Conference on Multimodal Interaction
PB - Association for Computing Machinery
T2 - 16th ACM International Conference on Multimodal Interaction, ICMI 2014
Y2 - 12 November 2014 through 16 November 2014
ER -