Abstract
In this paper, we build an open-domain conversational system by combining internet browser interfaces with multimodal inputs and data mined from web search and browser logs. The work focuses on two novel components: (1) dynamic contextual adaptation of speech recognition and understanding models using visual context, and (2) fusion of users' speech and gesture inputs to understand their intents and associated arguments. We evaluated a real-time implementation of the multimodal dialog system with live test subjects in a living room setup, where users interacted with a television browser using gestures and speech. Gestures were captured with Microsoft Kinect skeleton tracking, and speech was recorded with a Kinect microphone array. Results show a 16% error rate reduction (ERR) for contextual ASR adaptation to clickable web page content, and a 7-10% ERR when gestures are used with speech. Analysis of the results suggests a strategy for selecting multimodal intent when users clearly and persistently indicate pointing intent (e.g., eye gaze), giving a 54.7% ERR over lexical features.
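The abstract does not define how the reported ERR figures are computed. A minimal sketch, assuming the common convention that ERR is the reduction in error rate relative to a baseline system, is shown below. The function name and the sample error rates are hypothetical and are not taken from the paper.

```python
def relative_error_rate_reduction(baseline_error: float, new_error: float) -> float:
    """Return the error rate reduction as a fraction of the baseline error rate.

    Assumption: ERR is measured relative to the baseline, i.e.
    (baseline - new) / baseline. The abstract does not state this explicitly.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical error rates chosen only for illustration (not from the paper):
# a baseline of 25% and an adapted system at 21%.
example_err = relative_error_rate_reduction(0.25, 0.21)
print(f"Relative ERR: {example_err:.1%}")
```

Under this convention, an absolute drop of four points from a 25% baseline corresponds to a relative reduction of about 16%, which is why relative and absolute reductions should not be compared directly.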
Original language | English (US) |
---|---|
Pages (from-to) | 96-101 |
Number of pages | 6 |
Journal | CEUR Workshop Proceedings |
Volume | 1012 |
State | Published - 2013 |
Externally published | Yes |
Event | 1st Workshop on Speech, Language and Audio in Multimedia, SLAM 2013 - Marseille, France |
Duration | Aug 22 2013 → Aug 23 2013 |
Keywords
- Conversational browsing
- Conversational search
- Multi-modal fusion
- Spoken dialog systems
- Spoken language understanding
ASJC Scopus subject areas
- General Computer Science