Multi-modal conversational search and browse

Larry Heck, Dilek Hakkani-Tür, Madhu Chinthakunta, Gokhan Tur, Rukmini Iyer, Partha Parthasarathy, Lisa Stifelman, Elizabeth Shriberg, Ashley Fidler

Research output: Contribution to journal › Conference article › peer-review

Abstract

In this paper, we create an open-domain conversational system by combining the power of internet browser interfaces with multi-modal inputs and data mined from web search and browser logs. The work focuses on two novel components: (1) dynamic contextual adaptation of speech recognition and understanding models using visual context, and (2) fusion of users' speech and gesture inputs to understand their intents and associated arguments. The system was evaluated in a living room setup with live test subjects on a real-time implementation of the multi-modal dialog system. Users interacted with a television browser using gestures and speech. Gestures were captured by Microsoft Kinect skeleton tracking and speech was recorded by a Kinect microphone array. Results show a 16% error rate reduction (ERR) for contextual ASR adaptation to clickable web page content, and a 7-10% ERR when using gestures with speech. Analysis of the results suggests a strategy for selection of multimodal intent when users clearly and persistently indicate pointing intent (e.g., eye gaze), giving a 54.7% ERR over lexical features.
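To illustrate the first component described in the abstract (contextual adaptation of recognition to on-screen content), the minimal sketch below shows one plausible way clickable page text could be used to rescore an ASR n-best list. The function names, scoring scheme, weight, and example data are hypothetical assumptions for illustration, not the paper's actual method.

```python
# Hypothetical sketch: rescoring ASR n-best hypotheses with visual context
# (clickable phrases on the currently displayed web page). The scoring
# scheme and weight are illustrative assumptions, not the paper's method.

def context_bonus(hypothesis: str, clickable_phrases: list[str]) -> float:
    """Fraction of hypothesis words that also appear in on-screen clickable text."""
    hyp_words = hypothesis.lower().split()
    context_words = {w for phrase in clickable_phrases for w in phrase.lower().split()}
    if not hyp_words:
        return 0.0
    return sum(w in context_words for w in hyp_words) / len(hyp_words)

def rescore_nbest(nbest, clickable_phrases, weight=2.0):
    """Combine the recognizer's score with a context bonus and re-rank."""
    rescored = [
        (hyp, score + weight * context_bonus(hyp, clickable_phrases))
        for hyp, score in nbest
    ]
    return sorted(rescored, key=lambda x: x[1], reverse=True)

if __name__ == "__main__":
    # N-best list from the recognizer: (hypothesis, log-score); values are made up.
    nbest = [
        ("show me the man you", -3.9),
        ("show me the menu", -4.1),
    ]
    # Clickable anchors scraped from the page shown in the television browser.
    clickable = ["Menu", "Reservations", "Directions"]
    print(rescore_nbest(nbest, clickable))
```

In this toy run, the hypothesis matching the on-screen word "menu" overtakes the acoustically higher-scoring but contextually implausible one, which is the intuition behind biasing recognition toward visible, clickable content.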

Original language: English (US)
Pages (from-to): 96-101
Number of pages: 6
Journal: CEUR Workshop Proceedings
Volume: 1012
State: Published - 2013
Externally published: Yes
Event: 1st Workshop on Speech, Language and Audio in Multimedia, SLAM 2013 - Marseille, France
Duration: Aug 22 2013 - Aug 23 2013

Keywords

  • Conversational browsing
  • Conversational search
  • Multi-modal fusion
  • Spoken dialog systems
  • Spoken language understanding

ASJC Scopus subject areas

  • General Computer Science
