TY - GEN
T1 - Large Language Models as User-Agents For Evaluating Task-Oriented-Dialogue Systems
AU - Kazi, Taaha
AU - Lyu, Ruiliang
AU - Zhou, Sizhe
AU - Hakkani-Tur, Dilek
AU - Tur, Gokhan
N1 - This research was supported in part by Other Transaction award HR0011249XXX from the U.S. Defense Advanced Research Projects Agency (DARPA) Friction for Accountability in Conversational Transactions (FACT) program. This research project has benefited from the Microsoft Accelerate Foundation Models Research (AFMR) grant program, through which leading foundation models hosted by Microsoft Azure, along with access to Azure credits, were provided to conduct the research.
PY - 2024
Y1 - 2024
N2 - Traditionally, offline datasets have been used to evaluate task-oriented dialogue (TOD) models. These datasets lack context awareness, making them suboptimal benchmarks for conversational systems. In contrast, user-agents, which are context-aware, can simulate the variability and unpredictability of human conversations, making them better alternatives as evaluators. Prior research has utilized large language models (LLMs) to develop user-agents. Our work builds upon this by using LLMs to create user-agents for the evaluation of TOD systems. This involves prompting an LLM, using in-context examples as guidance, and tracking the user-goal state. Our evaluation of diversity and task completion metrics for the user-agents shows improved performance with the use of better prompts. Additionally, we propose methodologies for the automatic evaluation of TOD models within this dynamic framework. We make our code publicly available.
AB - Traditionally, offline datasets have been used to evaluate task-oriented dialogue (TOD) models. These datasets lack context awareness, making them suboptimal benchmarks for conversational systems. In contrast, user-agents, which are context-aware, can simulate the variability and unpredictability of human conversations, making them better alternatives as evaluators. Prior research has utilized large language models (LLMs) to develop user-agents. Our work builds upon this by using LLMs to create user-agents for the evaluation of TOD systems. This involves prompting an LLM, using in-context examples as guidance, and tracking the user-goal state. Our evaluation of diversity and task completion metrics for the user-agents shows improved performance with the use of better prompts. Additionally, we propose methodologies for the automatic evaluation of TOD models within this dynamic framework. We make our code publicly available.
KW - Task-oriented dialogue systems
KW - large language models
KW - task completion
KW - user simulation agents
UR - http://www.scopus.com/inward/record.url?scp=85217418974&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85217418974&partnerID=8YFLogxK
U2 - 10.1109/SLT61566.2024.10832298
DO - 10.1109/SLT61566.2024.10832298
M3 - Conference contribution
AN - SCOPUS:85217418974
T3 - Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024
SP - 913
EP - 920
BT - Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 IEEE Spoken Language Technology Workshop, SLT 2024
Y2 - 2 December 2024 through 5 December 2024
ER -