TY - GEN
T1 - Confidence Estimation For LLM-Based Dialogue State Tracking
AU - Sun, Yi Jyun
AU - Dey, Suvodip
AU - Hakkani-Tur, Dilek
AU - Tur, Gokhan
N1 - This research was supported in part by Other Transaction award HR0011249XXX from the U.S. Defense Advanced Research Projects Agency (DARPA) Friction for Accountability in Conversational Transactions (FACT) program. This research project has benefited from the Microsoft Accelerate Foundation Models Research (AFMR) grant program, through which leading foundation models hosted by Microsoft Azure, along with access to Azure credits, were provided to conduct the research.
PY - 2024
Y1 - 2024
N2 - Estimation of a model's confidence on its outputs is critical for Conversational AI systems based on large language models (LLMs), especially for reducing hallucination and preventing over-reliance. In this work, we provide an exhaustive exploration of methods, including approaches proposed for open- and closed-weight LLMs, aimed at quantifying and leveraging model uncertainty to improve the reliability of LLM-generated responses, specifically focusing on dialogue state tracking (DST) in task-oriented dialogue systems (TODS). Regardless of the model type, well-calibrated confidence scores are essential to handle uncertainties, thereby improving model performance. We evaluate four methods for estimating confidence scores based on softmax, raw token scores, verbalized confidences, and a combination of these methods, using the area under the curve (AUC) metric to assess calibration, with higher AUC indicating better calibration. We also enhance these with a self-probing mechanism, proposed for closed models. Furthermore, we assess these methods using an open-weight model fine-tuned for the task of DST, achieving superior joint goal accuracy (JGA). Our findings also suggest that fine-tuning open-weight LLMs can result in enhanced AUC performance, indicating better confidence score calibration.
AB - Estimation of a model's confidence on its outputs is critical for Conversational AI systems based on large language models (LLMs), especially for reducing hallucination and preventing over-reliance. In this work, we provide an exhaustive exploration of methods, including approaches proposed for open- and closed-weight LLMs, aimed at quantifying and leveraging model uncertainty to improve the reliability of LLM-generated responses, specifically focusing on dialogue state tracking (DST) in task-oriented dialogue systems (TODS). Regardless of the model type, well-calibrated confidence scores are essential to handle uncertainties, thereby improving model performance. We evaluate four methods for estimating confidence scores based on softmax, raw token scores, verbalized confidences, and a combination of these methods, using the area under the curve (AUC) metric to assess calibration, with higher AUC indicating better calibration. We also enhance these with a self-probing mechanism, proposed for closed models. Furthermore, we assess these methods using an open-weight model fine-tuned for the task of DST, achieving superior joint goal accuracy (JGA). Our findings also suggest that fine-tuning open-weight LLMs can result in enhanced AUC performance, indicating better confidence score calibration.
KW - Task-oriented dialogue systems
KW - confidence scores
KW - dialogue state tracking
KW - model uncertainty
UR - http://www.scopus.com/inward/record.url?scp=85217426759&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85217426759&partnerID=8YFLogxK
U2 - 10.1109/SLT61566.2024.10832237
DO - 10.1109/SLT61566.2024.10832237
M3 - Conference contribution
AN - SCOPUS:85217426759
T3 - Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024
SP - 1083
EP - 1090
BT - Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 IEEE Spoken Language Technology Workshop, SLT 2024
Y2 - 2 December 2024 through 5 December 2024
ER -