TY - JOUR
T1 - Still no lie detector for language models
T2 - probing empirical and conceptual roadblocks
AU - Levinstein, Benjamin A.
AU - Herrmann, Daniel A.
N1 - Thanks to Amos Azaria, Nora Belrose, Dylan Bowman, Nick Cohen, Daniel Filan, Jacqueline Harding, Aydin Mohseni, Bruce Rushing, Murray Shanahan, Nate Sharadin, Julia Staffel, and audiences at UMass Amherst, the University of Rochester, and the Center for AI Safety for helpful comments and feedback. Special thanks to Amos Azaria and Tom Mitchell jointly for access to their code and datasets. We are grateful to the Center for AI Safety for use of their compute cluster. B.L. was partly supported by a Mellon New Directions Fellowship (number 1905-06835) and by Open Philanthropy. D.H. was partly supported by a Long-Term Future Fund grant.
PY - 2024
Y1 - 2024
N2 - We consider the questions of whether or not large language models (LLMs) have beliefs, and, if they do, how we might measure them. First, we consider whether or not we should expect LLMs to have something like beliefs in the first place. We consider some recent arguments aiming to show that LLMs cannot have beliefs. We show that these arguments are misguided. We provide a more productive framing of questions surrounding the status of beliefs in LLMs, and highlight the empirical nature of the problem. With this lesson in hand, we evaluate two existing approaches for measuring the beliefs of LLMs, one due to Azaria and Mitchell (The internal state of an LLM knows when it's lying, 2023) and the other to Burns et al. (Discovering latent knowledge in language models without supervision, 2022). Moving from the armchair to the desk chair, we provide empirical results that show that these methods fail to generalize in very basic ways. We then argue that, even if LLMs have beliefs, these methods are unlikely to be successful for conceptual reasons. Thus, there is still no lie detector for LLMs. We conclude by suggesting some concrete paths for future work.
AB - We consider the questions of whether or not large language models (LLMs) have beliefs, and, if they do, how we might measure them. First, we consider whether or not we should expect LLMs to have something like beliefs in the first place. We consider some recent arguments aiming to show that LLMs cannot have beliefs. We show that these arguments are misguided. We provide a more productive framing of questions surrounding the status of beliefs in LLMs, and highlight the empirical nature of the problem. With this lesson in hand, we evaluate two existing approaches for measuring the beliefs of LLMs, one due to Azaria and Mitchell (The internal state of an LLM knows when it's lying, 2023) and the other to Burns et al. (Discovering latent knowledge in language models without supervision, 2022). Moving from the armchair to the desk chair, we provide empirical results that show that these methods fail to generalize in very basic ways. We then argue that, even if LLMs have beliefs, these methods are unlikely to be successful for conceptual reasons. Thus, there is still no lie detector for LLMs. We conclude by suggesting some concrete paths for future work.
KW - CCS
KW - Interpretability
KW - Large language models
KW - Probes
UR - http://www.scopus.com/inward/record.url?scp=85185103321&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85185103321&partnerID=8YFLogxK
U2 - 10.1007/s11098-023-02094-3
DO - 10.1007/s11098-023-02094-3
M3 - Article
AN - SCOPUS:85185103321
SN - 0031-8116
JO - Philosophical Studies
JF - Philosophical Studies
ER -