Limits of Detecting Text Generated by Large-Scale Language Models

Lav R. Varshney, Nitish Shirish Keskar, Richard Socher

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Some consider large-scale language models that can generate long and coherent pieces of text dangerous, since they may be used in misinformation campaigns. Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated. We show that error exponents for particular language models are bounded in terms of their perplexity, a standard measure of language generation performance. Under the assumption that human language is stationary and ergodic, the formulation is extended from considering specific language models to considering maximum likelihood language models, among the class of k-order Markov approximations; error probabilities are characterized. Some discussion of incorporating semantic side information is also given.
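
The abstract's formulation can be illustrated with a small worked sketch (background illustration only, not code from the paper): casting detection as a binary hypothesis test between a "genuine" text distribution P and a "generated" distribution Q, the optimal detector by the Neyman-Pearson lemma is a log-likelihood-ratio test, and by the Chernoff-Stein lemma the best achievable type-II error exponent is the relative entropy D(P‖Q), which equals cross-entropy (i.e., log-perplexity) minus entropy. The toy unigram models and all names below are hypothetical.

```python
import math

# Hedged toy sketch (hypothetical, not the paper's code): detecting
# generated text via a log-likelihood-ratio test between two assumed
# token distributions. The unigram models below are made up solely to
# keep the example self-contained and runnable.

P = {"the": 0.4, "cat": 0.3, "sat": 0.3}  # H0: genuine text drawn from P
Q = {"the": 0.5, "cat": 0.1, "sat": 0.4}  # H1: generated text drawn from Q

def normalized_llr(tokens):
    """Per-token log-likelihood ratio log P(x)/Q(x); large values favor 'genuine'."""
    return sum(math.log(P[t] / Q[t]) for t in tokens) / len(tokens)

def classify(tokens, threshold=0.0):
    """Neyman-Pearson-style decision rule with a fixed threshold."""
    return "genuine" if normalized_llr(tokens) > threshold else "generated"

# Chernoff-Stein connection: the best type-II error exponent is D(P||Q),
# which equals cross-entropy H(P,Q) minus entropy H(P), i.e., the
# log-perplexity of Q on P-distributed text minus the entropy of P.
def kl_divergence(P, Q):
    return sum(p * math.log(p / Q[t]) for t, p in P.items())

def cross_entropy(P, Q):
    return -sum(p * math.log(Q[t]) for t, p in P.items())

def entropy(P):
    return -sum(p * math.log(p) for p in P.values())

if __name__ == "__main__":
    sample = ["the", "cat", "sat", "the"]
    print(classify(sample))
    # Identity underlying the perplexity bound: D(P||Q) = H(P,Q) - H(P).
    assert abs(kl_divergence(P, Q) - (cross_entropy(P, Q) - entropy(P))) < 1e-12
```

The final assertion checks the identity the abstract's perplexity bound rests on: D(P‖Q) = H(P, Q) − H(P), so as a model's perplexity on genuine text approaches the entropy of the source, the achievable error exponent for any detector shrinks toward zero.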

Original language: English (US)
Title of host publication: 2020 Information Theory and Applications Workshop, ITA 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781728141909
DOIs
State: Published - Feb 2 2020
Externally published: Yes
Event: 2020 Information Theory and Applications Workshop, ITA 2020 - San Diego, United States
Duration: Feb 2 2020 – Feb 7 2020

Publication series

Name: 2020 Information Theory and Applications Workshop, ITA 2020

Conference

Conference: 2020 Information Theory and Applications Workshop, ITA 2020
Country/Territory: United States
City: San Diego
Period: 2/2/20 – 2/7/20

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems and Management
  • Control and Optimization
