BabyBERTa: Learning More Grammar With Small-Scale Child-Directed Language

Philip A. Huebner, Elior Sulem, Cynthia Fisher, Dan Roth

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Transformer-based language models have taken the NLP world by storm. However, their potential for addressing important questions in language acquisition research has been largely ignored. In this work, we examined the grammatical knowledge of RoBERTa (Liu et al., 2019) when trained on a 5M word corpus of language acquisition data to simulate the input available to children between the ages of 1 and 6. Using the behavioral probing paradigm, we found that a smaller version of RoBERTa-base that never predicts unmasked tokens, which we term BabyBERTa, acquires grammatical knowledge comparable to that of pre-trained RoBERTa-base - and does so with approximately 15X fewer parameters and 6,000X fewer words. We discuss implications for building more efficient models and the learnability of grammar from input available to children. Lastly, to support research on this front, we release our novel grammar test suite that is compatible with the small vocabulary of child-directed input.
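A minimal sketch, assuming the HuggingFace transformers API, of the two ideas the abstract highlights: a scaled-down RoBERTa configuration and a masking scheme in which the model is only ever asked to predict tokens that were actually replaced by the mask symbol. All layer counts, sizes, and vocabulary figures below are illustrative assumptions, not the configuration reported in the paper.

```python
# Sketch of a BabyBERTa-like setup; hyperparameters are assumptions, not the paper's values.
import torch
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=8192,            # assumed small vocabulary, matching child-directed input
    hidden_size=256,            # assumed; RoBERTa-base uses 768
    num_hidden_layers=8,        # assumed; RoBERTa-base uses 12
    num_attention_heads=8,
    intermediate_size=1024,     # assumed; RoBERTa-base uses 3072
    max_position_embeddings=130,
)
model = RobertaForMaskedLM(config)
print(f"parameters: {model.num_parameters():,}")  # far fewer than RoBERTa-base's ~125M

def mask_tokens(input_ids, mask_token_id, mask_prob=0.15, ignore_index=-100):
    """Unlike the standard 80/10/10 BERT scheme, every selected position is
    replaced by <mask>, so the model never predicts unmasked (unchanged) tokens.
    For brevity, special tokens are not excluded from masking here."""
    labels = input_ids.clone()
    selected = torch.bernoulli(torch.full(input_ids.shape, mask_prob)).bool()
    labels[~selected] = ignore_index        # loss is computed only on masked positions
    corrupted = input_ids.clone()
    corrupted[selected] = mask_token_id
    return corrupted, labels
```

The `mask_tokens` helper is hypothetical and only illustrates the "never predicts unmasked tokens" objective; the released BabyBERTa code may implement masking differently.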

Original language: English (US)
Title of host publication: CoNLL 2021 - 25th Conference on Computational Natural Language Learning, Proceedings
Editors: Arianna Bisazza, Omri Abend
Publisher: Association for Computational Linguistics (ACL)
Pages: 624-646
Number of pages: 23
ISBN (Electronic): 9781955917056
State: Published - 2021
Event: 25th Conference on Computational Natural Language Learning, CoNLL 2021 - Virtual, Online
Duration: Nov 10 2021 - Nov 11 2021

Publication series

Name: CoNLL 2021 - 25th Conference on Computational Natural Language Learning, Proceedings

Conference

Conference: 25th Conference on Computational Natural Language Learning, CoNLL 2021
City: Virtual, Online
Period: 11/10/21 - 11/11/21

ASJC Scopus subject areas

  • Artificial Intelligence
  • Human-Computer Interaction
  • Linguistics and Language
