TY - GEN
T1 - Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems
T2 - 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2022
AU - FitzGerald, Jack
AU - Ananthakrishnan, Shankar
AU - Arkoudas, Konstantine
AU - Bernardi, Davide
AU - Bhagia, Abhishek
AU - Delli Bovi, Claudio
AU - Cao, Jin
AU - Chada, Rakesh
AU - Chauhan, Amit
AU - Chen, Luoxin
AU - Dwarakanath, Anurag
AU - Dwivedi, Satyam
AU - Gojayev, Turan
AU - Gopalakrishnan, Karthik
AU - Gueudre, Thomas
AU - Hakkani-Tur, Dilek
AU - Hamza, Wael
AU - Hueser, Jonathan J.
AU - Jose, Kevin Martin
AU - Khan, Haidar
AU - Liu, Beiye
AU - Lu, Jianhua
AU - Manzotti, Alessandro
AU - Natarajan, Pradeep
AU - Owczarzak, Karolina
AU - Oz, Gokmen
AU - Palumbo, Enrico
AU - Peris, Charith
AU - Prakash, Chandana Satya
AU - Rawls, Stephen
AU - Rosenbaum, Andy
AU - Shenoy, Anjali
AU - Soltan, Saleh
AU - Sridhar, Mukund Harakere
AU - Tan, Lizhen
AU - Triefenbach, Fabian
AU - Wei, Pan
AU - Yu, Haiyang
AU - Zheng, Shuai
AU - Tur, Gokhan
AU - Natarajan, Prem
N1 - Publisher Copyright:
© 2022 Owner/Author.
PY - 2022/8/14
Y1 - 2022/8/14
N2 - We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into smaller models ranging from 17M to 170M parameters, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system. Though we train using 70% spoken-form data, our teacher models perform comparably to XLM-R and mT5 when evaluated on the written-form Cross-lingual Natural Language Inference (XNLI) corpus. We perform a second stage of pretraining on our teacher models using in-domain data from our system, improving error rates by 3.86% relative for intent classification and 7.01% relative for slot filling. We find that even a 170M-parameter model distilled from our Stage 2 teacher model has 2.88% better intent classification and 7.69% better slot filling error rates when compared to the 2.3B-parameter teacher trained only on public data (Stage 1), emphasizing the importance of in-domain data for pretraining. When evaluated offline using labeled NLU data, our 17M-parameter Stage 2 distilled model outperforms both XLM-R Base (85M params) and DistilBERT (42M params) by 4.23% and 6.14%, respectively. Finally, we present results from a full virtual assistant experimentation platform, where we find that models trained using our pretraining and distillation pipeline outperform models distilled from 85M-parameter teachers by 3.74%-4.91% on an automatic measurement of full-system user dissatisfaction.
AB - We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into smaller models ranging from 17M to 170M parameters, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system. Though we train using 70% spoken-form data, our teacher models perform comparably to XLM-R and mT5 when evaluated on the written-form Cross-lingual Natural Language Inference (XNLI) corpus. We perform a second stage of pretraining on our teacher models using in-domain data from our system, improving error rates by 3.86% relative for intent classification and 7.01% relative for slot filling. We find that even a 170M-parameter model distilled from our Stage 2 teacher model has 2.88% better intent classification and 7.69% better slot filling error rates when compared to the 2.3B-parameter teacher trained only on public data (Stage 1), emphasizing the importance of in-domain data for pretraining. When evaluated offline using labeled NLU data, our 17M-parameter Stage 2 distilled model outperforms both XLM-R Base (85M params) and DistilBERT (42M params) by 4.23% and 6.14%, respectively. Finally, we present results from a full virtual assistant experimentation platform, where we find that models trained using our pretraining and distillation pipeline outperform models distilled from 85M-parameter teachers by 3.74%-4.91% on an automatic measurement of full-system user dissatisfaction.
KW - distributed training
KW - knowledge distillation
KW - model pretraining
KW - natural language understanding
KW - self-attention
KW - transformers
KW - virtual assistant
KW - voice a.i.
UR - http://www.scopus.com/inward/record.url?scp=85134245670&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85134245670&partnerID=8YFLogxK
U2 - 10.1145/3534678.3539173
DO - 10.1145/3534678.3539173
M3 - Conference contribution
AN - SCOPUS:85134245670
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 2893
EP - 2902
BT - KDD 2022 - Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
PB - Association for Computing Machinery
Y2 - 14 August 2022 through 18 August 2022
ER -