TY - GEN
T1 - Quick Dense Retrievers Consume KALE
T2 - 4th Workshop on Simple and Efficient Natural Language Processing, SustaiNLP 2023
AU - Campos, Daniel
AU - Magnani, Alessandro
AU - Zhai, ChengXiang
N1 - Publisher Copyright:
© 2023 Proceedings of the Annual Meeting of the Association for Computational Linguistics. All rights reserved.
PY - 2023
Y1 - 2023
N2 - In this paper, we consider the problem of improving the inference latency of language model-based dense retrieval systems by introducing structural compression and model size asymmetry between the context and query encoders. First, we investigate the impact of pre- and post-training compression on the MSMARCO, Natural Questions, TriviaQA, SQuAD, and SciFact datasets, finding that asymmetry between the dual encoders in dense retrieval can lead to improved inference efficiency. Building on this finding, we introduce Kullback-Leibler Alignment of Embeddings (KALE), an efficient and accurate method for increasing the inference efficiency of dense retrieval methods by pruning and aligning the query encoder after training. Specifically, KALE extends traditional knowledge distillation after bi-encoder training, allowing for effective query encoder compression without full retraining or index generation. Using KALE and asymmetric training, we can generate models that exceed the performance of DistilBERT despite having 3x faster inference.
AB - In this paper, we consider the problem of improving the inference latency of language model-based dense retrieval systems by introducing structural compression and model size asymmetry between the context and query encoders. First, we investigate the impact of pre- and post-training compression on the MSMARCO, Natural Questions, TriviaQA, SQuAD, and SciFact datasets, finding that asymmetry between the dual encoders in dense retrieval can lead to improved inference efficiency. Building on this finding, we introduce Kullback-Leibler Alignment of Embeddings (KALE), an efficient and accurate method for increasing the inference efficiency of dense retrieval methods by pruning and aligning the query encoder after training. Specifically, KALE extends traditional knowledge distillation after bi-encoder training, allowing for effective query encoder compression without full retraining or index generation. Using KALE and asymmetric training, we can generate models that exceed the performance of DistilBERT despite having 3x faster inference.
UR - http://www.scopus.com/inward/record.url?scp=85175819813&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85175819813&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85175819813
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 59
EP - 77
BT - 4th Workshop on Simple and Efficient Natural Language Processing, SustaiNLP 2023 - Proceedings of the Workshop
A2 - Moosavi, Nafise Sadat
A2 - Gurevych, Iryna
A2 - Hou, Yufang
A2 - Kim, Gyuwan
A2 - Kim, Young Jin
A2 - Schuster, Tal
A2 - Agrawal, Ameeta
PB - Association for Computational Linguistics (ACL)
Y2 - 13 July 2023
ER -