Abstract
Research on neural IR has so far focused primarily on standard supervised learning settings, where it outperforms traditional term-matching baselines. Many practical use cases of such models, however, may involve previously unseen target domains. In this paper, we propose to improve the out-of-domain generalization of Dense Passage Retrieval (DPR), a popular choice for neural IR, through synthetic data augmentation in the source domain only. We empirically show that pre-finetuning DPR with additional synthetic data in its source domain (Wikipedia), which we generate using a fine-tuned sequence-to-sequence generator, can be a low-cost yet effective first step toward generalization. Across five different test sets, our augmented model shows more robust performance than DPR in both in-domain and zero-shot out-of-domain evaluation.
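The augmentation step the abstract describes can be pictured in two parts: a sequence-to-sequence generator produces synthetic queries for Wikipedia passages, and the resulting (query, passage) pairs are then used to pre-finetune DPR. Below is a minimal sketch of the query-generation part in Python, assuming a Hugging Face passage-to-question checkpoint as the generator; the checkpoint name is a stand-in, not the paper's actual generator or filtering pipeline.

```python
# Sketch of synthetic query generation for DPR pre-finetuning.
# Assumption: any passage-to-question seq2seq checkpoint can play the
# generator role here; the one below is a publicly available stand-in,
# not the generator used in the paper.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "BeIR/query-gen-msmarco-t5-base-v1"  # stand-in generator

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def generate_queries(passage: str, num_queries: int = 3) -> list[str]:
    """Sample synthetic queries for one source-domain passage."""
    inputs = tokenizer(
        passage, truncation=True, max_length=512, return_tensors="pt"
    )
    outputs = model.generate(
        **inputs,
        do_sample=True,          # sampling yields diverse queries per passage
        top_p=0.95,
        num_return_sequences=num_queries,
        max_length=64,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

passage = (
    "Dense Passage Retrieval (DPR) encodes questions and passages into a "
    "shared vector space and retrieves passages by inner-product search."
)
for query in generate_queries(passage):
    # Each (query, passage) pair becomes a positive training example
    # for pre-finetuning the DPR encoders.
    print(query)
```

In this setup, each passage contributes several synthetic positives, and the augmented pairs are mixed into (or trained before) the original supervised DPR data, which is what "pre-finetuning" refers to in the abstract.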
| Original language | English (US) |
| --- | --- |
| Pages (from-to) | 1065-1070 |
| Number of pages | 6 |
| Journal | Proceedings - International Conference on Computational Linguistics, COLING |
| Volume | 29 |
| Issue number | 1 |
| State | Published - 2022 |
| Event | 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, Oct 12-17, 2022 |
ASJC Scopus subject areas
- Computational Theory and Mathematics
- Computer Science Applications
- Theoretical Computer Science