TY - GEN
T1 - A PAC-Bayesian approach to minimum perplexity language modeling
AU - Bharadwaj, Sujeeth
AU - Hasegawa-Johnson, Mark
PY - 2014
Y1 - 2014
AB - Despite the overwhelming use of statistical language models in speech recognition, machine translation, and several other domains, few high-probability guarantees exist on their generalization error. In this paper, we bound the test set perplexity of two popular language models, the n-gram model and class-based n-grams, using PAC-Bayesian theorems for unsupervised learning. We extend the bound to sequence clustering, wherein classes represent longer context such as phrases. The new bound is dominated by the maximum number of sequences represented by each cluster, which is polynomial in the vocabulary size. We show that we can still encourage small-sample generalization by sparsifying the cluster assignment probabilities. We incorporate our bound into an efficient HMM-based sequence clustering algorithm and validate the theory with empirical results on the Resource Management corpus.
UR - http://www.scopus.com/inward/record.url?scp=84959879503&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84959879503&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84959879503
T3 - COLING 2014 - 25th International Conference on Computational Linguistics, Proceedings of COLING 2014: Technical Papers
SP - 130
EP - 140
BT - COLING 2014 - 25th International Conference on Computational Linguistics, Proceedings of COLING 2014
PB - Association for Computational Linguistics, ACL Anthology
T2 - 25th International Conference on Computational Linguistics, COLING 2014
Y2 - 23 August 2014 through 29 August 2014
ER -