TY - GEN
T1 - New tools for web-scale N-grams
AU - Lin, Dekang
AU - Church, Kenneth
AU - Ji, Heng
AU - Sekine, Satoshi
AU - Yarowsky, David
AU - Bergsma, Shane
AU - Patil, Kailash
AU - Pitler, Emily
AU - Lathbury, Rachel
AU - Rao, Vikram
AU - Dalwani, Kapil
AU - Narsale, Sushant
N1 - Funding Information:
We gratefully acknowledge Frederick Jelinek and the members of the Center for Language and Speech Processing at Johns Hopkins University for hosting the workshop at which this research was conducted. We thank the IBM/Google Academic Cloud Computing Initiative for providing access to their computing cluster. We also thank the National Science Foundation, Google Research, and the Defense Advanced Research Projects Agency for sponsoring the workshop, and Thorsten Brants, Fernando Pereira and Alfred Spector at Google for their help with providing the new N-gram data.
PY - 2010
Y1 - 2010
AB - While the web provides a fantastic linguistic resource, collecting and processing data at web scale is beyond the reach of most academic laboratories. Previous research has relied on search engines to collect online information, but this is hopelessly inefficient for building large-scale linguistic resources, such as lists of named-entity types or clusters of distributionally similar words. An alternative to processing web-scale text directly is to use the information provided in an N-gram corpus: an efficient compression of large amounts of text that states how often each sequence of words (up to length N) occurs. We propose tools for working with enhanced web-scale N-gram corpora that include richer levels of source annotation, such as part-of-speech tags. We describe a new set of search tools that make use of these tags and collectively lower the barrier for lexical learning and ambiguity resolution at web scale. These tools will allow novel sources of information to be applied to long-standing natural language challenges.
UR - http://www.scopus.com/inward/record.url?scp=84951821679&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84951821679&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84951821679
T3 - Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010
SP - 2221
EP - 2227
BT - Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010
A2 - Tapias, Daniel
A2 - Russo, Irene
A2 - Hamon, Olivier
A2 - Piperidis, Stelios
A2 - Calzolari, Nicoletta
A2 - Choukri, Khalid
A2 - Mariani, Joseph
A2 - Mazo, Helene
A2 - Maegaard, Bente
A2 - Odijk, Jan
A2 - Rosner, Mike
PB - European Language Resources Association (ELRA)
T2 - 7th International Conference on Language Resources and Evaluation, LREC 2010
Y2 - 17 May 2010 through 23 May 2010
ER -