TY - JOUR
T1 - Identifying overrepresented concepts in gene lists from literature
T2 - A statistical approach based on Poisson mixture model
AU - He, Xin
AU - Sarma, Moushumi S.
AU - Ling, Xu
AU - Chee, Brant
AU - Zhai, Chengxiang
AU - Schatz, Bruce
N1 - Funding Information:
We thank Gene Robinson for helpful discussions on our analysis of honey bee genes, and Todd Littell and David Arcoleo for programming support. This work was supported by the U.S. National Science Foundation under award FIBR-04-25852.
PY - 2010/5/20
Y1 - 2010/5/20
AB - Background: Large-scale genomic studies often identify large gene lists, for example, the genes sharing the same expression patterns. The interpretation of these gene lists is generally achieved by extracting concepts overrepresented in the gene lists. This analysis often depends on manual annotation of genes based on controlled vocabularies, in particular, Gene Ontology (GO). However, the annotation of genes is a labor-intensive process, and the vocabularies are generally incomplete, leaving some important biological domains inadequately covered. Results: We propose a statistical method that uses the primary literature, i.e., free text, as the source to perform overrepresentation analysis. The method is based on a mixture-model statistical framework and addresses the methodological flaws in several existing programs. We implemented this method within a literature mining system, BeeSpace, taking advantage of its analysis environment, and added features that facilitate the interactive analysis of gene sets. Through experimentation with several datasets, we showed that our program can effectively summarize the important conceptual themes of large gene sets, even when traditional GO-based analysis does not yield informative results. Conclusions: We conclude that the current work will provide biologists with a tool that effectively complements the existing ones for overrepresentation analysis from genomic experiments. Our program, Genelist Analyzer, is freely available at: http://workerbee.igb.uiuc.edu:8080/BeeSpace/Search.jsp.
UR - http://www.scopus.com/inward/record.url?scp=77953993713&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77953993713&partnerID=8YFLogxK
U2 - 10.1186/1471-2105-11-272
DO - 10.1186/1471-2105-11-272
M3 - Article
C2 - 20487560
AN - SCOPUS:77953993713
SN - 1471-2105
VL - 11
JO - BMC Bioinformatics
JF - BMC Bioinformatics
M1 - 272
ER -