TY - JOUR
T1 - Compressive genomics for protein databases
AU - Daniels, Noah M.
AU - Gallant, Andrew
AU - Peng, Jian
AU - Cowen, Lenore J.
AU - Baym, Michael
AU - Berger, Bonnie
N1 - Funding Information: This work was partially supported by a grant from the Simons Foundation and the NIH (to B.B.). N.D., A.G. and L.C. were funded in part by NIH grant R01GM080330. M.B. was funded in part by an NSF MSPRF grant.
PY - 2013/7/1
Y1 - 2013/7/1
N2 - Motivation: The exponential growth of protein sequence databases has increasingly made the fundamental question of searching for homologs a computational bottleneck. The amount of unique data, however, is not growing nearly as fast; we can exploit this fact to greatly accelerate homology search. Acceleration of programs in the popular PSI/DELTA-BLAST family of tools will speed up not only homology search directly but also the huge collection of other current programs that primarily interact with large protein databases via precisely these tools. Results: We introduce a suite of homology search tools, powered by compressively accelerated protein BLAST (CaBLASTP), which are significantly faster than, and comparably accurate to, all known state-of-the-art tools, including HHblits, DELTA-BLAST and PSI-BLAST. Further, our tools are implemented in a manner that allows direct substitution into existing analysis pipelines. The key idea is that we introduce a local similarity-based compression scheme that allows us to operate directly on the compressed data. Importantly, CaBLASTP's runtime scales almost linearly in the amount of unique data, as opposed to current BLASTP variants, which scale linearly in the size of the full protein database being searched. Our compressive algorithms will speed up many tasks, such as protein structure prediction and orthology mapping, which rely heavily on homology search.
AB - Motivation: The exponential growth of protein sequence databases has increasingly made the fundamental question of searching for homologs a computational bottleneck. The amount of unique data, however, is not growing nearly as fast; we can exploit this fact to greatly accelerate homology search. Acceleration of programs in the popular PSI/DELTA-BLAST family of tools will speed up not only homology search directly but also the huge collection of other current programs that primarily interact with large protein databases via precisely these tools. Results: We introduce a suite of homology search tools, powered by compressively accelerated protein BLAST (CaBLASTP), which are significantly faster than, and comparably accurate to, all known state-of-the-art tools, including HHblits, DELTA-BLAST and PSI-BLAST. Further, our tools are implemented in a manner that allows direct substitution into existing analysis pipelines. The key idea is that we introduce a local similarity-based compression scheme that allows us to operate directly on the compressed data. Importantly, CaBLASTP's runtime scales almost linearly in the amount of unique data, as opposed to current BLASTP variants, which scale linearly in the size of the full protein database being searched. Our compressive algorithms will speed up many tasks, such as protein structure prediction and orthology mapping, which rely heavily on homology search.
UR - http://www.scopus.com/inward/record.url?scp=84879920277&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84879920277&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btt214
DO - 10.1093/bioinformatics/btt214
M3 - Article
C2 - 23812995
AN - SCOPUS:84879920277
SN - 1367-4803
VL - 29
SP - i283-i290
JO - Bioinformatics
JF - Bioinformatics
IS - 13
ER -