HIPPI: Highly accurate protein family classification with ensembles of HMMs

Nam phuong Nguyen, Michael Nute, Siavash Mirarab, Tandy Warnow

Research output: Contribution to journal › Article

Abstract

Background: Given a new biological sequence, detecting membership in a known family is a basic step in many bioinformatics analyses, with applications to protein structure and function prediction and metagenomic taxon identification and abundance profiling, among others. Yet family identification of sequences that are distantly related to sequences in public databases or that are fragmentary remains one of the more difficult analytical problems in bioinformatics.

Results: We present a new technique for family identification called HIPPI (Hierarchical Profile Hidden Markov Models for Protein family Identification). HIPPI uses a novel technique to represent a multiple sequence alignment for a given protein family or superfamily by an ensemble of profile hidden Markov models computed using HMMER. An evaluation of HIPPI on the Pfam database shows that HIPPI has better overall precision and recall than blastp, HMMER, and pipelines based on HHsearch, and maintains good accuracy even for fragmentary query sequences and for protein families with low average pairwise sequence identity, both conditions where other methods degrade in accuracy.

Conclusion: HIPPI provides accurate protein family identification and is robust to difficult model conditions. Our results, combined with observations from previous studies, show that ensembles of profile Hidden Markov models can better represent multiple sequence alignments than a single profile Hidden Markov model, and thus can improve downstream analyses for various bioinformatic tasks. Further research is needed to determine the best practices for building the ensemble of profile Hidden Markov models. HIPPI is available on GitHub at https://github.com/smirarab/sepp.

Original language: English (US)
Article number: 765
Journal: BMC Genomics
Volume: 17
DOI: 10.1186/s12864-016-3097-0
State: Published - Nov 11 2016

Fingerprint

Computational Biology
Sequence Alignment
Proteins
Databases
Metagenomics
Practice Guidelines
Research

Keywords

  • Ensemble of profile hidden Markov models
  • Pfam
  • Protein family identification

ASJC Scopus subject areas

  • Biotechnology
  • Genetics

Cite this

HIPPI: Highly accurate protein family classification with ensembles of HMMs. / Nguyen, Nam phuong; Nute, Michael; Mirarab, Siavash; Warnow, Tandy.

In: BMC Genomics, Vol. 17, 765, 11.11.2016.

@article{f839b5d43e454a3f9ccb83731367905f,
title = "HIPPI: Highly accurate protein family classification with ensembles of HMMs",
keywords = "Ensemble of profile hidden Markov models, Pfam, Protein family identification",
author = "Nguyen, {Nam phuong} and Michael Nute and Siavash Mirarab and Tandy Warnow",
year = "2016",
month = "11",
day = "11",
doi = "10.1186/s12864-016-3097-0",
language = "English (US)",
volume = "17",
journal = "BMC Genomics",
issn = "1471-2164",
publisher = "BioMed Central",

}

TY - JOUR

T1 - HIPPI

T2 - Highly accurate protein family classification with ensembles of HMMs

AU - Nguyen, Nam phuong

AU - Nute, Michael

AU - Mirarab, Siavash

AU - Warnow, Tandy

PY - 2016/11/11

Y1 - 2016/11/11

KW - Ensemble of profile hidden Markov models

KW - Pfam

KW - Protein family identification

UR - http://www.scopus.com/inward/record.url?scp=85000461282&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85000461282&partnerID=8YFLogxK

U2 - 10.1186/s12864-016-3097-0

DO - 10.1186/s12864-016-3097-0

M3 - Article

C2 - 28185571

AN - SCOPUS:85000461282

VL - 17

JO - BMC Genomics

JF - BMC Genomics

SN - 1471-2164

M1 - 765

ER -