Multiple sequence alignment: A major challenge to large-scale phylogenetics

Kevin Liu, C. Randal Linder, Tandy Warnow

Research output: Contribution to journal › Article

Abstract

Over the last decade, dramatic advances have been made in developing methods for large-scale phylogeny estimation, so that it is now feasible for investigators with moderate computational resources to obtain reasonable solutions to maximum likelihood and maximum parsimony, even for datasets with a few thousand sequences. There has also been progress on developing methods for multiple sequence alignment, so that greater alignment accuracy (and subsequent improvement in phylogenetic accuracy) is now possible through automated methods. However, these methods have not been tested under conditions that reflect properties of datasets confronted by large-scale phylogenetic estimation projects. In this paper we report on a study that compares several alignment methods on a benchmark collection of nucleotide sequence datasets of up to 78,132 sequences. We show that as the number of sequences increases, the number of alignment methods that can analyze the datasets decreases. Furthermore, the most accurate alignment methods are unable to analyze the very largest datasets we studied, so that only moderately accurate alignment methods can be used on the largest datasets. As a result, alignments computed for large datasets have relatively large error rates, and maximum likelihood phylogenies computed on these alignments also have high error rates. Therefore, the estimation of highly accurate multiple sequence alignments is a major challenge for Tree of Life projects, and more generally for large-scale systematics studies.

Original language: English (US)
Article number: ecurrents.RRN1198
Journal: PLoS Currents
Issue number: NOV
DOI: 10.1371/currents.RRN1198
State: Published - Dec 1 2010
Externally published: Yes

Fingerprint

Sequence Alignment
Phylogeny
Benchmarking
Datasets
Research Personnel

ASJC Scopus subject areas

  • Medicine (miscellaneous)

Cite this

Multiple sequence alignment: A major challenge to large-scale phylogenetics. / Liu, Kevin; Linder, C. Randal; Warnow, Tandy.

In: PLoS Currents, No. NOV, ecurrents.RRN1198, 01.12.2010.

@article{8165386c38034a5e82381214aeccfc0f,
title = "Multiple sequence alignment: A major challenge to large-scale phylogenetics",
abstract = "Over the last decade, dramatic advances have been made in developing methods for large-scale phylogeny estimation, so that it is now feasible for investigators with moderate computational resources to obtain reasonable solutions to maximum likelihood and maximum parsimony, even for datasets with a few thousand sequences. There has also been progress on developing methods for multiple sequence alignment, so that greater alignment accuracy (and subsequent improvement in phylogenetic accuracy) is now possible through automated methods. However, these methods have not been tested under conditions that reflect properties of datasets confronted by large-scale phylogenetic estimation projects. In this paper we report on a study that compares several alignment methods on a benchmark collection of nucleotide sequence datasets of up to 78,132 sequences. We show that as the number of sequences increases, the number of alignment methods that can analyze the datasets decreases. Furthermore, the most accurate alignment methods are unable to analyze the very largest datasets we studied, so that only moderately accurate alignment methods can be used on the largest datasets. As a result, alignments computed for large datasets have relatively large error rates, and maximum likelihood phylogenies computed on these alignments also have high error rates. Therefore, the estimation of highly accurate multiple sequence alignments is a major challenge for Tree of Life projects, and more generally for large-scale systematics studies.",
author = "Liu, Kevin and Linder, C. Randal and Warnow, Tandy",
year = "2010",
month = "12",
day = "1",
doi = "10.1371/currents.RRN1198",
language = "English (US)",
journal = "PLoS Currents",
issn = "2157-3999",
publisher = "Public Library of Science",
number = "NOV",
}

TY - JOUR

T1 - Multiple sequence alignment: A major challenge to large-scale phylogenetics

AU - Liu, Kevin

AU - Linder, C. Randal

AU - Warnow, Tandy

PY - 2010/12/1

Y1 - 2010/12/1

N2 - Over the last decade, dramatic advances have been made in developing methods for large-scale phylogeny estimation, so that it is now feasible for investigators with moderate computational resources to obtain reasonable solutions to maximum likelihood and maximum parsimony, even for datasets with a few thousand sequences. There has also been progress on developing methods for multiple sequence alignment, so that greater alignment accuracy (and subsequent improvement in phylogenetic accuracy) is now possible through automated methods. However, these methods have not been tested under conditions that reflect properties of datasets confronted by large-scale phylogenetic estimation projects. In this paper we report on a study that compares several alignment methods on a benchmark collection of nucleotide sequence datasets of up to 78,132 sequences. We show that as the number of sequences increases, the number of alignment methods that can analyze the datasets decreases. Furthermore, the most accurate alignment methods are unable to analyze the very largest datasets we studied, so that only moderately accurate alignment methods can be used on the largest datasets. As a result, alignments computed for large datasets have relatively large error rates, and maximum likelihood phylogenies computed on these alignments also have high error rates. Therefore, the estimation of highly accurate multiple sequence alignments is a major challenge for Tree of Life projects, and more generally for large-scale systematics studies.

AB - Over the last decade, dramatic advances have been made in developing methods for large-scale phylogeny estimation, so that it is now feasible for investigators with moderate computational resources to obtain reasonable solutions to maximum likelihood and maximum parsimony, even for datasets with a few thousand sequences. There has also been progress on developing methods for multiple sequence alignment, so that greater alignment accuracy (and subsequent improvement in phylogenetic accuracy) is now possible through automated methods. However, these methods have not been tested under conditions that reflect properties of datasets confronted by large-scale phylogenetic estimation projects. In this paper we report on a study that compares several alignment methods on a benchmark collection of nucleotide sequence datasets of up to 78,132 sequences. We show that as the number of sequences increases, the number of alignment methods that can analyze the datasets decreases. Furthermore, the most accurate alignment methods are unable to analyze the very largest datasets we studied, so that only moderately accurate alignment methods can be used on the largest datasets. As a result, alignments computed for large datasets have relatively large error rates, and maximum likelihood phylogenies computed on these alignments also have high error rates. Therefore, the estimation of highly accurate multiple sequence alignments is a major challenge for Tree of Life projects, and more generally for large-scale systematics studies.

UR - http://www.scopus.com/inward/record.url?scp=80053194704&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80053194704&partnerID=8YFLogxK

U2 - 10.1371/currents.RRN1198

DO - 10.1371/currents.RRN1198

M3 - Article

C2 - 21113338

AN - SCOPUS:80053194704

JO - PLoS Currents

JF - PLoS Currents

SN - 2157-3999

IS - NOV

M1 - ecurrents.RRN1198

ER -