Abstract

Background: Most statistical methods for phylogenetic estimation in use today treat a gap (generally representing an insertion or deletion, i.e., indel) within the input sequence alignment as missing data. However, the statistical properties of this treatment of indels have not been fully investigated.

Results: We prove that maximum likelihood phylogeny estimation, treating indels as missing data, can be statistically inconsistent for a general (and rather simple) model of sequence evolution, even when given the true alignment. Therefore, accurate phylogeny estimation cannot be guaranteed for maximum likelihood analyses, even given arbitrarily long sequences, when indels are present and treated as missing data.

Conclusions: Our result shows that the standard statistical techniques used to estimate phylogenies from sequence alignments may have unfavorable statistical properties, even when the sequence alignment is accurate and the assumed substitution model matches the generation model. This suggests that the recent research focus on developing statistical methods that treat indel events properly is an important direction for phylogeny estimation.

Original language: English (US)
Article number: ecurrents.RRN1308
Journal: PLoS Currents
DOI: 10.1371/currents.RRN1308
State: Published - Oct 19 2012

Fingerprint

Phylogeny
Sequence Alignment
Research

ASJC Scopus subject areas

  • Medicine (miscellaneous)

Cite this

@article{bf6342b9bc8c4a1684318d5cee31e95a,
title = "Standard maximum likelihood analyses of alignments with gaps can be statistically inconsistent",
abstract = "Background: Most statistical methods for phylogenetic estimation in use today treat a gap (generally representing an insertion or deletion, i.e., indel) within the input sequence alignment as missing data. However, the statistical properties of this treatment of indels have not been fully investigated. Results: We prove that maximum likelihood phylogeny estimation, treating indels as missing data, can be statistically inconsistent for a general (and rather simple) model of sequence evolution, even when given the true alignment. Therefore, accurate phylogeny estimation cannot be guaranteed for maximum likelihood analyses, even given arbitrarily long sequences, when indels are present and treated as missing data. Conclusions: Our result shows that the standard statistical techniques used to estimate phylogenies from sequence alignments may have unfavorable statistical properties, even when the sequence alignment is accurate and the assumed substitution model matches the generation model. This suggests that the recent research focus on developing statistical methods that treat indel events properly is an important direction for phylogeny estimation.",
author = "Tandy Warnow",
year = "2012",
month = "10",
day = "19",
doi = "10.1371/currents.RRN1308",
language = "English (US)",
journal = "PLoS Currents",
issn = "2157-3999",
publisher = "Public Library of Science",
}

TY - JOUR

T1 - Standard maximum likelihood analyses of alignments with gaps can be statistically inconsistent

AU - Warnow, Tandy

PY - 2012/10/19

Y1 - 2012/10/19

AB - Background: Most statistical methods for phylogenetic estimation in use today treat a gap (generally representing an insertion or deletion, i.e., indel) within the input sequence alignment as missing data. However, the statistical properties of this treatment of indels have not been fully investigated. Results: We prove that maximum likelihood phylogeny estimation, treating indels as missing data, can be statistically inconsistent for a general (and rather simple) model of sequence evolution, even when given the true alignment. Therefore, accurate phylogeny estimation cannot be guaranteed for maximum likelihood analyses, even given arbitrarily long sequences, when indels are present and treated as missing data. Conclusions: Our result shows that the standard statistical techniques used to estimate phylogenies from sequence alignments may have unfavorable statistical properties, even when the sequence alignment is accurate and the assumed substitution model matches the generation model. This suggests that the recent research focus on developing statistical methods that treat indel events properly is an important direction for phylogeny estimation.

UR - http://www.scopus.com/inward/record.url?scp=84867479202&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84867479202&partnerID=8YFLogxK

U2 - 10.1371/currents.RRN1308

DO - 10.1371/currents.RRN1308

M3 - Article

AN - SCOPUS:84867479202

JO - PLoS Currents

JF - PLoS Currents

SN - 2157-3999

M1 - ecurrents.RRN1308

ER -