Statistically based postprocessing of phylogenetic analysis by clustering

Cara Stockham, Li San Wang, Tandy Warnow

Research output: Contribution to journal › Article › peer-review


Motivation: Phylogenetic analyses often produce thousands of candidate trees. Biologists resolve the conflict by computing the consensus of these trees. Single-tree consensus methods used for postprocessing can be unsatisfactory because of their inherent limitations.

Results: In this paper we present an alternative approach that applies clustering algorithms to the set of candidate trees. We propose bicriterion problems, in particular using the concept of information loss, and new consensus trees called characteristic trees that minimize the information loss. Our empirical study using four biological datasets shows that our approach provides a significant improvement in the information content, while adding only a small amount of complexity. Furthermore, the consensus trees we obtain for each of our large clusters are more resolved than the single-tree consensus trees. We also provide some initial progress on theoretical questions that arise in this context.

Availability: Software available upon request from the authors. The agglomerative clustering is implemented using Matlab (MathWorks, 2000) with the Statistics Toolbox. The Robinson-Foulds distance matrices and the strict consensus trees are computed using PAUP (Swofford, 2001) and Daniel Huson's tree library on Intel Pentium workstations running Debian Linux.
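The pipeline named in the abstract (Robinson-Foulds distances, agglomerative clustering, one strict consensus tree per cluster) can be illustrated with a minimal Python sketch. This is not the authors' software, which uses Matlab and PAUP. It assumes each tree is given as a set of non-trivial splits, it assumes complete linkage (the abstract does not name a linkage criterion), and it does not implement characteristic trees or the information-loss criterion. The toy trees and the helper names `rf_distance`, `strict_consensus`, and `agglomerate` are hypothetical.

```python
from itertools import combinations


def rf_distance(a, b):
    """Robinson-Foulds distance: number of splits present in exactly one tree."""
    return len(a ^ b)


def strict_consensus(trees):
    """Strict consensus: the splits shared by every tree in the collection."""
    return frozenset.intersection(*trees)


def agglomerate(trees, k):
    """Naive agglomerative clustering down to k clusters.

    Complete linkage is an assumption made for this sketch.
    Returns a list of clusters, each a list of tree indices.
    """
    clusters = [[i] for i in range(len(trees))]

    def linkage(c1, c2):
        return max(rf_distance(trees[i], trees[j]) for i in c1 for j in c2)

    while len(clusters) > k:
        # Merge the pair of clusters with the smallest linkage distance.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[x] += clusters[y]
        del clusters[y]
    return clusters


def split(*taxa):
    """One side of a bipartition (by convention, the side without taxon F)."""
    return frozenset(taxa)


# Four hypothetical unrooted trees on taxa A-F, each given by its
# three non-trivial splits.
trees = [
    frozenset({split("A", "B"), split("A", "B", "C"), split("A", "B", "C", "D")}),
    frozenset({split("A", "B"), split("A", "B", "C"), split("A", "B", "C", "E")}),
    frozenset({split("A", "B"), split("C", "D"), split("C", "D", "E")}),
    frozenset({split("A", "B"), split("C", "E"), split("C", "D", "E")}),
]

single = strict_consensus(trees)
clusters = agglomerate(trees, k=2)
per_cluster = [strict_consensus([trees[i] for i in c]) for c in clusters]
```

In this toy input the single strict consensus keeps one split, while each of the two cluster consensus trees keeps two, which mirrors the abstract's observation that per-cluster consensus trees are more resolved than the single-tree consensus.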

Original language: English (US)
Pages (from-to): S285-S293
Issue number: SUPPL. 1
State: Published - 2002
Externally published: Yes


  • Clustering
  • Consensus methods
  • Information theory
  • Maximum parsimony
  • Phylogenetics

ASJC Scopus subject areas

  • Clinical Biochemistry
  • Computer Science Applications
  • Computational Theory and Mathematics

