Abstract

Motivation: Multiple sequence alignment is a basic part of much biological research, including phylogeny estimation and protein structure and function prediction. Different alignments on the same set of unaligned sequences are often compared, sometimes in order to assess the accuracy of alignment methods or to infer a consensus alignment from a set of estimated alignments. Three of the standard techniques for comparing alignments, Developer, Modeler and Total Column (TC) scores can be derived through calculations of the set of homologies that the alignments share. However, the brute-force technique for calculating this set is quadratic in the input size. The remaining standard technique, Cline Shift Score, inherently requires quadratic time. Results: In this article, we prove that each of these scores can be computed in linear time, and we present FastSP, a linear-time algorithm for calculating these scores. Even on the largest alignments we explored (one with 50 000 sequences), FastSP completed <2 min and used at most 2 GB of the main memory. The best alternative is qscore, a method whose empirical running time is approximately the same as FastSP when given sufficient memory (at least 8 GB), but whose asymptotic running time has never been theoretically established. In addition, for comparisons of large alignments under lower memory conditions (at most 4 GB of main memory), qscore uses substantial memory (up to 10 GB for the datasets we studied), took more time and failed to analyze the largest datasets.

Original languageEnglish (US)
Article numberbtr553
Pages (from-to)3250-3258
Number of pages9
JournalBioinformatics
Volume27
Issue number23
DOIs
StatePublished - Dec 1 2011

Fingerprint

Linear Time
Alignment
Data storage equipment
Sequence Alignment
Phylogeny
Multiple Sequence Alignment
Consensus
Protein Structure
Linear-time Algorithm
Large Data Sets
Homology
Sufficient
Research
Proteins
Prediction
Alternatives
Datasets

ASJC Scopus subject areas

  • Statistics and Probability
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Computational Mathematics

Cite this

FASTSP : Linear time calculation of alignment accuracy. / Mirarab, Siavash; Warnow, Tandy.

In: Bioinformatics, Vol. 27, No. 23, btr553, 01.12.2011, p. 3250-3258.

Research output: Contribution to journalArticle

Mirarab, Siavash ; Warnow, Tandy. / FASTSP : Linear time calculation of alignment accuracy. In: Bioinformatics. 2011 ; Vol. 27, No. 23. pp. 3250-3258.
@article{2477ab8006824108bbd0e63b4f9312a0,
title = "FASTSP: Linear time calculation of alignment accuracy",
abstract = "Motivation: Multiple sequence alignment is a basic part of much biological research, including phylogeny estimation and protein structure and function prediction. Different alignments on the same set of unaligned sequences are often compared, sometimes in order to assess the accuracy of alignment methods or to infer a consensus alignment from a set of estimated alignments. Three of the standard techniques for comparing alignments, Developer, Modeler and Total Column (TC) scores can be derived through calculations of the set of homologies that the alignments share. However, the brute-force technique for calculating this set is quadratic in the input size. The remaining standard technique, Cline Shift Score, inherently requires quadratic time. Results: In this article, we prove that each of these scores can be computed in linear time, and we present FastSP, a linear-time algorithm for calculating these scores. Even on the largest alignments we explored (one with 50 000 sequences), FastSP completed <2 min and used at most 2 GB of the main memory. The best alternative is qscore, a method whose empirical running time is approximately the same as FastSP when given sufficient memory (at least 8 GB), but whose asymptotic running time has never been theoretically established. In addition, for comparisons of large alignments under lower memory conditions (at most 4 GB of main memory), qscore uses substantial memory (up to 10 GB for the datasets we studied), took more time and failed to analyze the largest datasets.",
author = "Siavash Mirarab and Tandy Warnow",
year = "2011",
month = "12",
day = "1",
doi = "10.1093/bioinformatics/btr553",
language = "English (US)",
volume = "27",
pages = "3250--3258",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "23",

}

TY - JOUR

T1 - FASTSP

T2 - Linear time calculation of alignment accuracy

AU - Mirarab, Siavash

AU - Warnow, Tandy

PY - 2011/12/1

Y1 - 2011/12/1

N2 - Motivation: Multiple sequence alignment is a basic part of much biological research, including phylogeny estimation and protein structure and function prediction. Different alignments on the same set of unaligned sequences are often compared, sometimes in order to assess the accuracy of alignment methods or to infer a consensus alignment from a set of estimated alignments. Three of the standard techniques for comparing alignments, Developer, Modeler and Total Column (TC) scores can be derived through calculations of the set of homologies that the alignments share. However, the brute-force technique for calculating this set is quadratic in the input size. The remaining standard technique, Cline Shift Score, inherently requires quadratic time. Results: In this article, we prove that each of these scores can be computed in linear time, and we present FastSP, a linear-time algorithm for calculating these scores. Even on the largest alignments we explored (one with 50 000 sequences), FastSP completed <2 min and used at most 2 GB of the main memory. The best alternative is qscore, a method whose empirical running time is approximately the same as FastSP when given sufficient memory (at least 8 GB), but whose asymptotic running time has never been theoretically established. In addition, for comparisons of large alignments under lower memory conditions (at most 4 GB of main memory), qscore uses substantial memory (up to 10 GB for the datasets we studied), took more time and failed to analyze the largest datasets.

AB - Motivation: Multiple sequence alignment is a basic part of much biological research, including phylogeny estimation and protein structure and function prediction. Different alignments on the same set of unaligned sequences are often compared, sometimes in order to assess the accuracy of alignment methods or to infer a consensus alignment from a set of estimated alignments. Three of the standard techniques for comparing alignments, Developer, Modeler and Total Column (TC) scores can be derived through calculations of the set of homologies that the alignments share. However, the brute-force technique for calculating this set is quadratic in the input size. The remaining standard technique, Cline Shift Score, inherently requires quadratic time. Results: In this article, we prove that each of these scores can be computed in linear time, and we present FastSP, a linear-time algorithm for calculating these scores. Even on the largest alignments we explored (one with 50 000 sequences), FastSP completed <2 min and used at most 2 GB of the main memory. The best alternative is qscore, a method whose empirical running time is approximately the same as FastSP when given sufficient memory (at least 8 GB), but whose asymptotic running time has never been theoretically established. In addition, for comparisons of large alignments under lower memory conditions (at most 4 GB of main memory), qscore uses substantial memory (up to 10 GB for the datasets we studied), took more time and failed to analyze the largest datasets.

UR - http://www.scopus.com/inward/record.url?scp=82255185493&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=82255185493&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btr553

DO - 10.1093/bioinformatics/btr553

M3 - Article

C2 - 21984754

AN - SCOPUS:82255185493

VL - 27

SP - 3250

EP - 3258

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 23

M1 - btr553

ER -