Phylogenomics from whole genome sequences using aTRAM

Julie M. Allen, Bret Boyd, Nam Phuong Nguyen, Pranjal Vachaspati, Tandy Warnow, Daisie I. Huang, Patrick G.S. Grady, Kayce C. Bell, Quentin C.B. Cronk, Lawrence Mugisha, Barry R. Pittendrigh, M. Soledad Leonardi, David L. Reed, Kevin Paul Johnson

Research output: Contribution to journal (Article)

Abstract

Novel sequencing technologies are rapidly expanding the size of data sets that can be applied to phylogenetic studies. Currently the most commonly used phylogenomic approaches involve some form of genome reduction. While these approaches make assembling phylogenomic data sets more economical for organisms with large genomes, they reduce the genomic coverage and thereby the long-term utility of the data. Currently, for organisms with moderate to small genomes (<1000 Mbp) it is feasible to sequence the entire genome at modest coverage (10-30×). Computational challenges for handling these large data sets can be alleviated by assembling targeted reads, rather than assembling the entire genome, to produce a phylogenomic data matrix. Here we demonstrate the use of automated Target Restricted Assembly Method (aTRAM) to assemble 1107 single-copy ortholog genes from whole genome sequencing of sucking lice (Anoplura) and out-groups. We developed a pipeline to extract exon sequences from the aTRAM assemblies by annotating them with respect to the original target protein. We aligned these protein sequences with the inferred amino acids and then performed phylogenetic analyses on both the concatenated matrix of genes and on each gene separately in a coalescent analysis. Finally, we tested the limits of successful assembly in aTRAM by assembling 100 genes from close- to distantly related taxa at high to low levels of coverage. Both the concatenated analysis and the coalescent-based analysis produced the same tree topology, which was consistent with previously published results and resolved weakly supported nodes. These results demonstrate that this approach is successful at developing phylogenomic data sets from raw genome sequencing reads. Further, we found that with coverages above 5-10×, aTRAM was successful at assembling 80-90% of the contigs for both close and distantly related taxa.
As sequencing costs continue to decline, we expect full genome sequencing will become more feasible for a wider array of organisms, and aTRAM will enable mining of these genomic data sets for an extensive variety of applications, including phylogenomics.
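
The coverage figures quoted in the abstract (10-30× for genomes under 1000 Mbp, and the 5-10× threshold for successful assembly) can be related to sequencing effort with the standard mean-coverage relation: coverage = number of reads × read length / genome size. The sketch below is only an illustration of that arithmetic. The function names, the 100 bp read length, and the 200 Mbp genome size are assumptions chosen for the example and do not come from the paper.

```python
import math


def mean_coverage(n_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Mean coverage depth: total sequenced bases divided by genome size."""
    return n_reads * read_length_bp / genome_size_bp


def reads_needed(genome_size_bp: int, coverage: float, read_length_bp: int) -> int:
    """Smallest number of reads whose total length reaches the target mean coverage."""
    return math.ceil(genome_size_bp * coverage / read_length_bp)


if __name__ == "__main__":
    # Hypothetical inputs for illustration only (not values from the paper).
    genome_size_bp = 200_000_000  # assumed 200 Mbp genome
    read_length_bp = 100          # assumed 100 bp reads

    # Coverage levels mentioned in the abstract: 5x, 10x and 30x.
    for target in (5, 10, 30):
        n = reads_needed(genome_size_bp, target, read_length_bp)
        print(f"{target}x coverage needs about {n:,} reads of {read_length_bp} bp")
```

This gives only the mean depth. Actual per-gene coverage varies along the genome, so a real study would treat these numbers as a planning estimate rather than a guarantee of assembly success.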

Original language: English (US)
Pages (from-to): 786-798
Number of pages: 13
Journal: Systematic Biology
Volume: 66
Issue number: 5
DOIs: https://doi.org/10.1093/sysbio/syw105
State: Published - Sep 1 2017

Fingerprint

genome
Genome
application coverage
Anoplura
gene
Genes
methodology
organisms
genomics
genes
phylogenetics
louse
matrix
protein
phylogeny
method
topology
exons
Exons
Proteins

Keywords

  • aTRAM
  • gene assembly
  • genome sequencing
  • phylogenomics

ASJC Scopus subject areas

  • Ecology, Evolution, Behavior and Systematics
  • Genetics

Cite this

Allen, J. M., Boyd, B., Nguyen, N. P., Vachaspati, P., Warnow, T., Huang, D. I., ... Johnson, K. P. (2017). Phylogenomics from whole genome sequences using aTRAM. Systematic Biology, 66(5), 786-798. https://doi.org/10.1093/sysbio/syw105

Phylogenomics from whole genome sequences using aTRAM. / Allen, Julie M.; Boyd, Bret; Nguyen, Nam Phuong; Vachaspati, Pranjal; Warnow, Tandy; Huang, Daisie I.; Grady, Patrick G.S.; Bell, Kayce C.; Cronk, Quentin C.B.; Mugisha, Lawrence; Pittendrigh, Barry R.; Leonardi, M. Soledad; Reed, David L.; Johnson, Kevin Paul.

In: Systematic Biology, Vol. 66, No. 5, 01.09.2017, p. 786-798.

Allen, JM, Boyd, B, Nguyen, NP, Vachaspati, P, Warnow, T, Huang, DI, Grady, PGS, Bell, KC, Cronk, QCB, Mugisha, L, Pittendrigh, BR, Leonardi, MS, Reed, DL & Johnson, KP 2017, 'Phylogenomics from whole genome sequences using aTRAM', Systematic Biology, vol. 66, no. 5, pp. 786-798. https://doi.org/10.1093/sysbio/syw105

Allen JM, Boyd B, Nguyen NP, Vachaspati P, Warnow T, Huang DI et al. Phylogenomics from whole genome sequences using aTRAM. Systematic Biology. 2017 Sep 1;66(5):786-798. https://doi.org/10.1093/sysbio/syw105

Allen, Julie M. ; Boyd, Bret ; Nguyen, Nam Phuong ; Vachaspati, Pranjal ; Warnow, Tandy ; Huang, Daisie I. ; Grady, Patrick G.S. ; Bell, Kayce C. ; Cronk, Quentin C.B. ; Mugisha, Lawrence ; Pittendrigh, Barry R. ; Leonardi, M. Soledad ; Reed, David L. ; Johnson, Kevin Paul. / Phylogenomics from whole genome sequences using aTRAM. In: Systematic Biology. 2017 ; Vol. 66, No. 5. pp. 786-798.

@article{a9107051ab60434cbe9e202276e65139,
title = "Phylogenomics from whole genome sequences using aTRAM",
keywords = "aTRAM, gene assembly, genome sequencing, phylogenomics",
author = "Allen, {Julie M.} and Bret Boyd and Nguyen, {Nam Phuong} and Pranjal Vachaspati and Tandy Warnow and Huang, {Daisie I.} and Grady, {Patrick G.S.} and Bell, {Kayce C.} and Cronk, {Quentin C.B.} and Lawrence Mugisha and Pittendrigh, {Barry R.} and Leonardi, {M. Soledad} and Reed, {David L.} and Johnson, {Kevin Paul}",
year = "2017",
month = "9",
day = "1",
doi = "10.1093/sysbio/syw105",
language = "English (US)",
volume = "66",
pages = "786--798",
journal = "Systematic Biology",
issn = "1063-5157",
publisher = "Oxford University Press",
number = "5",

}

TY - JOUR

T1 - Phylogenomics from whole genome sequences using aTRAM

AU - Allen, Julie M.

AU - Boyd, Bret

AU - Nguyen, Nam Phuong

AU - Vachaspati, Pranjal

AU - Warnow, Tandy

AU - Huang, Daisie I.

AU - Grady, Patrick G.S.

AU - Bell, Kayce C.

AU - Cronk, Quentin C.B.

AU - Mugisha, Lawrence

AU - Pittendrigh, Barry R.

AU - Leonardi, M. Soledad

AU - Reed, David L.

AU - Johnson, Kevin Paul

PY - 2017/9/1

Y1 - 2017/9/1

KW - aTRAM

KW - gene assembly

KW - genome sequencing

KW - phylogenomics

UR - http://www.scopus.com/inward/record.url?scp=85017514283&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85017514283&partnerID=8YFLogxK

U2 - 10.1093/sysbio/syw105

DO - 10.1093/sysbio/syw105

M3 - Article

C2 - 28123117

AN - SCOPUS:85017514283

VL - 66

SP - 786

EP - 798

JO - Systematic Biology

JF - Systematic Biology

SN - 1063-5157

IS - 5

ER -