A comparison of genotyping-by-sequencing analysis methods on low-coverage crop datasets shows advantages of a new workflow, GB-eaSy

Daniel P. Wickland, Gopal Battu, Karen A. Hudson, Brian W. Diers, Matthew E. Hudson

Research output: Contribution to journal › Article

Abstract

Background: Genotyping-by-sequencing (GBS), a method to identify genetic variants and quickly genotype samples, reduces genome complexity by using restriction enzymes to divide the genome into fragments whose ends are sequenced on short-read sequencing platforms. While cost-effective, this method produces extensive missing data and requires complex bioinformatics analysis. GBS is most commonly used on crop plant genomes, and because crop plants have highly variable ploidy and repeat content, the performance of GBS analysis software can vary by target organism. Here we focus our analysis on soybean, a polyploid crop with a highly duplicated genome, relatively little public GBS data and few dedicated tools.

Results: We compared the performance of five GBS pipelines using low-coverage Illumina sequence data from three soybean populations. To address issues identified with existing methods, we developed GB-eaSy, a GBS bioinformatics workflow that incorporates widely used genomics tools, parallelization and automation to increase the accuracy and accessibility of GBS data analysis. Compared to other GBS pipelines, GB-eaSy rapidly and accurately identified the greatest number of SNPs, with SNP calls closely concordant with whole-genome sequencing of selected lines. Across all five GBS analysis platforms, SNP calls showed unexpectedly low convergence but generally high accuracy, indicating that the workflows arrived at largely complementary sets of valid SNP calls on the low-coverage data analyzed.

Conclusions: We show that GB-eaSy is approximately as good as, or better than, other leading software solutions in the accuracy, yield and missing data fraction of variant calling, as tested on low-coverage genomic data from soybean. It also performs well relative to other solutions in terms of the run time and disk space required. In addition, GB-eaSy is built from existing open-source, modular software packages that are regularly updated and commonly used, making it straightforward to install and maintain. While GB-eaSy outperformed other individual methods on the datasets analyzed, our findings suggest that a comprehensive approach integrating the results from multiple GBS bioinformatics pipelines may be the optimal strategy to obtain the largest, most highly accurate SNP yield possible from low-coverage polyploid sequence data.
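The closing suggestion, integrating SNP calls from several pipelines, can be illustrated with a minimal sketch. It is not part of GB-eaSy and is not the method used in the paper. The pipeline names, the `(chromosome, position, alt_allele)` site key, the `min_support` threshold and the toy data are all hypothetical, chosen only to show how a union or consensus call set might be formed.

```python
from collections import Counter
from typing import Dict, Set, Tuple

# Hypothetical site key: (chromosome, 1-based position, alternate allele).
Site = Tuple[str, int, str]


def merge_snp_calls(calls: Dict[str, Set[Site]], min_support: int = 1) -> Set[Site]:
    """Return the sites reported by at least `min_support` pipelines."""
    # Each pipeline contributes a set, so every site is counted once per pipeline.
    support = Counter(site for sites in calls.values() for site in sites)
    return {site for site, n in support.items() if n >= min_support}


def pairwise_overlap(a: Set[Site], b: Set[Site]) -> float:
    """Jaccard index of two call sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy data, invented for illustration only.
calls = {
    "pipeline_a": {("Gm01", 100, "T"), ("Gm01", 250, "G")},
    "pipeline_b": {("Gm01", 100, "T"), ("Gm02", 75, "A")},
    "pipeline_c": {("Gm01", 100, "T"), ("Gm01", 250, "G"), ("Gm03", 10, "C")},
}

union_sites = merge_snp_calls(calls, min_support=1)      # every site seen anywhere
consensus_sites = merge_snp_calls(calls, min_support=2)  # seen by two or more pipelines
```

Raising `min_support` trades yield for agreement between pipelines. Which threshold is appropriate would depend on per-pipeline accuracy, which this sketch does not model.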

Original language: English (US)
Article number: 586
Journal: BMC Bioinformatics
Volume: 18
Issue number: 1
DOI: 10.1186/s12859-017-2000-6
State: Published - Dec 28 2017

Fingerprint

Workflow
Sequencing
Work Flow
Crops
Single Nucleotide Polymorphism
Coverage
Genes
Bioinformatics
Computational Biology
Soybeans
Genome
Polyploidy
Software
Pipelines
Soybean
Plant Genome
Ploidies
Automation
Genomics
Software packages

Keywords

  • Bioinformatics pipelines
  • Crops
  • GBS
  • Soybean
  • Variant calling
  • WGS

ASJC Scopus subject areas

  • Structural Biology
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics

Cite this

A comparison of genotyping-by-sequencing analysis methods on low-coverage crop datasets shows advantages of a new workflow, GB-eaSy. / Wickland, Daniel P.; Battu, Gopal; Hudson, Karen A.; Diers, Brian W.; Hudson, Matthew E.

In: BMC bioinformatics, Vol. 18, No. 1, 586, 28.12.2017.

@article{a1bb7eb8fb4540ad97d86c44d3a23273,
title = "A comparison of genotyping-by-sequencing analysis methods on low-coverage crop datasets shows advantages of a new workflow, GB-eaSy",
abstract = "Background: Genotyping-by-sequencing (GBS), a method to identify genetic variants and quickly genotype samples, reduces genome complexity by using restriction enzymes to divide the genome into fragments whose ends are sequenced on short-read sequencing platforms. While cost-effective, this method produces extensive missing data and requires complex bioinformatics analysis. GBS is most commonly used on crop plant genomes, and because crop plants have highly variable ploidy and repeat content, the performance of GBS analysis software can vary by target organism. Here we focus our analysis on soybean, a polyploid crop with a highly duplicated genome, relatively little public GBS data and few dedicated tools. Results: We compared the performance of five GBS pipelines using low-coverage Illumina sequence data from three soybean populations. To address issues identified with existing methods, we developed GB-eaSy, a GBS bioinformatics workflow that incorporates widely used genomics tools, parallelization and automation to increase the accuracy and accessibility of GBS data analysis. Compared to other GBS pipelines, GB-eaSy rapidly and accurately identified the greatest number of SNPs, with SNP calls closely concordant with whole-genome sequencing of selected lines. Across all five GBS analysis platforms, SNP calls showed unexpectedly low convergence but generally high accuracy, indicating that the workflows arrived at largely complementary sets of valid SNP calls on the low-coverage data analyzed. Conclusions: We show that GB-eaSy is approximately as good as, or better than, other leading software solutions in the accuracy, yield and missing data fraction of variant calling, as tested on low-coverage genomic data from soybean. It also performs well relative to other solutions in terms of the run time and disk space required. 
In addition, GB-eaSy is built from existing open-source, modular software packages that are regularly updated and commonly used, making it straightforward to install and maintain. While GB-eaSy outperformed other individual methods on the datasets analyzed, our findings suggest that a comprehensive approach integrating the results from multiple GBS bioinformatics pipelines may be the optimal strategy to obtain the largest, most highly accurate SNP yield possible from low-coverage polyploid sequence data.",
keywords = "Bioinformatics pipelines, Crops, GBS, Soybean, Variant calling, WGS",
author = "Wickland, {Daniel P.} and Gopal Battu and Hudson, {Karen A.} and Diers, {Brian W.} and Hudson, {Matthew E.}",
year = "2017",
month = "12",
day = "28",
doi = "10.1186/s12859-017-2000-6",
language = "English (US)",
volume = "18",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",
number = "1",

}

TY - JOUR

T1 - A comparison of genotyping-by-sequencing analysis methods on low-coverage crop datasets shows advantages of a new workflow, GB-eaSy

AU - Wickland, Daniel P.

AU - Battu, Gopal

AU - Hudson, Karen A.

AU - Diers, Brian W.

AU - Hudson, Matthew E.

PY - 2017/12/28

Y1 - 2017/12/28

AB - Background: Genotyping-by-sequencing (GBS), a method to identify genetic variants and quickly genotype samples, reduces genome complexity by using restriction enzymes to divide the genome into fragments whose ends are sequenced on short-read sequencing platforms. While cost-effective, this method produces extensive missing data and requires complex bioinformatics analysis. GBS is most commonly used on crop plant genomes, and because crop plants have highly variable ploidy and repeat content, the performance of GBS analysis software can vary by target organism. Here we focus our analysis on soybean, a polyploid crop with a highly duplicated genome, relatively little public GBS data and few dedicated tools. Results: We compared the performance of five GBS pipelines using low-coverage Illumina sequence data from three soybean populations. To address issues identified with existing methods, we developed GB-eaSy, a GBS bioinformatics workflow that incorporates widely used genomics tools, parallelization and automation to increase the accuracy and accessibility of GBS data analysis. Compared to other GBS pipelines, GB-eaSy rapidly and accurately identified the greatest number of SNPs, with SNP calls closely concordant with whole-genome sequencing of selected lines. Across all five GBS analysis platforms, SNP calls showed unexpectedly low convergence but generally high accuracy, indicating that the workflows arrived at largely complementary sets of valid SNP calls on the low-coverage data analyzed. Conclusions: We show that GB-eaSy is approximately as good as, or better than, other leading software solutions in the accuracy, yield and missing data fraction of variant calling, as tested on low-coverage genomic data from soybean. It also performs well relative to other solutions in terms of the run time and disk space required. In addition, GB-eaSy is built from existing open-source, modular software packages that are regularly updated and commonly used, making it straightforward to install and maintain. While GB-eaSy outperformed other individual methods on the datasets analyzed, our findings suggest that a comprehensive approach integrating the results from multiple GBS bioinformatics pipelines may be the optimal strategy to obtain the largest, most highly accurate SNP yield possible from low-coverage polyploid sequence data.

KW - Bioinformatics pipelines

KW - Crops

KW - GBS

KW - Soybean

KW - Variant calling

KW - WGS

UR - http://www.scopus.com/inward/record.url?scp=85039843898&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85039843898&partnerID=8YFLogxK

U2 - 10.1186/s12859-017-2000-6

DO - 10.1186/s12859-017-2000-6

M3 - Article

C2 - 29281959

AN - SCOPUS:85039843898

VL - 18

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

IS - 1

M1 - 586

ER -