TY - JOUR
T1 - Simulating next-generation sequencing datasets from empirical mutation and sequencing models
AU - Stephens, Zachary D.
AU - Hudson, Matthew E.
AU - Mainzer, Liudmila S.
AU - Taschuk, Morgan
AU - Weber, Matthew R.
AU - Iyer, Ravishankar K.
N1 - Funding Information:
ZS and RI were supported by NSF grant MRI13-37732. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. We thank Victor Jongeneel at the National Center for Supercomputing Applications and Francis Ouellette at the Ontario Institute for Cancer Research for their guidance during the development of this software. The mutational models from ICGC Breast (BRCA-US) and Melanoma (SKCM-US) data published here are in whole based upon data generated by the TCGA Research Network: http://cancergenome.nih.gov/.
Publisher Copyright:
© 2016 Stephens et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
PY - 2016/11
Y1 - 2016/11
AB - An obstacle to validating and benchmarking methods for genome analysis is that there are few reference datasets available for which the "ground truth" about the mutational landscape of the sample genome is known and fully validated. Additionally, the free and public availability of real human genome datasets is incompatible with the preservation of donor privacy. In order to better analyze and understand genomic data, we need test datasets that model all variants, reflecting known biology as well as sequencing artifacts. Read simulators can fulfill this requirement, but are often criticized for limited resemblance to true data and overall inflexibility. We present NEAT (NExt-generation sequencing Analysis Toolkit), a set of tools that not only includes an easy-to-use read simulator, but also scripts to facilitate variant comparison and tool evaluation. NEAT has a wide variety of tunable parameters which can be set manually on the default model or parameterized using real datasets. The software is freely available at github.com/zstephens/neat-genreads.
UR - http://www.scopus.com/inward/record.url?scp=84997402772&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84997402772&partnerID=8YFLogxK
U2 - 10.1371/journal.pone.0167047
DO - 10.1371/journal.pone.0167047
M3 - Article
C2 - 27893777
AN - SCOPUS:84997402772
SN - 1932-6203
VL - 11
JO - PLOS ONE
JF - PLOS ONE
IS - 11
M1 - e0167047
ER -