Imputation of unordered markers and the impact on genomic selection accuracy

Jessica E. Rutkoski, Jesse Poland, Jean Luc Jannink, Mark E. Sorrells

Research output: Contribution to journal › Article

Abstract

Genomic selection, a breeding method that promises to accelerate rates of genetic gain, requires dense, genome-wide marker data. Genotyping-by-sequencing can generate a large number of de novo markers. However, without a reference genome, these markers are unordered and typically have a large proportion of missing data. Because marker imputation algorithms were developed for species with a reference genome, algorithms suited for unordered markers have not been rigorously evaluated. Using four empirical datasets, we evaluate and characterize four such imputation methods, referred to as k-nearest neighbors, singular value decomposition, random forest regression, and expectation maximization imputation, in terms of their imputation accuracies and the factors affecting accuracy. The effect of imputation method on genomic selection accuracy is assessed in comparison with mean imputation. The effect of excluding markers with a large proportion of missing data on genomic selection accuracy is also examined. Our results show that imputation of unordered markers can be accurate, especially when linkage disequilibrium between markers is high and genotyped individuals are related. Of the methods evaluated, random forest regression imputation produced superior accuracy. In comparison with mean imputation, all four imputation methods we evaluated led to greater genomic selection accuracies when the level of missing data was high. Including rather than excluding markers with a large proportion of missing data nearly always led to greater genomic selection accuracies. We conclude that high levels of missing data in dense marker sets are not a major obstacle for genomic selection, even when marker order is not known.
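To make the contrast between mean imputation and a neighbor-based method concrete, the following is a minimal sketch on simulated data. It is not the authors' implementation and does not use their datasets. The 0/1/2 marker coding, the block-based simulation of linkage disequilibrium, the choice of markers (rather than individuals) as neighbors, the mean-squared-error score, and all function names are assumptions made for illustration.

```python
import random


def simulate_markers(n_ind=60, n_blocks=10, block_size=5, flip=0.05, seed=1):
    """Return an individuals x markers matrix coded 0/1/2 (hypothetical data).

    Markers in one block are noisy copies of a shared latent genotype, which
    mimics high linkage disequilibrium. Columns are shuffled so that marker
    order carries no information.
    """
    rng = random.Random(seed)
    cols = []
    for _ in range(n_blocks):
        latent = [rng.choice([0, 1, 2]) for _ in range(n_ind)]
        for _ in range(block_size):
            cols.append([g if rng.random() > flip else rng.choice([0, 1, 2])
                         for g in latent])
    rng.shuffle(cols)
    return [list(row) for row in zip(*cols)]


def mask(X, frac=0.3, seed=2):
    """Set a random fraction of entries to None to represent missing calls."""
    rng = random.Random(seed)
    return [[None if rng.random() < frac else v for v in row] for row in X]


def mean_impute(M):
    """Replace each missing entry with the mean of its marker's observed values."""
    out = [row[:] for row in M]
    for j in range(len(M[0])):
        obs = [row[j] for row in M if row[j] is not None]
        mu = sum(obs) / len(obs) if obs else 1.0  # midpoint if nothing observed
        for row in out:
            if row[j] is None:
                row[j] = mu
    return out


def marker_distance(M, a, b):
    """Mean squared difference between two markers over co-observed individuals."""
    pairs = [(r[a], r[b]) for r in M if r[a] is not None and r[b] is not None]
    if not pairs:
        return float("inf")
    return sum((x - y) ** 2 for x, y in pairs) / len(pairs)


def knn_impute(M, k=3):
    """Fill each missing entry with the average of the k most similar markers
    that are observed for the same individual; no marker positions are used."""
    m = len(M[0])
    fallback = mean_impute(M)
    dist = [[marker_distance(M, a, b) for b in range(m)] for a in range(m)]
    out = [row[:] for row in M]
    for j in range(m):
        order = sorted((b for b in range(m) if b != j), key=lambda b: dist[j][b])
        for i, row in enumerate(M):
            if row[j] is None:
                donors = [row[b] for b in order if row[b] is not None][:k]
                out[i][j] = sum(donors) / len(donors) if donors else fallback[i][j]
    return out


def masked_mse(truth, masked, imputed):
    """Mean squared error over the entries that were masked."""
    errs = [(imputed[i][j] - truth[i][j]) ** 2
            for i, row in enumerate(masked)
            for j, v in enumerate(row) if v is None]
    return sum(errs) / len(errs)


truth = simulate_markers()
observed = mask(truth)
mse_mean = masked_mse(truth, observed, mean_impute(observed))
mse_knn = masked_mse(truth, observed, knn_impute(observed))
print(f"mean imputation MSE: {mse_mean:.3f}")
print(f"kNN imputation MSE:  {mse_knn:.3f}")
```

Because the simulated markers within a block are strongly correlated, the neighbor-based fill can borrow information that the marker mean ignores, which is in line with the abstract's statement that accuracy benefits from high linkage disequilibrium. The sketch says nothing about the relative ranking of the four methods studied in the article.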

Original language: English (US)
Pages (from-to): 427-439
Number of pages: 13
Journal: G3: Genes, Genomes, Genetics
Volume: 3
Issue number: 3
DOI: 10.1534/g3.112.005363
State: Published - Mar 2013
Externally published: Yes

Fingerprint

Genome
Linkage Disequilibrium
Breeding
Forests
Datasets

Keywords

  • Algorithms
  • Genomic
  • Genotyping-by-sequencing
  • GenPred
  • Imputation
  • Resources
  • Selection
  • Shared data

ASJC Scopus subject areas

  • Molecular Biology
  • Genetics
  • Genetics (clinical)

Cite this

Imputation of unordered markers and the impact on genomic selection accuracy. / Rutkoski, Jessica E.; Poland, Jesse; Jannink, Jean Luc; Sorrells, Mark E.

In: G3: Genes, Genomes, Genetics, Vol. 3, No. 3, 03.2013, p. 427-439.

@article{5816564d686b4e1692a0baa7ac2720e0,
title = "Imputation of unordered markers and the impact on genomic selection accuracy",
keywords = "Algorithms, Genomic, Genotyping-by-sequencing, GenPred, Imputation, Resources, Selection, Shared data",
author = "Rutkoski, {Jessica E.} and Jesse Poland and Jannink, {Jean Luc} and Sorrells, {Mark E.}",
year = "2013",
month = "3",
doi = "10.1534/g3.112.005363",
language = "English (US)",
volume = "3",
pages = "427--439",
journal = "G3: Genes, Genomes, Genetics",
issn = "2160-1836",
publisher = "Genetics Society of America",
number = "3",

}

TY - JOUR

T1 - Imputation of unordered markers and the impact on genomic selection accuracy

AU - Rutkoski, Jessica E.

AU - Poland, Jesse

AU - Jannink, Jean Luc

AU - Sorrells, Mark E.

PY - 2013/3

Y1 - 2013/3

KW - Algorithms

KW - Genomic

KW - Genotyping-by-sequencing

KW - GenPred

KW - Imputation

KW - Resources

KW - Selection

KW - Shared data

UR - http://www.scopus.com/inward/record.url?scp=84883302611&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84883302611&partnerID=8YFLogxK

U2 - 10.1534/g3.112.005363

DO - 10.1534/g3.112.005363

M3 - Article

C2 - 23449944

AN - SCOPUS:84883302611

VL - 3

SP - 427

EP - 439

JO - G3: Genes, Genomes, Genetics

JF - G3: Genes, Genomes, Genetics

SN - 2160-1836

IS - 3

ER -