The bench scientist's guide to statistical analysis of RNA-Seq data

Craig R. Yendrek, Elizabeth Ainsworth, Jyothi Thimmapuram

Research output: Contribution to journal (Article)

Abstract

Background: RNA sequencing (RNA-Seq) is emerging as a highly accurate method to quantify transcript abundance. However, analyses of the large data sets obtained by sequencing the entire transcriptome of organisms have generally been performed by bioinformatics specialists. Here we provide a step-by-step guide and outline a strategy using currently available statistical tools that results in a conservative list of differentially expressed genes. We also discuss potential sources of error in RNA-Seq analysis that could alter interpretation of global changes in gene expression.

Findings: When comparing statistical tools, the negative binomial distribution-based methods, edgeR and DESeq, respectively identified 11,995 and 11,317 differentially expressed genes from an RNA-Seq dataset generated from soybean leaf tissue grown in elevated O₃. However, the number of genes in common between these two methods was only 10,535, resulting in 2,242 genes determined to be differentially expressed by only one method. Upon analysis of the non-significant genes, several limitations of these analytic tools were revealed, including evidence for overly stringent parameters for determining statistical significance of differentially expressed genes as well as increased type II error for high abundance transcripts.

Conclusions: Because of the high variability between methods for determining differential expression of RNA-Seq data, we suggest using several bioinformatics tools, as outlined here, to ensure that a conservative list of differentially expressed genes is obtained. We also conclude that despite these analytical limitations, RNA-Seq provides highly accurate transcript abundance quantification that is comparable to qRT-PCR.
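The overlap figures reported in the abstract can be checked with simple arithmetic. The following is a minimal sketch in Python that uses only the three counts stated above; the function and variable names are illustrative assumptions and do not come from the paper or from edgeR or DESeq.

```python
# Counts of differentially expressed genes, as reported in the abstract.
EDGER_TOTAL = 11_995   # genes called by edgeR
DESEQ_TOTAL = 11_317   # genes called by DESeq
SHARED = 10_535        # genes called by both methods


def overlap_summary(a_total: int, b_total: int, shared: int) -> dict:
    """Derive method-specific and combined counts from two totals and their overlap.

    Hypothetical helper for illustration only.
    """
    a_only = a_total - shared          # called by the first method only
    b_only = b_total - shared          # called by the second method only
    single_method = a_only + b_only    # called by exactly one method
    union = shared + single_method     # called by at least one method
    return {
        "a_only": a_only,
        "b_only": b_only,
        "single_method": single_method,
        "union": union,
    }


summary = overlap_summary(EDGER_TOTAL, DESEQ_TOTAL, SHARED)
print(summary)
```

The sum of the two method-specific counts matches the 2,242 genes the abstract attributes to a single method. The shared set of 10,535 genes corresponds to the conservative list that the authors recommend.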

Original language: English (US)
Article number: 506
Journal: BMC Research Notes
Volume: 5
DOI: 10.1186/1756-0500-5-506
State: Published - Sep 18, 2012

Fingerprint

RNA Sequence Analysis
Statistical methods
Genes
RNA
Bioinformatics
Computational Biology
Binomial Distribution
Transcriptome
Soybeans
Gene expression
Research Design
Tissue
Polymerase Chain Reaction

Keywords

  • Differential Expression
  • RNA-Seq
  • Statistical analysis

ASJC Scopus subject areas

  • Biochemistry, Genetics and Molecular Biology (all)

Cite this

The bench scientist's guide to statistical analysis of RNA-Seq data. / Yendrek, Craig R.; Ainsworth, Elizabeth; Thimmapuram, Jyothi.

In: BMC Research Notes, Vol. 5, 506, 18.09.2012.


@article{5b90e759ef16456ca05ee2ab3fae188c,
title = "The bench scientist's guide to statistical analysis of RNA-Seq data",
keywords = "Differential Expression, RNA-Seq, Statistical analysis",
author = "Yendrek, Craig R. and Ainsworth, Elizabeth and Thimmapuram, Jyothi",
year = "2012",
month = "9",
day = "18",
doi = "10.1186/1756-0500-5-506",
language = "English (US)",
volume = "5",
journal = "BMC Research Notes",
issn = "1756-0500",
publisher = "BioMed Central",
}

TY - JOUR

T1 - The bench scientist's guide to statistical analysis of RNA-Seq data

AU - Yendrek, Craig R.

AU - Ainsworth, Elizabeth

AU - Thimmapuram, Jyothi

PY - 2012/9/18

Y1 - 2012/9/18

KW - Differential Expression

KW - RNA-Seq

KW - Statistical analysis

UR - http://www.scopus.com/inward/record.url?scp=84866145935&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84866145935&partnerID=8YFLogxK

U2 - 10.1186/1756-0500-5-506

DO - 10.1186/1756-0500-5-506

M3 - Article

C2 - 22980220

AN - SCOPUS:84866145935

VL - 5

JO - BMC Research Notes

JF - BMC Research Notes

SN - 1756-0500

M1 - 506

ER -