Efficient and scalable workflows for genomic analyses

Subho S. Banerjee, Arjun P. Athreya, Liudmila Sergeevna Mainzer, Cornelis Jongeneel, Wen-Mei W Hwu, Zbigniew T Kalbarczyk, Ravishankar K Iyer

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Recent growth in the volume of DNA sequence data, and the associated computational cost of extracting meaningful information from it, calls for efficient computational systems at scale. In this work, we propose the Illinois Genomics Execution Environment (IGen), a framework for efficient and scalable genome analyses. The design philosophy of IGen is based on algorithmic analysis and extensive measurements of compute- and data-intensive genomic analysis workflows (such as variant discovery and genotyping analysis) executed on high-performance and cloud computing infrastructures. IGen leverages the advantages of existing designs and proposes new software improvements to overcome the inefficiencies we observe in our measurements. Based on these composite improvements, we demonstrate that IGen accelerates alignment from 13.1 hours to 10.8 hours (1.2×) and variant calling from 10.1 hours to 1.25 hours (8×) on a single node, and that its modular design scales efficiently in a parallel computing environment.
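The speedup factors quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the stage runtimes come from the abstract, while the function names and the combined figure are assumptions made here (the combined figure further assumes the two stages run back to back on one node with no overlap, which the abstract does not state).

```python
# Stage runtimes in hours as (baseline, IGen), taken from the abstract.
STAGES = {
    "alignment": (13.1, 10.8),
    "variant calling": (10.1, 1.25),
}


def speedup(baseline_hours, improved_hours):
    """Return how many times faster the improved run is than the baseline."""
    return baseline_hours / improved_hours


def combined_speedup(stages):
    """Speedup over all stages run sequentially (hypothetical: no overlap)."""
    baseline_total = sum(before for before, _ in stages.values())
    improved_total = sum(after for _, after in stages.values())
    return baseline_total / improved_total


if __name__ == "__main__":
    for name, (before, after) in STAGES.items():
        print(f"{name}: {speedup(before, after):.2f}x")
    print(f"combined: {combined_speedup(STAGES):.2f}x")
```

The per-stage ratios (about 1.21 and 8.08) agree with the rounded 1.2× and 8× in the abstract. Under the sequential assumption the two stages together would improve by a little under 2×, because alignment dominates the remaining runtime.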

Original language: English (US)
Title of host publication: DIDC 2016 - Proceedings of the ACM International Workshop on Data-Intensive Distributed Computing
Publisher: Association for Computing Machinery, Inc
Pages: 27-36
Number of pages: 10
ISBN (Electronic): 9781450343527
DOI: https://doi.org/10.1145/2912152.2912156
State: Published - Jun 1 2016
Event: 6th ACM International Workshop on Data-Intensive Distributed Computing, DIDC 2016 - Kyoto, Japan
Duration: Jun 1 2016 → …

Publication series

Name: DIDC 2016 - Proceedings of the ACM International Workshop on Data-Intensive Distributed Computing

Other

Other: 6th ACM International Workshop on Data-Intensive Distributed Computing, DIDC 2016
Country: Japan
City: Kyoto
Period: 6/1/16 → …

Fingerprint

Work Flow
Genomics
DNA sequences
Parallel processing systems
Cloud computing
Modular Design
Parallel Computing
Cloud Computing
Leverage
Genes
DNA Sequence
Accelerate
Computational Cost
Genome
Alignment
Infrastructure
High Performance
Composite
Composite materials
Software

Keywords

  • Bioinformatics
  • Design
  • Genomics
  • Measurement
  • Performance

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Applied Mathematics

Cite this

Banerjee, S. S., Athreya, A. P., Mainzer, L. S., Jongeneel, C., Hwu, W-M. W., Kalbarczyk, Z. T., & Iyer, R. K. (2016). Efficient and scalable workflows for genomic analyses. In DIDC 2016 - Proceedings of the ACM International Workshop on Data-Intensive Distributed Computing (pp. 27-36). (DIDC 2016 - Proceedings of the ACM International Workshop on Data-Intensive Distributed Computing). Association for Computing Machinery, Inc. https://doi.org/10.1145/2912152.2912156

@inproceedings{c4f664bebe0248a9bed35f07c9ce5b4a,
title = "Efficient and scalable workflows for genomic analyses",
abstract = "Recent growth in the volume of DNA sequence data, and the associated computational cost of extracting meaningful information from it, calls for efficient computational systems at scale. In this work, we propose the Illinois Genomics Execution Environment (IGen), a framework for efficient and scalable genome analyses. The design philosophy of IGen is based on algorithmic analysis and extensive measurements of compute- and data-intensive genomic analysis workflows (such as variant discovery and genotyping analysis) executed on high-performance and cloud computing infrastructures. IGen leverages the advantages of existing designs and proposes new software improvements to overcome the inefficiencies we observe in our measurements. Based on these composite improvements, we demonstrate that IGen accelerates alignment from 13.1 hours to 10.8 hours (1.2×) and variant calling from 10.1 hours to 1.25 hours (8×) on a single node, and that its modular design scales efficiently in a parallel computing environment.",
keywords = "Bioinformatics, Design, Genomics, Measurement, Performance",
author = "Banerjee, {Subho S.} and Athreya, {Arjun P.} and Mainzer, {Liudmila Sergeevna} and Jongeneel, Cornelis and Hwu, {Wen-Mei W} and Kalbarczyk, {Zbigniew T} and Iyer, {Ravishankar K}",
year = "2016",
month = "6",
day = "1",
doi = "10.1145/2912152.2912156",
language = "English (US)",
series = "DIDC 2016 - Proceedings of the ACM International Workshop on Data-Intensive Distributed Computing",
publisher = "Association for Computing Machinery, Inc",
pages = "27--36",
booktitle = "DIDC 2016 - Proceedings of the ACM International Workshop on Data-Intensive Distributed Computing",

}

TY - GEN
T1 - Efficient and scalable workflows for genomic analyses
AU - Banerjee, Subho S.
AU - Athreya, Arjun P.
AU - Mainzer, Liudmila Sergeevna
AU - Jongeneel, Cornelis
AU - Hwu, Wen-Mei W
AU - Kalbarczyk, Zbigniew T
AU - Iyer, Ravishankar K
PY - 2016/6/1
Y1 - 2016/6/1
AB - Recent growth in the volume of DNA sequence data, and the associated computational cost of extracting meaningful information from it, calls for efficient computational systems at scale. In this work, we propose the Illinois Genomics Execution Environment (IGen), a framework for efficient and scalable genome analyses. The design philosophy of IGen is based on algorithmic analysis and extensive measurements of compute- and data-intensive genomic analysis workflows (such as variant discovery and genotyping analysis) executed on high-performance and cloud computing infrastructures. IGen leverages the advantages of existing designs and proposes new software improvements to overcome the inefficiencies we observe in our measurements. Based on these composite improvements, we demonstrate that IGen accelerates alignment from 13.1 hours to 10.8 hours (1.2×) and variant calling from 10.1 hours to 1.25 hours (8×) on a single node, and that its modular design scales efficiently in a parallel computing environment.
KW - Bioinformatics
KW - Design
KW - Genomics
KW - Measurement
KW - Performance
UR - http://www.scopus.com/inward/record.url?scp=84978909732&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84978909732&partnerID=8YFLogxK
U2 - 10.1145/2912152.2912156
DO - 10.1145/2912152.2912156
M3 - Conference contribution
T3 - DIDC 2016 - Proceedings of the ACM International Workshop on Data-Intensive Distributed Computing
SP - 27
EP - 36
BT - DIDC 2016 - Proceedings of the ACM International Workshop on Data-Intensive Distributed Computing
PB - Association for Computing Machinery, Inc
ER -