Developing reproducible bioinformatics analysis workflows for heterogeneous computing environments to support African genomics

Shakuntala Baichoo, Yassine Souilmi, Sumir Panji, Gerrit Botha, Ayton Meintjes, Scott Hazelhurst, Hocine Bendou, Eugene de Beste, Phelelani T. Mpangase, Oussema Souiai, Mustafa Alghali, Long Yi, Brian D. O'Connor, Michael Crusoe, Don Armstrong, Shaun Aron, Fourie Joubert, Azza E. Ahmed, Mamana Mbiyavanga, Peter van Heusden, Lerato E. Magosi, Jennie Zermeno, Liudmila Sergeevna Mainzer, Faisal M. Fadlelmola, C. Victor Jongeneel, Nicola Mulder

Research output: Contribution to journal › Article

Abstract

Background: The Pan-African bioinformatics network, H3ABioNet, comprises 27 research institutions in 17 African countries. H3ABioNet is part of the Human Heredity and Health in Africa program (H3Africa), an African-led research consortium funded by the US National Institutes of Health and the UK Wellcome Trust, which aims to use genomics to study and improve the health of Africans. A key role of H3ABioNet is to support H3Africa projects by building bioinformatics infrastructure, such as portable and reproducible bioinformatics workflows for use in heterogeneous African computing environments. Processing and analysis of genomic data is an example of a big data application requiring complex, interdependent data analysis workflows. Such bioinformatics workflows take the primary and secondary input data through several computationally intensive processing steps using different software packages, where some of the outputs form inputs for other steps. Implementing scalable, reproducible, portable and easy-to-use workflows is particularly challenging.

Results: H3ABioNet has built four workflows to support (1) the calling of variants from high-throughput sequencing data; (2) the analysis of microbial populations from 16S rDNA sequence data; (3) genotyping and genome-wide association studies; and (4) single nucleotide polymorphism imputation. A week-long hackathon was organized in August 2016 with participants from six African bioinformatics groups, along with US and European collaborators. Two of the workflows are built using the Common Workflow Language (CWL) framework and two using Nextflow. All the workflows are containerized using Docker for improved portability and reproducibility, and are publicly available for use by members of the H3Africa consortium and the international research community.

Conclusion: The H3ABioNet workflows have been implemented with a view to offering ease of use for the end user and high levels of reproducibility and portability, all while following modern, state-of-the-art bioinformatics data processing protocols. The H3ABioNet workflows will serve the H3Africa consortium projects and are currently in use. All four workflows are also publicly available for research scientists worldwide to use and adapt for their respective needs. The H3ABioNet workflows will help develop bioinformatics capacity, assist genomics research within Africa, and increase the scientific output of H3Africa and its Pan-African Bioinformatics Network.

Original language: English (US)
Article number: 457
Journal: BMC Bioinformatics
Volume: 19
Issue number: 1
DOI: https://doi.org/10.1186/s12859-018-2446-1
State: Published - Nov 29 2018

Keywords

  • Africa
  • Bioinformatics
  • Docker
  • Genomics
  • Pipeline
  • Reproducibility
  • Workflows

ASJC Scopus subject areas

  • Structural Biology
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics

Cite this

Developing reproducible bioinformatics analysis workflows for heterogeneous computing environments to support African genomics. / Baichoo, Shakuntala; Souilmi, Yassine; Panji, Sumir; Botha, Gerrit; Meintjes, Ayton; Hazelhurst, Scott; Bendou, Hocine; Beste, Eugene de; Mpangase, Phelelani T.; Souiai, Oussema; Alghali, Mustafa; Yi, Long; O'Connor, Brian D.; Crusoe, Michael; Armstrong, Don; Aron, Shaun; Joubert, Fourie; Ahmed, Azza E.; Mbiyavanga, Mamana; Heusden, Peter van; Magosi, Lerato E.; Zermeno, Jennie; Mainzer, Liudmila Sergeevna; Fadlelmola, Faisal M.; Jongeneel, C. Victor; Mulder, Nicola.

In: BMC bioinformatics, Vol. 19, No. 1, 457, 29.11.2018.

Baichoo, S, Souilmi, Y, Panji, S, Botha, G, Meintjes, A, Hazelhurst, S, Bendou, H, Beste, ED, Mpangase, PT, Souiai, O, Alghali, M, Yi, L, O'Connor, BD, Crusoe, M, Armstrong, D, Aron, S, Joubert, F, Ahmed, AE, Mbiyavanga, M, Heusden, PV, Magosi, LE, Zermeno, J, Mainzer, LS, Fadlelmola, FM, Jongeneel, CV & Mulder, N 2018, 'Developing reproducible bioinformatics analysis workflows for heterogeneous computing environments to support African genomics', BMC bioinformatics, vol. 19, no. 1, 457. https://doi.org/10.1186/s12859-018-2446-1
@article{1e2e29a58590413fa7beb78019873dff,
title = "Developing reproducible bioinformatics analysis workflows for heterogeneous computing environments to support African genomics",
keywords = "Africa, Bioinformatics, Docker, Genomics, Pipeline, Reproducibility, Workflows",
author = "Shakuntala Baichoo and Yassine Souilmi and Sumir Panji and Gerrit Botha and Ayton Meintjes and Scott Hazelhurst and Hocine Bendou and Beste, {Eugene de} and Mpangase, {Phelelani T.} and Oussema Souiai and Mustafa Alghali and Long Yi and O'Connor, {Brian D.} and Michael Crusoe and Don Armstrong and Shaun Aron and Fourie Joubert and Ahmed, {Azza E.} and Mamana Mbiyavanga and Heusden, {Peter van} and Magosi, {Lerato E.} and Jennie Zermeno and Mainzer, {Liudmila Sergeevna} and Fadlelmola, {Faisal M.} and Jongeneel, {C. Victor} and Nicola Mulder",
year = "2018",
month = "11",
day = "29",
doi = "10.1186/s12859-018-2446-1",
language = "English (US)",
volume = "19",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",
number = "1",

}

TY - JOUR
T1 - Developing reproducible bioinformatics analysis workflows for heterogeneous computing environments to support African genomics
AU - Baichoo, Shakuntala
AU - Souilmi, Yassine
AU - Panji, Sumir
AU - Botha, Gerrit
AU - Meintjes, Ayton
AU - Hazelhurst, Scott
AU - Bendou, Hocine
AU - Beste, Eugene de
AU - Mpangase, Phelelani T.
AU - Souiai, Oussema
AU - Alghali, Mustafa
AU - Yi, Long
AU - O'Connor, Brian D.
AU - Crusoe, Michael
AU - Armstrong, Don
AU - Aron, Shaun
AU - Joubert, Fourie
AU - Ahmed, Azza E.
AU - Mbiyavanga, Mamana
AU - Heusden, Peter van
AU - Magosi, Lerato E.
AU - Zermeno, Jennie
AU - Mainzer, Liudmila Sergeevna
AU - Fadlelmola, Faisal M.
AU - Jongeneel, C. Victor
AU - Mulder, Nicola
PY - 2018/11/29
Y1 - 2018/11/29
KW - Africa
KW - Bioinformatics
KW - Docker
KW - Genomics
KW - Pipeline
KW - Reproducibility
KW - Workflows
UR - http://www.scopus.com/inward/record.url?scp=85057383865&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85057383865&partnerID=8YFLogxK
U2 - 10.1186/s12859-018-2446-1
DO - 10.1186/s12859-018-2446-1
M3 - Article
VL - 19
JO - BMC Bioinformatics
JF - BMC Bioinformatics
SN - 1471-2105
IS - 1
M1 - 457
ER -