TY - JOUR
T1 - BioWorkbench
T2 - A high-performance framework for managing and analyzing bioinformatics experiments
AU - Mondelli, Maria Luiza
AU - Magalhães, Thiago
AU - Loss, Guilherme
AU - Wilde, Michael
AU - Foster, Ian
AU - Mattoso, Marta
AU - Katz, Daniel
AU - Barbosa, Helio
AU - de Vasconcelos, Ana Tereza R.
AU - Ocaña, Kary
AU - Gadelha, Luiz M.R.
N1 - Funding Information:
The following grant information was disclosed by the authors: Brazilian funding agencies CNPq, CAPES, and FAPERJ.
Publisher Copyright:
© 2018 Mondelli et al.
PY - 2018
Y1 - 2018
AB - Advances in sequencing techniques have led to exponential growth in biological data, demanding the development of large-scale bioinformatics experiments. Because these experiments are computation- and data-intensive, they require high-performance computing techniques and can benefit from specialized technologies such as Scientific Workflow Management Systems and databases. In this work, we present BioWorkbench, a framework for managing and analyzing bioinformatics experiments. This framework automatically collects provenance data, including both performance data from workflow execution and data from the scientific domain of the workflow application. Provenance data can be analyzed through a web application that abstracts a set of queries to the provenance database, simplifying access to provenance information. We evaluate BioWorkbench using three case studies: SwiftPhylo, a phylogenetic tree assembly workflow; SwiftGECKO, a comparative genomics workflow; and RASflow, a RASopathy analysis workflow. We analyze each workflow from both computational and scientific domain perspectives by querying a provenance and annotation database. Some of these queries are available as a pre-built feature of the BioWorkbench web application. Through the provenance data, we show that the framework is scalable and achieves high performance, reducing the execution time of the case studies by up to 98%. We also show how the application of machine learning techniques can enrich the analysis process.
KW - Bioinformatics
KW - Data analytics
KW - Profiling
KW - Provenance
KW - Scientific workflows
UR - http://www.scopus.com/inward/record.url?scp=85052689365&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85052689365&partnerID=8YFLogxK
U2 - 10.7717/peerj.5551
DO - 10.7717/peerj.5551
M3 - Article
C2 - 30186700
AN - SCOPUS:85052689365
SN - 2167-8359
VL - 2018
JO - PeerJ
JF - PeerJ
IS - 8
M1 - e5551
ER -