TY - GEN
T1 - Scientific workflow design 2.0
T2 - 2011 IEEE 27th International Conference on Data Engineering, ICDE 2011
AU - Dou, Lei
AU - Zinn, Daniel
AU - McPhillips, Timothy
AU - Kohler, Sven
AU - Riddle, Sean
AU - Bowers, Shawn
AU - Ludäscher, Bertram
PY - 2011
Y1 - 2011
AB - Scientific workflow systems are used to integrate existing software components (actors) into larger analysis pipelines to perform in silico experiments. Current approaches for handling data in nested-collection structures, as required in many scientific domains, lead to many record-management actors (shims) that make the workflow structure overly complex, and as a consequence hard to construct, evolve and maintain. By constructing and executing workflows from bioinformatics and geosciences in the Kepler system, we will demonstrate how COMAD (Collection-Oriented Modeling and Design), an extension of conventional workflow design, addresses these shortcomings. In particular, COMAD provides a hierarchical data stream model (as in XML) and a novel declarative configuration language for actors that functions as a middleware layer between the workflow's data model (streaming nested collections) and the actor's data model (base data and lists thereof). Our approach allows actor developers to focus on the internal actor processing logic oblivious to the workflow structure. Actors can then be re-used in various workflows simply by adapting actor configurations. Due to streaming nested collections and declarative configurations, COMAD workflows can usually be realized as linear data processing pipelines, which often reflect the scientific data analysis intention better than conventional designs. This linear structure not only simplifies actor insertions and deletions (workflow evolution), but also decreases the overall complexity of the workflow, reducing future effort in maintenance.
UR - http://www.scopus.com/inward/record.url?scp=79957845107&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79957845107&partnerID=8YFLogxK
U2 - 10.1109/ICDE.2011.5767938
DO - 10.1109/ICDE.2011.5767938
M3 - Conference contribution
AN - SCOPUS:79957845107
SN - 9781424489589
T3 - Proceedings - International Conference on Data Engineering
SP - 1296
EP - 1299
BT - 2011 IEEE 27th International Conference on Data Engineering, ICDE 2011
Y2 - 11 April 2011 through 16 April 2011
ER -