TY - JOUR
T1 - Parallelizing XML data-streaming workflows via MapReduce
AU - Zinn, Daniel
AU - Bowers, Shawn
AU - Köhler, Sven
AU - Ludäscher, Bertram
N1 - This work was supported in part by NSF awards IIS-0612326, OCI-0722079, DBI-0619060, DE-FC02-07ER25811 and IIS-0630033. The authors also thank Timothy McPhillips who first suggested and subsequently developed and implemented an approach called COMAD (Collection-Oriented Modeling And Design) [48,49] based on an assembly line metaphor. Our XML Processing Pipelines are an abstract version of the COMAD idea. They also thank Jianwu Wang and the anonymous reviewers for valuable comments on an earlier draft.
PY - 2010
Y1 - 2010
AB - In prior work it has been shown that the design of scientific workflows can benefit from a collection-oriented modeling paradigm which views scientific workflows as pipelines of XML stream processors. In this paper, we present approaches for exploiting data parallelism in XML processing pipelines through novel compilation strategies to the MapReduce framework. Pipelines in our approach consist of sequences of processing steps that receive XML-structured data and produce, often through calls to "black-box" (scientific) functions, modified (i.e., updated) XML structures. Our main contributions are (i) the development of a set of strategies for compiling scientific workflows, modeled as XML processing pipelines, into parallel MapReduce networks, and (ii) a discussion of their advantages and trade-offs, based on a thorough experimental evaluation of the various translation strategies. Our evaluation uses the Hadoop MapReduce system as an implementation platform. Our results show that execution times of XML workflow pipelines can be significantly reduced using our compilation strategies. These efficiency gains, together with the benefits of MapReduce (e.g., fault tolerance), make our approach ideal for executing large-scale, compute-intensive XML-based scientific workflows.
KW - Collection-Oriented Modeling and Design (COMAD)
KW - Data stream processing
KW - Grouping
KW - MapReduce
KW - Parallelization
KW - Static analysis
KW - Virtual Data Assembly Line (VDAL)
KW - XML processing pipelines
UR - http://www.scopus.com/inward/record.url?scp=77955309365&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77955309365&partnerID=8YFLogxK
U2 - 10.1016/j.jcss.2009.11.006
DO - 10.1016/j.jcss.2009.11.006
M3 - Article
AN - SCOPUS:77955309365
SN - 0022-0000
VL - 76
SP - 447
EP - 463
JO - Journal of Computer and System Sciences
JF - Journal of Computer and System Sciences
IS - 6
ER -