Parallelizing XML data-streaming workflows via MapReduce

Daniel Zinn, Shawn Bowers, Sven Köhler, Bertram Ludäscher

Research output: Contribution to journal › Article › peer-review


In prior work, it has been shown that the design of scientific workflows can benefit from a collection-oriented modeling paradigm which views scientific workflows as pipelines of XML stream processors. In this paper, we present approaches for exploiting data parallelism in XML processing pipelines through novel compilation strategies to the MapReduce framework. Pipelines in our approach consist of sequences of processing steps that receive XML-structured data and produce, often through calls to "black-box" (scientific) functions, modified (i.e., updated) XML structures. Our main contributions are (i) the development of a set of strategies for compiling scientific workflows, modeled as XML processing pipelines, into parallel MapReduce networks, and (ii) a discussion of their advantages and trade-offs, based on a thorough experimental evaluation of the various translation strategies. Our evaluation uses the Hadoop MapReduce system as an implementation platform. Our results show that execution times of XML workflow pipelines can be significantly reduced using our compilation strategies. These efficiency gains, together with the benefits of MapReduce (e.g., fault tolerance), make our approach ideal for executing large-scale, compute-intensive XML-based scientific workflows.
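The general idea of running one XML pipeline step as a map and a reduce phase can be illustrated with a minimal, single-process sketch. It is not one of the paper's compilation strategies and does not use Hadoop; the element names (`data`, `collection`, `item`), the `black_box` function, and all helper names are hypothetical and chosen only for illustration. The sketch splits an XML document into independent fragments, applies an opaque function to each fragment in a map step, groups the results by their parent collection, and rebuilds the updated document in a reduce step.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def black_box(value: float) -> float:
    """Hypothetical stand-in for an opaque scientific function."""
    return value * value


def split(document_xml: str):
    """Cut the document into (collection id, position, fragment) records."""
    root = ET.fromstring(document_xml)
    position = 0
    for collection in root.findall("collection"):
        for item in collection.findall("item"):
            yield collection.get("id"), position, ET.tostring(item, encoding="unicode")
            position += 1


def map_phase(record):
    """Map: update one fragment independently of all others."""
    group_id, position, fragment_xml = record
    item = ET.fromstring(fragment_xml)
    item.set("result", str(black_box(float(item.text))))
    # Emit the collection id as key so siblings meet again in one reducer.
    yield group_id, (position, ET.tostring(item, encoding="unicode"))


def reduce_phase(group_id, values):
    """Reduce: restore document order and rebuild one collection."""
    collection = ET.Element("collection", {"id": group_id})
    for _, fragment_xml in sorted(values):
        collection.append(ET.fromstring(fragment_xml))
    return collection


def run_step(document_xml: str) -> str:
    """Simulate split, map, shuffle (group by key), and reduce in one process."""
    grouped = defaultdict(list)
    for record in split(document_xml):
        for key, value in map_phase(record):
            grouped[key].append(value)
    output = ET.Element("data")
    for key in sorted(grouped):
        output.append(reduce_phase(key, grouped[key]))
    return ET.tostring(output, encoding="unicode")


DOC = (
    '<data>'
    '<collection id="a"><item>1</item><item>2</item></collection>'
    '<collection id="b"><item>3</item></collection>'
    '</data>'
)

if __name__ == "__main__":
    print(run_step(DOC))
```

Because each call to `map_phase` touches only its own fragment, the calls could in principle run on different workers; the grouping key is what lets the reduce step reassemble the nested XML structure afterwards.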

Original language: English (US)
Pages (from-to): 447-463
Number of pages: 17
Journal: Journal of Computer and System Sciences
Issue number: 6
State: Published - 2010
Externally published: Yes


Keywords

  • Collection-Oriented Modeling and Design (COMAD)
  • Data stream processing
  • Grouping
  • MapReduce
  • Parallelization
  • Static analysis
  • Virtual Data Assembly Line (VDAL)
  • XML processing pipelines

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)
  • Computer Networks and Communications
  • Computational Theory and Mathematics
  • Applied Mathematics

