Discovering similar workflows via provenance clustering: A case study

Abdussalam Alawini, Leshang Chen, Susan Davidson, Stephen Fisher, Junhyong Kim

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution


Several workflow management systems and scripting languages have adopted provenance tracking, yet many researchers still capture provenance manually or instrument their processing scripts to write provenance information to files. The Next Generation Sequencing (NGS) project we are associated with tracks provenance in this manner. The NGS project is a collaboration between multiple groups at different sites, where each group collects and processes samples using an agreed-upon workflow. The workflow contains many stages of varying complexity. Over time, workflow stages are modified, but data samples are only comparable when processed with identical versions of the workflow. However, for various reasons (including the distributed nature of the collaboration), it is not always clear which samples have been processed with which version of the workflow. In this paper, we introduce new techniques for clustering provenance datasets in order to discover those that are likely to have been generated by the same workflow. Based on the clustering result, users can identify similar provenance and categorize it into different clusters for debugging and zoom-in/zoom-out viewing.
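The abstract does not spell out the feature set or the clustering procedure, so the following is only a minimal sketch of the general idea, assuming each provenance file has already been reduced to a small numeric vector of structural features. The feature choice (stage count, edge count, longest path), the sample names, and the `kmeans` helper are all hypothetical and are not taken from the paper; the algorithm is plain Lloyd's k-means with a deterministic start.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, iterations=100):
    """Plain Lloyd's k-means over equal-length numeric tuples.

    Centroids start at the first k distinct points, so results are
    deterministic. Returns one cluster label per input point.
    """
    centroids = []
    for p in points:
        if p not in centroids:
            centroids.append(p)
        if len(centroids) == k:
            break
    if len(centroids) < k:
        raise ValueError("need at least k distinct points")

    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels


# Hypothetical structural features per provenance file (assumed, not from
# the paper): (number of stages, number of data-flow edges, longest path).
provenance_features = {
    "sample_a": (12, 15, 7),
    "sample_b": (12, 16, 7),
    "sample_c": (18, 25, 10),
    "sample_d": (19, 25, 10),
    "sample_e": (12, 15, 8),
}

names = list(provenance_features)
labels = kmeans([provenance_features[n] for n in names], k=2)

# Group sample names by cluster label.
clusters = {}
for name, label in zip(names, labels):
    clusters.setdefault(label, []).append(name)
```

In this toy input, samples whose feature vectors are close end up in the same cluster, which is the sense in which a cluster might correspond to one workflow version. In practice the number of clusters and the scaling of features would need to be chosen with care.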

Original language: English (US)
Title of host publication: Provenance and Annotation of Data and Processes - 7th International Provenance and Annotation Workshop, IPAW 2018, Proceedings
Editors: Khalid Belhajjame, Ashish Gehani, Pinar Alper
Number of pages: 13
ISBN (Print): 9783319983783
State: Published - 2018
Externally published: Yes
Event: 7th International Provenance and Annotation Workshop, IPAW 2018 - London, United Kingdom
Duration: Jul 9 2018 to Jul 10 2018

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11017 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Other: 7th International Provenance and Annotation Workshop, IPAW 2018
Country/Territory: United Kingdom


  • Clustering
  • Document classification
  • K-Means
  • Structural features
  • Workflow provenance

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science

