Detailed provenance metadata from statistical analysis software

George Alter, Jack Gager, Pascal Heus, Carson Hunter, Sanda Ionescu, Jeremy Iverson, H. V. Jagadish, Bertram Ludaescher, Jared Lyle, Timothy McPhillips, Alexander Mueller, Sigve Nordgaard, Ørnulf Risnes, Dan Smith, Jie Song, Thomas Thelen

Research output: Contribution to conference › Paper › peer-review


We have created a set of tools for automating the extraction of fine-grained provenance from statistical analysis software used for data management. Our tools create metadata about steps within programs and variables (columns) within data frames in a way consistent with the ProvONE extension of the PROV model. Scripts from the most widely used statistical analysis programs are translated into Structured Data Transformation Language (SDTL), an intermediate language for describing data transformation commands. SDTL can be queried to create histories of each variable in a dataset. For example, we can ask, "Which commands modified variable X?" or "Which variables were affected by variable Y?" SDTL was created to solve several problems. First, researchers are divided among a number of mutually unintelligible statistical languages. SDTL serves as a lingua franca, providing a common language for downstream applications. Second, SDTL is a structured language that can be serialized in JSON, XML, RDF, and other formats. Applications can read SDTL without specialized parsing, and relationships among elements in SDTL are not defined by an external grammar. Third, SDTL provides general descriptions for operations that are handled differently in each language. For example, the SDTL MergeDatasets command describes both earlier languages (SPSS, SAS, Stata), in which merging is based on sequentially sorted files, and recent languages (R, Python) modelled on SQL. In addition, we have developed a flexible tool that translates SDTL into natural language. Our tools also embed variable histories, including both SDTL and natural language translations, into standard metadata files, such as Data Documentation Initiative (DDI) and Ecological Metadata Language (EML), which are used by data repositories to inform data catalogs, data discovery services, and codebooks. Thus, users can receive detailed information about the effects of data transformation programs without understanding the language in which they were written.
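The two example queries in the abstract can be illustrated with a minimal sketch. It assumes a heavily simplified, hypothetical JSON layout in which each command lists the variables it consumes and produces; the field names (`command`, `source`, `consumes`, `produces`) and the helper functions are illustrative assumptions, not the actual SDTL schema or the authors' tools.

```python
import json

# Hypothetical, simplified command list in program order.
# Field names are illustrative assumptions, NOT the real SDTL schema.
SCRIPT_JSON = """
[
  {"command": "Compute", "source": "gen income = wages + tips",
   "consumes": ["wages", "tips"], "produces": ["income"]},
  {"command": "Compute", "source": "gen log_income = log(income)",
   "consumes": ["income"], "produces": ["log_income"]},
  {"command": "Recode", "source": "recode age (0/17=1) (18/max=2)",
   "consumes": ["age"], "produces": ["age"]}
]
"""


def commands_modifying(commands, variable):
    """Answer "Which commands modified variable X?"."""
    return [c for c in commands if variable in c["produces"]]


def variables_affected_by(commands, variable):
    """Answer "Which variables were affected by variable Y?".

    A single forward pass in program order: any command that consumes a
    tainted variable taints everything it produces. This simplification
    ignores the case where a variable is later overwritten from
    unrelated inputs.
    """
    tainted = {variable}
    affected = set()
    for c in commands:
        if tainted & set(c["consumes"]):
            affected |= set(c["produces"])
            tainted |= set(c["produces"])
    return affected


commands = json.loads(SCRIPT_JSON)
for c in commands_modifying(commands, "income"):
    print(c["source"])
print(sorted(variables_affected_by(commands, "wages")))
```

Because the commands are plain structured data, both queries reduce to ordinary list and set operations, with no parser for the original statistical language required.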

Original language: English (US)
State: Published - 2021
Event: 13th International Workshop on Theory and Practice of Provenance, TaPP 2021 - Virtual, Online
Duration: Jul 19 2021 to Jul 20 2021



ASJC Scopus subject areas

  • General Computer Science

