TY - CONF
T1 - Detailed provenance metadata from statistical analysis software
AU - Alter, George
AU - Gager, Jack
AU - Heus, Pascal
AU - Hunter, Carson
AU - Ionescu, Sanda
AU - Iverson, Jeremy
AU - Jagadish, H. V.
AU - Ludaescher, Bertram
AU - Lyle, Jared
AU - McPhillips, Timothy
AU - Mueller, Alexander
AU - Nordgaard, Sigve
AU - Risnes, Ørnulf
AU - Smith, Dan
AU - Song, Jie
AU - Thelen, Thomas
N1 - Funding Information:
"Continuous Capture of Metadata for Statistical Data Project" is funded by National Science Foundation grant ACI-1640575. "Merging Science and Cyberinfrastructure Pathways: The Whole Tale" is funded by National Science Foundation grant OAC 1541450.
Publisher Copyright:
© 2021 TaPP 2021 - 13th International Workshop on Theory and Practice of Provenance. All rights reserved.
PY - 2021
Y1 - 2021
N2 - We have created a set of tools for automating the extraction of fine-grained provenance from statistical analysis software used for data management. Our tools create metadata about steps within programs and variables (columns) within dataframes in a way consistent with the ProvONE extension of the PROV model. Scripts from the most widely used statistical analysis programs are translated into Structured Data Transformation Language (SDTL), an intermediate language for describing data transformation commands. SDTL can be queried to create histories of each variable in a dataset. For example, we can ask, "Which commands modified variable X?" or "Which variables were affected by variable Y?" SDTL was created to solve several problems. First, researchers are divided among a number of mutually unintelligible statistical languages. SDTL serves as a lingua franca providing a common language for downstream applications. Second, SDTL is a structured language that can be serialized in JSON, XML, RDF, and other formats. Applications can read SDTL without specialized parsing, and relationships among elements in SDTL are not defined by an external grammar. Third, SDTL provides general descriptions for operations that are handled differently in each language. For example, the SDTL MergeDatasets command describes both earlier languages (SPSS, SAS, Stata), in which merging is based on sequentially sorted files, and recent languages (R, Python) modelled on SQL. In addition, we have developed a flexible tool that translates SDTL into natural language. Our tools also embed variable histories including both SDTL and natural language translations into standard metadata files, such as Data Documentation Initiative (DDI) and Ecological Metadata Language (EML), which are used by data repositories to inform data catalogs, data discovery services, and codebooks. Thus, users can receive detailed information about the effects of data transformation programs without understanding the language in which they were written.
AB - We have created a set of tools for automating the extraction of fine-grained provenance from statistical analysis software used for data management. Our tools create metadata about steps within programs and variables (columns) within dataframes in a way consistent with the ProvONE extension of the PROV model. Scripts from the most widely used statistical analysis programs are translated into Structured Data Transformation Language (SDTL), an intermediate language for describing data transformation commands. SDTL can be queried to create histories of each variable in a dataset. For example, we can ask, "Which commands modified variable X?" or "Which variables were affected by variable Y?" SDTL was created to solve several problems. First, researchers are divided among a number of mutually unintelligible statistical languages. SDTL serves as a lingua franca providing a common language for downstream applications. Second, SDTL is a structured language that can be serialized in JSON, XML, RDF, and other formats. Applications can read SDTL without specialized parsing, and relationships among elements in SDTL are not defined by an external grammar. Third, SDTL provides general descriptions for operations that are handled differently in each language. For example, the SDTL MergeDatasets command describes both earlier languages (SPSS, SAS, Stata), in which merging is based on sequentially sorted files, and recent languages (R, Python) modelled on SQL. In addition, we have developed a flexible tool that translates SDTL into natural language. Our tools also embed variable histories including both SDTL and natural language translations into standard metadata files, such as Data Documentation Initiative (DDI) and Ecological Metadata Language (EML), which are used by data repositories to inform data catalogs, data discovery services, and codebooks. Thus, users can receive detailed information about the effects of data transformation programs without understanding the language in which they were written.
UR - http://www.scopus.com/inward/record.url?scp=85114270006&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85114270006&partnerID=8YFLogxK
M3 - Paper
AN - SCOPUS:85114270006
T2 - 13th International Workshop on Theory and Practice of Provenance, TaPP 2021
Y2 - 19 July 2021 through 20 July 2021
ER -