TY - GEN
T1 - Helping scientists ReConnect their datasets
AU - Alawini, Abdussalam
AU - Maier, David
AU - Howe, Bill
AU - Tufte, Kristin
PY - 2014
Y1 - 2014
N2 - It seems inevitable that the datasets associated with a research project proliferate over time: collaborators may extend datasets with new measurements and new attributes, new experimental runs result in new files with similar structures, and subsets of data are extracted for independent analysis. As these "residual" datasets begin to accrete over time, scientists can lose track of the derivation history that connects them, complicating data sharing, provenance tracking, and scientific reproducibility. In this paper, focusing on data in spreadsheets, we consider how observable relationships between two datasets can help scientists recall their original derivation connection. For instance, if dataset A is wholly contained in dataset B, B may be a more recent version of A and should be preferred when archiving or publishing. We articulate a space of relevant relationships, develop a set of algorithms for efficient discovery of these relationships, and organize these algorithms into a new system called ReConnect to assist scientists in relationship discovery. Our evaluation shows that existing approaches that rely on flagging differences between two spreadsheets are impractical for many relationship-discovery tasks, and a user study shows that ReConnect can improve scientists' ability to detect useful relationships and subsequently identify the best dataset for a given task.
AB - It seems inevitable that the datasets associated with a research project proliferate over time: collaborators may extend datasets with new measurements and new attributes, new experimental runs result in new files with similar structures, and subsets of data are extracted for independent analysis. As these "residual" datasets begin to accrete over time, scientists can lose track of the derivation history that connects them, complicating data sharing, provenance tracking, and scientific reproducibility. In this paper, focusing on data in spreadsheets, we consider how observable relationships between two datasets can help scientists recall their original derivation connection. For instance, if dataset A is wholly contained in dataset B, B may be a more recent version of A and should be preferred when archiving or publishing. We articulate a space of relevant relationships, develop a set of algorithms for efficient discovery of these relationships, and organize these algorithms into a new system called ReConnect to assist scientists in relationship discovery. Our evaluation shows that existing approaches that rely on flagging differences between two spreadsheets are impractical for many relationship-discovery tasks, and a user study shows that ReConnect can improve scientists' ability to detect useful relationships and subsequently identify the best dataset for a given task.
KW - Relationship identification
KW - Scientific data management
KW - Spreadsheets
UR - http://www.scopus.com/inward/record.url?scp=84904419767&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84904419767&partnerID=8YFLogxK
U2 - 10.1145/2618243.2618263
DO - 10.1145/2618243.2618263
M3 - Conference contribution
AN - SCOPUS:84904419767
SN - 9781450327220
T3 - ACM International Conference Proceeding Series
BT - SSDBM 2014 - Proceedings of the 26th International Conference on Scientific and Statistical Database Management
PB - Association for Computing Machinery
T2 - 26th International Conference on Scientific and Statistical Database Management, SSDBM 2014
Y2 - 30 June 2014 through 2 July 2014
ER -