TY - GEN
T1 - Towards automated prediction of relationships among scientific datasets
AU - Alawini, Abdussalam
AU - Maier, David
AU - Tufte, Kristin
AU - Howe, Bill
AU - Nandikur, Rashmi
N1 - Publisher Copyright: © 2015 ACM.
PY - 2015/6/29
Y1 - 2015/6/29
N2 - Before scientists can analyze, publish, or share their data, they often need to determine how their datasets are related. Determining relationships helps scientists identify the most complete version of a dataset, detect versions of datasets that complement each other, and determine multiple datasets that overlap. In previous work, we showed how observable relationships between two datasets help scientists recall their original derivation connection. While that work helped with identifying relationships between two datasets, it is infeasible for scientists to use it for finding relationships between all possible pairs in a large collection of datasets. In order to deal with larger numbers of datasets, we are extending our methodology with a relationship-prediction system, ReDiscover, a tool to identify pairs from a collection of datasets that are most likely related and the relationship between them. We report on the initial design of ReDiscover, which applies machine-learning methods such as Conditional Random Fields and Support Vector Machines to the relationship-discovery problem. Our preliminary evaluation shows that ReDiscover predicted relationships with an average accuracy of 87%.
AB - Before scientists can analyze, publish, or share their data, they often need to determine how their datasets are related. Determining relationships helps scientists identify the most complete version of a dataset, detect versions of datasets that complement each other, and determine multiple datasets that overlap. In previous work, we showed how observable relationships between two datasets help scientists recall their original derivation connection. While that work helped with identifying relationships between two datasets, it is infeasible for scientists to use it for finding relationships between all possible pairs in a large collection of datasets. In order to deal with larger numbers of datasets, we are extending our methodology with a relationship-prediction system, ReDiscover, a tool to identify pairs from a collection of datasets that are most likely related and the relationship between them. We report on the initial design of ReDiscover, which applies machine-learning methods such as Conditional Random Fields and Support Vector Machines to the relationship-discovery problem. Our preliminary evaluation shows that ReDiscover predicted relationships with an average accuracy of 87%.
UR - http://www.scopus.com/inward/record.url?scp=84959460159&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84959460159&partnerID=8YFLogxK
U2 - 10.1145/2791347.2791358
DO - 10.1145/2791347.2791358
M3 - Conference contribution
AN - SCOPUS:84959460159
T3 - ACM International Conference Proceeding Series
BT - SSDBM 2015 - Proceedings of the 27th International Conference on Scientific and Statistical Database Management
A2 - Gupta, Amarnath
A2 - Rathbun, Susan
PB - Association for Computing Machinery
T2 - 27th International Conference on Scientific and Statistical Database Management, SSDBM 2015
Y2 - 29 June 2015 through 1 July 2015
ER -