TY - CHAP
T1 - Collaborative data analytics with DataHub
AU - Bhardwaj, Anant
AU - Deshpande, Amol
AU - Elmore, Aaron J.
AU - Karger, David
AU - Madden, Sam
AU - Parameswaran, Aditya
AU - Subramanyam, Harihar
AU - Wu, Eugene
AU - Zhang, Rebecca
PY - 2015
Y1 - 2015
N2 - While there have been many solutions proposed for storing and analyzing large volumes of data, all of these solutions have limited support for collaborative data analytics, especially given that many individuals and teams are simultaneously analyzing, modifying, and exchanging datasets, employing a number of heterogeneous tools or languages for data analysis, and writing scripts to clean, preprocess, or query data. We demonstrate DataHub, a unified platform with the ability to load, store, query, collaboratively analyze, interactively visualize, interface with external applications, and share datasets. We will demonstrate the following aspects of the DataHub platform: (a) flexible data storage, sharing, and native versioning capabilities: multiple conference attendees can concurrently update the database, browse the different versions, and inspect conflicts; (b) an app ecosystem that hosts apps for various data-processing activities: conference attendees will be able to effortlessly ingest, query, and visualize data using our existing apps; (c) Thrift-based data serialization that permits data analysis in any combination of 20+ languages, with DataHub as the common data store: conference attendees will be able to analyze datasets in R, Python, and Matlab, while the inputs and the results are still stored in DataHub. In particular, conference attendees will be able to use the DataHub notebook, an IPython-based notebook for analyzing data and storing the results of data analysis.
AB - While there have been many solutions proposed for storing and analyzing large volumes of data, all of these solutions have limited support for collaborative data analytics, especially given that many individuals and teams are simultaneously analyzing, modifying, and exchanging datasets, employing a number of heterogeneous tools or languages for data analysis, and writing scripts to clean, preprocess, or query data. We demonstrate DataHub, a unified platform with the ability to load, store, query, collaboratively analyze, interactively visualize, interface with external applications, and share datasets. We will demonstrate the following aspects of the DataHub platform: (a) flexible data storage, sharing, and native versioning capabilities: multiple conference attendees can concurrently update the database, browse the different versions, and inspect conflicts; (b) an app ecosystem that hosts apps for various data-processing activities: conference attendees will be able to effortlessly ingest, query, and visualize data using our existing apps; (c) Thrift-based data serialization that permits data analysis in any combination of 20+ languages, with DataHub as the common data store: conference attendees will be able to analyze datasets in R, Python, and Matlab, while the inputs and the results are still stored in DataHub. In particular, conference attendees will be able to use the DataHub notebook, an IPython-based notebook for analyzing data and storing the results of data analysis.
UR - http://www.scopus.com/inward/record.url?scp=84953870023&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84953870023&partnerID=8YFLogxK
U2 - 10.14778/2824032.2824100
DO - 10.14778/2824032.2824100
M3 - Chapter
C2 - 26844007
AN - SCOPUS:84953870023
T3 - Proceedings of the VLDB Endowment
SP - 1916
EP - 1919
BT - Proceedings of the VLDB Endowment
PB - Association for Computing Machinery
T2 - 3rd Workshop on Spatio-Temporal Database Management, STDBM 2006, Co-located with the 32nd International Conference on Very Large Data Bases, VLDB 2006
Y2 - 11 September 2006 through 11 September 2006
ER -