Clowder: Open source data management for long tail data

Luigi Marini, Sandeep Puthanveetil Satheesan, Todd Nicholson, Indira Gutierrez-Polo, Maxwell Burnette, Yan Zhao, Rob Kooper, Jong Sung Lee, Kenton Guadron McHenry

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Clowder is an open source data management system to support data curation of long tail data and metadata across multiple research domains and diverse data types. Institutions and labs can install and customize their own instance of the framework on local hardware or on remote cloud computing resources to provide a shared service to distributed communities of researchers. Data can be ingested directly from instruments or manually uploaded by users and then shared with remote collaborators using a web front end. We discuss some of the challenges encountered in designing and developing a system that can be easily adapted to different scientific areas including digital preservation, geoscience, material science, medicine, social science, cultural heritage and the arts. Some of these challenges include support for large amounts of data, horizontal scaling of domain specific preprocessing algorithms, ability to provide new data visualizations in the web browser, a comprehensive Web service API for automatic data ingestion and curation, a suite of social annotation and metadata management features to support data annotation by communities of users and algorithms, and a web based front-end to interact with code running on heterogeneous clusters, including HPC resources.

Original language: English (US)
Title of host publication: Practice and Experience in Advanced Research Computing 2018
Subtitle of host publication: Seamless Creativity, PEARC 2018
Publisher: Association for Computing Machinery
ISBN (Print): 9781450364461
DOI: https://doi.org/10.1145/3219104.3219159
State: Published - Jul 22 2018
Event: 2018 Practice and Experience in Advanced Research Computing Conference: Seamless Creativity, PEARC 2018 - Pittsburgh, United States
Duration: Jul 22 2018 to Jul 26 2018

Publication series

Name: ACM International Conference Proceeding Series

Other

Other: 2018 Practice and Experience in Advanced Research Computing Conference: Seamless Creativity, PEARC 2018
Country: United States
City: Pittsburgh
Period: 7/22/18 to 7/26/18

Fingerprint

Metadata
Information management
Data visualization
Web browsers
Social sciences
Materials science
Cloud computing
Application programming interfaces (API)
Web services
Medicine
Hardware

Keywords

  • Data curation
  • Data management
  • Linked data
  • Metadata management
  • Scientific gateways

ASJC Scopus subject areas

  • Software
  • Human-Computer Interaction
  • Computer Vision and Pattern Recognition
  • Computer Networks and Communications

Cite this

Marini, L., Satheesan, S. P., Nicholson, T., Gutierrez-Polo, I., Burnette, M., Zhao, Y., ... McHenry, K. G. (2018). Clowder: Open source data management for long tail data. In Practice and Experience in Advanced Research Computing 2018: Seamless Creativity, PEARC 2018 [a40] (ACM International Conference Proceeding Series). Association for Computing Machinery. https://doi.org/10.1145/3219104.3219159

@inproceedings{eb8e4f03099a49fda7c7e11839def015,
title = "Clowder: Open source data management for long tail data",
abstract = "Clowder is an open source data management system to support data curation of long tail data and metadata across multiple research domains and diverse data types. Institutions and labs can install and customize their own instance of the framework on local hardware or on remote cloud computing resources to provide a shared service to distributed communities of researchers. Data can be ingested directly from instruments or manually uploaded by users and then shared with remote collaborators using a web front end. We discuss some of the challenges encountered in designing and developing a system that can be easily adapted to different scientific areas including digital preservation, geoscience, material science, medicine, social science, cultural heritage and the arts. Some of these challenges include support for large amounts of data, horizontal scaling of domain specific preprocessing algorithms, ability to provide new data visualizations in the web browser, a comprehensive Web service API for automatic data ingestion and curation, a suite of social annotation and metadata management features to support data annotation by communities of users and algorithms, and a web based front-end to interact with code running on heterogeneous clusters, including HPC resources.",
keywords = "Data curation, Data management, Linked data, Metadata management, Scientific gateways",
author = "Luigi Marini and Satheesan, {Sandeep Puthanveetil} and Todd Nicholson and Indira Gutierrez-Polo and Maxwell Burnette and Yan Zhao and Rob Kooper and Lee, {Jong Sung} and McHenry, {Kenton Guadron}",
year = "2018",
month = "7",
day = "22",
doi = "10.1145/3219104.3219159",
language = "English (US)",
isbn = "9781450364461",
series = "ACM International Conference Proceeding Series",
publisher = "Association for Computing Machinery",
booktitle = "Practice and Experience in Advanced Research Computing 2018",

}

TY - GEN

T1 - Clowder

T2 - Open source data management for long tail data

AU - Marini, Luigi

AU - Satheesan, Sandeep Puthanveetil

AU - Nicholson, Todd

AU - Gutierrez-Polo, Indira

AU - Burnette, Maxwell

AU - Zhao, Yan

AU - Kooper, Rob

AU - Lee, Jong Sung

AU - McHenry, Kenton Guadron

PY - 2018/7/22

Y1 - 2018/7/22

AB - Clowder is an open source data management system to support data curation of long tail data and metadata across multiple research domains and diverse data types. Institutions and labs can install and customize their own instance of the framework on local hardware or on remote cloud computing resources to provide a shared service to distributed communities of researchers. Data can be ingested directly from instruments or manually uploaded by users and then shared with remote collaborators using a web front end. We discuss some of the challenges encountered in designing and developing a system that can be easily adapted to different scientific areas including digital preservation, geoscience, material science, medicine, social science, cultural heritage and the arts. Some of these challenges include support for large amounts of data, horizontal scaling of domain specific preprocessing algorithms, ability to provide new data visualizations in the web browser, a comprehensive Web service API for automatic data ingestion and curation, a suite of social annotation and metadata management features to support data annotation by communities of users and algorithms, and a web based front-end to interact with code running on heterogeneous clusters, including HPC resources.

KW - Data curation

KW - Data management

KW - Linked data

KW - Metadata management

KW - Scientific gateways

UR - http://www.scopus.com/inward/record.url?scp=85051432133&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85051432133&partnerID=8YFLogxK

U2 - 10.1145/3219104.3219159

DO - 10.1145/3219104.3219159

M3 - Conference contribution

AN - SCOPUS:85051432133

SN - 9781450364461

T3 - ACM International Conference Proceeding Series

BT - Practice and Experience in Advanced Research Computing 2018

PB - Association for Computing Machinery

ER -