Making massive computational experiments painless

Hatef Monajemi, David L. Donoho, Victoria Stodden

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


The increasing availability of access to large-scale computing clusters, for example via the cloud, is changing the way scientific research can be conducted, enabling experiments of a scale and scope that would have been inconceivable several years ago. An ambitious data scientist today can carry out projects involving several million CPU hours. In the near future, we anticipate a typical Ph.D. in computational science may be expected or even required to offer findings based on at least 1 million CPU hours of computations. The massive scale of these soon-to-be-upon-us computational experiments demands that we change how we organize our experimental practices. Traditionally, and still in the dominant paradigm today, the end-to-end process of experiment design and execution involves a significant amount of manual intervention and situational tweaking, cutting and pasting, and the use of disparate disconnected tools, much of which is undocumented and easily lost. This makes it difficult to detect and understand possible failure points in the computational workflow, and virtually impossible to correct, let alone simply rerun, the experiment. This is an amazing state of affairs, considering the ubiquity of error in scientific computation and in research generally. Following such unstructured and undocumented research practices limits the ability of the researcher to exploit cluster and cloud-based paradigms, as each increase in scale under the dominant paradigm is likely to lead to ever more errors and misunderstandings. A better paradigm will integrate the design of large experiments seamlessly with job management, output harvesting, data analysis, reporting, and publication of code and data. In particular, such a paradigm would submerge the details of all the processing, harvesting, and management while exposing transparently the description of the discovery process itself, including details such as the parameter space exploration.
Reproducing any job would be a push-button affair, and creating a new experiment from a previous one might involve only changes of a line or two of code followed again by push-button execution and reporting. Even though such experiments would be operating at a much greater scale than today, under such a paradigm they would be easier to conduct, obtain a lower error rate, and offer a much greater opportunity for 'outsiders' to understand the results. In this article, we discuss the challenges of massive computational experimentation and present a taxonomy of some of the desiderata which such paradigms should offer. We then present ClusterJob (CJ), an efficient computing environment that we and other researchers have used to conduct and share million-CPU-hour experiments in a painless and reproducible way.

Original language: English (US)
Title of host publication: Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016
Editors: Ronay Ak, George Karypis, Yinglong Xia, Xiaohua Tony Hu, Philip S. Yu, James Joshi, Lyle Ungar, Ling Liu, Aki-Hiro Sato, Toyotaro Suzumura, Sudarsan Rachuri, Rama Govindaraju, Weijia Xu
Publisher: Institute of Electrical and Electronics Engineers Inc.
Number of pages: 6
ISBN (Electronic): 9781467390040
State: Published - Jan 1 2016
Event: 4th IEEE International Conference on Big Data, Big Data 2016 - Washington, United States
Duration: Dec 5 2016 to Dec 8 2016

Publication series

Name: Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016


Other: 4th IEEE International Conference on Big Data, Big Data 2016
Country/Territory: United States


  • Big Data
  • High-throughput Computing
  • Reproducible Research

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
  • Hardware and Architecture

