Monitoring large systems via statistical sampling

Celso L. Mendes, Daniel A. Reed

Research output: Contribution to journal › Article › peer-review

Abstract

As the trend in parallel systems scales toward petaflop performance tapped by advances in circuit density and by an increasingly available computational Grid, the development of efficient mechanisms for monitoring large systems becomes imperative. When computational components are coupled via dynamically shifting connections with various remote resources, the number of potential factors affecting system behavior is enormous. Yet the overhead of monitoring can be prohibitive. In this paper we present a new technique for monitoring large systems based on statistical sampling. Rather than monitoring each component, we select a statistically valid sample and measure the behavior of sample members. We describe the formal requirements of sample selection and verify the feasibility of our approach with experiments on large parallel systems and wide-area networks. Our results show that this technique can be a powerful tool to enable effective monitoring without incurring the large costs typically associated with exhaustive checking.
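As an illustration of the sampling idea described in the abstract, the following minimal sketch computes how many components to monitor in order to estimate a population mean within a given margin of error. It uses the standard textbook sample-size formula with a finite population correction, which is an assumption here and not necessarily the formulation used in the paper. The names `sample_size` and `sampled_mean`, and the `measure` callback, are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean


def sample_size(population, sigma, margin, confidence=0.95):
    """Number of components to sample so that the estimated mean lies
    within +/- margin of the true mean at the given confidence level.

    Assumes an approximately normal metric with known (or previously
    estimated) standard deviation sigma. Textbook formula, not the
    paper's own derivation.
    """
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # Sample size for an effectively infinite population.
    n0 = (z * sigma / margin) ** 2
    # Finite population correction: fewer samples are needed when the
    # sample is a noticeable fraction of the whole system.
    n = n0 / (1.0 + (n0 - 1.0) / population)
    return min(population, math.ceil(n))


def sampled_mean(nodes, measure, n, rng):
    """Measure only n randomly chosen nodes and return their mean.

    measure is a hypothetical callback returning one metric per node.
    """
    chosen = rng.sample(nodes, n)  # simple random sample, no replacement
    return mean(measure(node) for node in chosen)


if __name__ == "__main__":
    rng = random.Random(0)
    nodes = list(range(10_000))
    n = sample_size(len(nodes), sigma=10.0, margin=1.0)
    # Synthetic per-node metric, used only to exercise the sketch.
    estimate = sampled_mean(nodes, lambda _: rng.gauss(50.0, 10.0), n, rng)
    print(n, estimate)
```

Under these assumptions the required sample grows very slowly with system size, which is why sampling can stay cheap even when exhaustive monitoring is not.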

Original language: English (US)
Pages (from-to): 267-277
Number of pages: 11
Journal: International Journal of High Performance Computing Applications
Volume: 18
Issue number: 2 SPEC. ISS.
State: Published - Jun 2004

Keywords

  • Large systems
  • Performance monitoring
  • Statistical sampling

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
