Generating request streams on Big Data using clustered renewal processes

Cristina L. Abad, Mindi Yuan, Chris X. Cai, Yi Lu, Nathan Roberts, Roy H. Campbell

Research output: Contribution to journal (Article)

Abstract

The performance evaluation of large file systems, such as storage and media streaming, motivates scalable generation of representative traces. We focus on two key characteristics of traces, popularity and temporal locality. The common practice of using a system-wide distribution obscures per-object behavior, which is important for system evaluation. We propose a model based on delayed renewal processes which, by sampling interarrival times for each object, accurately reproduces popularity and temporal locality for the trace. A lightweight version reduces the dimension of the model with statistical clustering. It is workload-agnostic and object type-aware, suitable for testing emerging workloads and 'what-if' scenarios. We implemented a synthetic trace generator and validated it using: (1) a Big Data storage (HDFS) workload from Yahoo!, (2) a trace from a feature animation company, and (3) a streaming media workload. Two case studies in caching and replicated distributed storage systems show that our traces produce application-level results similar to the real workload. The trace generator is fast and readily scales to a system of 4.3 million files. It outperforms existing models in terms of accurately reproducing the characteristics of the real trace.
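
To illustrate the core idea in the abstract, the following is a minimal sketch, not the authors' generator. It assumes each object is modeled as a delayed renewal process: the time to an object's first request is drawn from one distribution, and every later interarrival time from another. The per-object event streams are merged in time order with a heap. The function name generate_trace, the sampler dictionaries, and the uniform/exponential distributions used in the example are all hypothetical choices for illustration; the abstract does not specify distributions, and the clustering step of the lightweight version is not shown.

```python
import heapq
import random
from typing import Callable, Dict, List, Tuple

# A sampler takes a random generator and returns a non-negative time gap.
Sampler = Callable[[random.Random], float]


def generate_trace(first_arrival: Dict[str, Sampler],
                   interarrival: Dict[str, Sampler],
                   duration: float,
                   seed: int = 0) -> List[Tuple[float, str]]:
    """Merge one delayed renewal process per object into a single trace.

    first_arrival[obj] samples the delay until the object's first request;
    interarrival[obj] samples every subsequent gap between its requests.
    Returns (timestamp, object) pairs sorted by timestamp.
    """
    rng = random.Random(seed)
    heap: List[Tuple[float, str]] = []

    # Schedule the first (delayed) request of every object.
    for obj, sample_first in first_arrival.items():
        t = sample_first(rng)
        if t < duration:
            heapq.heappush(heap, (t, obj))

    trace: List[Tuple[float, str]] = []
    while heap:
        # Emit the earliest pending request, then schedule that
        # object's next request by sampling its own interarrival time.
        t, obj = heapq.heappop(heap)
        trace.append((t, obj))
        nxt = t + interarrival[obj](rng)
        if nxt < duration:
            heapq.heappush(heap, (nxt, obj))
    return trace


# Hypothetical example: three objects with different request rates.
rates = {"hot": 5.0, "warm": 1.0, "cold": 0.1}
first = {o: (lambda rng: rng.uniform(0.0, 10.0)) for o in rates}
gaps = {o: (lambda rng, r=r: rng.expovariate(r)) for o, r in rates.items()}
trace = generate_trace(first, gaps, duration=100.0, seed=42)
```

In this sketch, an object's popularity follows from how often its own process fires within the trace duration, and its temporal locality follows from the shape of its interarrival distribution, which is why sampling per object rather than from one system-wide distribution preserves per-object behavior.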

Original language: English (US)
Pages (from-to): 704-719
Number of pages: 16
Journal: Performance Evaluation
Volume: 70
Issue number: 10
DOI: 10.1016/j.peva.2013.08.006
State: Published - Jan 1 2013

Fingerprint

Clustered Data
Renewal Process
Media streaming
Trace
Workload
Animation
Locality
Sampling
Testing
Generator
Streaming Media
Big data
File System
Industry
Caching
Data Storage
Storage System
Performance Evaluation
Distributed Systems

Keywords

  • Big Data
  • HDFS
  • Popularity
  • Storage
  • Temporal locality
  • Workload generation

ASJC Scopus subject areas

  • Software
  • Modeling and Simulation
  • Hardware and Architecture
  • Computer Networks and Communications

Cite this

Generating request streams on Big Data using clustered renewal processes. / Abad, Cristina L.; Yuan, Mindi; Cai, Chris X.; Lu, Yi; Roberts, Nathan; Campbell, Roy H.

In: Performance Evaluation, Vol. 70, No. 10, 01.01.2013, p. 704-719.

Abad, Cristina L. ; Yuan, Mindi ; Cai, Chris X. ; Lu, Yi ; Roberts, Nathan ; Campbell, Roy H. / Generating request streams on Big Data using clustered renewal processes. In: Performance Evaluation. 2013 ; Vol. 70, No. 10. pp. 704-719.
@article{79202ae71d304d489db5669324a08931,
title = "Generating request streams on Big Data using clustered renewal processes",
abstract = "The performance evaluation of large file systems, such as storage and media streaming, motivates scalable generation of representative traces. We focus on two key characteristics of traces, popularity and temporal locality. The common practice of using a system-wide distribution obscures per-object behavior, which is important for system evaluation. We propose a model based on delayed renewal processes which, by sampling interarrival times for each object, accurately reproduces popularity and temporal locality for the trace. A lightweight version reduces the dimension of the model with statistical clustering. It is workload-agnostic and object type-aware, suitable for testing emerging workloads and 'what-if' scenarios. We implemented a synthetic trace generator and validated it using: (1) a Big Data storage (HDFS) workload from Yahoo!, (2) a trace from a feature animation company, and (3) a streaming media workload. Two case studies in caching and replicated distributed storage systems show that our traces produce application-level results similar to the real workload. The trace generator is fast and readily scales to a system of 4.3 million files. It outperforms existing models in terms of accurately reproducing the characteristics of the real trace.",
keywords = "Big Data, HDFS, Popularity, Storage, Temporal locality, Workload generation",
author = "Abad, {Cristina L.} and Mindi Yuan and Cai, {Chris X.} and Yi Lu and Nathan Roberts and Campbell, {Roy H.}",
year = "2013",
month = "1",
day = "1",
doi = "10.1016/j.peva.2013.08.006",
language = "English (US)",
volume = "70",
pages = "704--719",
journal = "Performance Evaluation",
issn = "0166-5316",
publisher = "Elsevier",
number = "10",
}

TY - JOUR

T1 - Generating request streams on Big Data using clustered renewal processes

AU - Abad, Cristina L.

AU - Yuan, Mindi

AU - Cai, Chris X.

AU - Lu, Yi

AU - Roberts, Nathan

AU - Campbell, Roy H.

PY - 2013/1/1

Y1 - 2013/1/1

AB - The performance evaluation of large file systems, such as storage and media streaming, motivates scalable generation of representative traces. We focus on two key characteristics of traces, popularity and temporal locality. The common practice of using a system-wide distribution obscures per-object behavior, which is important for system evaluation. We propose a model based on delayed renewal processes which, by sampling interarrival times for each object, accurately reproduces popularity and temporal locality for the trace. A lightweight version reduces the dimension of the model with statistical clustering. It is workload-agnostic and object type-aware, suitable for testing emerging workloads and 'what-if' scenarios. We implemented a synthetic trace generator and validated it using: (1) a Big Data storage (HDFS) workload from Yahoo!, (2) a trace from a feature animation company, and (3) a streaming media workload. Two case studies in caching and replicated distributed storage systems show that our traces produce application-level results similar to the real workload. The trace generator is fast and readily scales to a system of 4.3 million files. It outperforms existing models in terms of accurately reproducing the characteristics of the real trace.

KW - Big Data

KW - HDFS

KW - Popularity

KW - Storage

KW - Temporal locality

KW - Workload generation

UR - http://www.scopus.com/inward/record.url?scp=84884700352&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84884700352&partnerID=8YFLogxK

U2 - 10.1016/j.peva.2013.08.006

DO - 10.1016/j.peva.2013.08.006

M3 - Article

AN - SCOPUS:84884700352

VL - 70

SP - 704

EP - 719

JO - Performance Evaluation

JF - Performance Evaluation

SN - 0166-5316

IS - 10

ER -