Towards scalable reliability frameworks for error prone CMPs

Joseph Sloan, Rakesh Kumar

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

As technology scales and the energy of computation continually approaches thermal equilibrium [1,2], parameter variations and noise levels will lead to larger error rates at various levels of the computation stack. The error rates would be especially high for post-CMOS and nanoelectronic systems as well as for probabilistic [3] and stochastic architectures [4]. N-modular redundancy (NMR) at the core-level has been proposed as a way to attain system reliability goals for multicore architectures. While core-level DMR and TMR have been shown to be effective when errors are rare, a large amount of core-level redundancy will be required for attaining system reliability goals in face of high error rates. This makes voting latency and bandwidth significant performance bottlenecks for such systems. In this paper, we present a scalable NMR framework for error prone chip multiprocessors(CMPs). The framework supports in-network fault tolerance where voting logic is integrated into routers to allow for truly distributed voting. The in-network fault tolerance router utilizes the expected redundancy in vote messages, to reduce some of the blocking overhead incurred at the leader, and also provide a mechanism to trade-off network bandwidth with latency. Our framework also supports proactive checkpoint deallocation which allows cores participating in voting to continue on with execution instead of waiting on notification from the voting logic. Finally, the framework supports dynamic constitution that allows an arbitrary core on this chip to be a part of an NMR group. This allows bypassing faulty cores as well as scheduling for performance. Our experiments show significant performance/bandwidth benefits from these optimizations.

Original languageEnglish (US)
Title of host publicationEmbedded Systems Week 2009 - 2009 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, CASES'09
Pages261-270
Number of pages10
DOIs
StatePublished - Dec 21 2009
EventEmbedded Systems Week 2009, ESWEEK 2009 - 2009 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, CASES'09 - Grenoble, France
Duration: Oct 11 2009Oct 16 2009

Publication series

NameEmbedded Systems Week 2009 - 2009 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, CASES'09

Other

OtherEmbedded Systems Week 2009, ESWEEK 2009 - 2009 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, CASES'09
CountryFrance
CityGrenoble
Period10/11/0910/16/09

    Fingerprint

Keywords

  • Dynamic constitution
  • In-network fault tolerance

ASJC Scopus subject areas

  • Hardware and Architecture
  • Software
  • Electrical and Electronic Engineering

Cite this

Sloan, J., & Kumar, R. (2009). Towards scalable reliability frameworks for error prone CMPs. In Embedded Systems Week 2009 - 2009 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, CASES'09 (pp. 261-270). (Embedded Systems Week 2009 - 2009 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, CASES'09). https://doi.org/10.1145/1629395.1629432