Architectures for online error detection and recovery in multicore processors

Dimitris Gizopoulos, Mihalis Psarakis, Sarita V Adve, Pradeep Ramachandran, Siva Kumar Sastry Hari, Daniel Sorin, Albert Meixner, Arijit Biswas, Xavier Vera

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The huge investment in the design and production of multicore processors may be put at risk because emerging fabrication technologies, highly miniaturized but unreliable, will impose significant barriers to the life-long reliable operation of future chips. Extremely complex, massively parallel multicore processor chips fabricated in these technologies will become more vulnerable to: (a) environmental disturbances that produce transient (or soft) errors, (b) latent manufacturing defects as well as aging/wearout phenomena that produce permanent (or hard) errors, and (c) verification inefficiencies that allow important design bugs to escape into the system. In an effort to cope with these reliability threats, several research teams have recently proposed multicore processor architectures that provide low-cost dependability guarantees against hardware errors and design bugs. This paper focuses on dependable multicore processor architectures that integrate solutions for online error detection, diagnosis, recovery, and repair during field operation. It discusses a taxonomy of representative approaches and presents a qualitative comparison based on hardware cost, performance overhead, types of faults detected, and detection latency. It also describes in more detail three recently proposed effective architectural approaches: a software-anomaly detection technique (SWAT), a dynamic verification technique (Argus), and a core salvaging methodology.

Original language: English (US)
Title of host publication: Proceedings - Design, Automation and Test in Europe Conference and Exhibition, DATE 2011
Pages: 533-538
Number of pages: 6
State: Published - May 31 2011
Event: 14th Design, Automation and Test in Europe Conference and Exhibition, DATE 2011 - Grenoble, France
Duration: Mar 14 2011 → Mar 18 2011

Publication series

Name: Proceedings - Design, Automation and Test in Europe, DATE
ISSN (Print): 1530-1591

Other

Other: 14th Design, Automation and Test in Europe Conference and Exhibition, DATE 2011
Country/Territory: France
City: Grenoble
Period: 3/14/11 → 3/18/11

Keywords

  • dependable architectures
  • multicore microprocessors
  • online error detection/recovery/repair

ASJC Scopus subject areas

  • Engineering (all)
