ECC Parity: A Technique for Efficient Memory Error Resilience for Multi-Channel Memory Systems

Xun Jian, Rakesh Kumar

Research output: Contribution to journalConference article

Abstract

Servers and HPC systems often use a strong memory error correction code, or ECC, to meet their reliability and availability requirements. However, these ECCs often require significant capacity and/or power overheads. We observe that since memory channels are independent from one another, error correction typically needs to be performed for one channel at a time. Based on this observation, we show that instead of always storing in memory the actual ECC correction bits as do existing systems, it is sufficient to store the bitwise parity of the ECC correction bits of different channels for fault-free memory regions, and store the actual ECC correction bits only for faulty memory regions. By trading off the resultant ECC capacity overhead reduction for improved memory energy efficiency, the proposed technique reduces memory energy per instruction by 54.4% and 20.6%, respectively, compared to a commercial chip kill correct ECC and a DIMM-kill correct ECC, while incurring similar or lower capacity overheads.

Original languageEnglish (US)
Article number7013071
Pages (from-to)1035-1046
Number of pages12
JournalInternational Conference for High Performance Computing, Networking, Storage and Analysis, SC
Volume2015-January
Issue numberJanuary
DOIs
StatePublished - Jan 16 2014
EventInternational Conference for High Performance Computing, Networking, Storage and Analysis, SC 2014 - New Orleans, United States
Duration: Nov 16 2014Nov 21 2014

    Fingerprint

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
  • Hardware and Architecture
  • Software

Cite this