Abstract
Servers and HPC systems often use a strong memory error correction code, or ECC, to meet their reliability and availability requirements. However, these ECCs often require significant capacity and/or power overheads. We observe that since memory channels are independent of one another, error correction typically needs to be performed for only one channel at a time. Based on this observation, we show that instead of always storing the actual ECC correction bits in memory, as existing systems do, it is sufficient to store the bitwise parity of the ECC correction bits of different channels for fault-free memory regions, and to store the actual ECC correction bits only for faulty memory regions. By trading off the resultant ECC capacity overhead reduction for improved memory energy efficiency, the proposed technique reduces memory energy per instruction by 54.4% and 20.6% compared to a commercial chipkill-correct ECC and a DIMM-kill-correct ECC, respectively, while incurring similar or lower capacity overheads.
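The parity idea in the abstract can be illustrated with a small Python sketch. It is not the paper's implementation: `correction_bits` is a hypothetical stand-in that uses CRC-32 purely as a deterministic function of the data (CRC-32 is not an error-correcting code), and the recovery step assumes that only one channel is faulty at a time and that the other channels' data can be read back intact. Under those assumptions, XOR-ing the stored parity with the recomputed correction bits of the healthy channels yields the correction bits of the faulty channel.

```python
import zlib
from functools import reduce
from typing import Sequence


def correction_bits(data: bytes) -> bytes:
    """Hypothetical stand-in for per-channel ECC correction bits.

    Assumption: CRC-32 is used only because it is deterministic and
    fixed-width; a real system would use a true correcting code.
    """
    return zlib.crc32(data).to_bytes(4, "big")


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_of_correction_bits(channel_data: Sequence[bytes]) -> bytes:
    """Store one XOR parity instead of every channel's correction bits."""
    return reduce(xor_bytes, (correction_bits(d) for d in channel_data))


def recover_correction_bits(
    parity: bytes, channel_data: Sequence[bytes], faulty: int
) -> bytes:
    """Rebuild the faulty channel's correction bits from the parity.

    Assumes every channel other than `faulty` holds intact data.
    """
    result = parity
    for index, data in enumerate(channel_data):
        if index != faulty:
            result = xor_bytes(result, correction_bits(data))
    return result


# Four channels of fault-free data (illustrative values only).
channels = [b"channel-0", b"channel-1", b"channel-2", b"channel-3"]
parity = parity_of_correction_bits(channels)

# Channel 2 later returns corrupted data; its original correction bits
# are recovered from the parity and the three healthy channels.
observed = list(channels)
observed[2] = b"corrupted"
recovered = recover_correction_bits(parity, observed, faulty=2)
```

In this toy setup the fault-free region stores 4 bytes of parity rather than 16 bytes of per-channel correction bits, which is the kind of capacity saving the abstract describes trading for energy efficiency.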
| Original language | English (US) |
|---|---|
| Article number | 7013071 |
| Pages (from-to) | 1035-1046 |
| Number of pages | 12 |
| Journal | International Conference for High Performance Computing, Networking, Storage and Analysis, SC |
| Volume | 2015-January |
| Issue number | January |
| State | Published - Jan 16 2014 |
| Event | International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2014 - New Orleans, United States; Duration: Nov 16 2014 → Nov 21 2014 |
ASJC Scopus subject areas
- Computer Networks and Communications
- Computer Science Applications
- Hardware and Architecture
- Software