TY - GEN
T1 - Analyzing reliability of memory sub-systems with double-chipkill detect/correct
AU - Jian, Xun
AU - DeBardeleben, Nathan
AU - Blanchard, Sean
AU - Sridharan, Vilas
AU - Kumar, Rakesh
PY - 2013
Y1 - 2013
N2 - Chipkill correct is an advanced type of error correction used in memory sub-systems. Existing analytical approaches for modeling the reliability of memory sub-systems with chipkill correct are limited to those with chipkill-correct solutions that guarantee correction of errors in a single DRAM device. However, stronger chipkill-correct solutions that are capable of guaranteeing the detection and even correction of errors in up to two DRAM devices have become common in existing HPC systems. Analytical reliability models are needed for such memory sub-systems. This paper proposes analytical models for the reliability of double-chipkill detect and/or correct. Validation against Monte Carlo simulations shows that the outputs of our analytical models are within 3.9% of Monte Carlo simulations, on average. We used the analytical models to study various aspects of the reliability of memory sub-systems protected by double-chipkill detect and/or correct. Our studies provide several insights into the dependence of reliability of these systems on scale, device fault rate, memory organization, and memory-scrubbing policy.
AB - Chipkill correct is an advanced type of error correction used in memory sub-systems. Existing analytical approaches for modeling the reliability of memory sub-systems with chipkill correct are limited to those with chipkill-correct solutions that guarantee correction of errors in a single DRAM device. However, stronger chipkill-correct solutions that are capable of guaranteeing the detection and even correction of errors in up to two DRAM devices have become common in existing HPC systems. Analytical reliability models are needed for such memory sub-systems. This paper proposes analytical models for the reliability of double-chipkill detect and/or correct. Validation against Monte Carlo simulations shows that the outputs of our analytical models are within 3.9% of Monte Carlo simulations, on average. We used the analytical models to study various aspects of the reliability of memory sub-systems protected by double-chipkill detect and/or correct. Our studies provide several insights into the dependence of reliability of these systems on scale, device fault rate, memory organization, and memory-scrubbing policy.
KW - chipkill correct
KW - error correcting codes
KW - memory errors
KW - modeling
KW - reliability
UR - http://www.scopus.com/inward/record.url?scp=84906764050&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84906764050&partnerID=8YFLogxK
U2 - 10.1109/PRDC.2013.18
DO - 10.1109/PRDC.2013.18
M3 - Conference contribution
AN - SCOPUS:84906764050
SN - 9780769551302
T3 - Proceedings of IEEE Pacific Rim International Symposium on Dependable Computing, PRDC
SP - 88
EP - 97
BT - Proceedings - 2013 IEEE 19th Pacific Rim International Symposium on Dependable Computing, PRDC 2013
PB - IEEE Computer Society
T2 - 19th IEEE Pacific Rim International Symposium on Dependable Computing, PRDC 2013
Y2 - 2 December 2013 through 4 December 2013
ER -