TY - JOUR
T1 - Error Characterization, Mitigation, and Recovery in Flash-Memory-Based Solid-State Drives
AU - Cai, Yu
AU - Ghose, Saugata
AU - Haratsch, Erich F.
AU - Luo, Yixin
AU - Mutlu, Onur
N1 - Funding Information:
Manuscript received December 19, 2016; revised March 21, 2017; accepted April 20, 2017. Date of current version August 18, 2017. This work is partially supported by the CMU Data Storage Systems Center, the Intel Science and Technology Center, the NSF, and generous donations from various industrial partners, especially Intel and Seagate. (Corresponding author: Onur Mutlu.) Y. Cai, S. Ghose, and Y. Luo are with Carnegie Mellon University, Pittsburgh, PA 15213 USA. E. F. Haratsch is with Seagate Technology, Fremont, CA 94538 USA. O. Mutlu is with ETH Zurich, 8092 Zurich, Switzerland (e-mail: omutlu@gmail.com).
Publisher Copyright:
© 1963-2012 IEEE.
PY - 2017/9
Y1 - 2017/9
AB - NAND flash memory is ubiquitous in everyday life today because its capacity has continuously increased and cost has continuously decreased over decades. This positive growth is a result of two key trends: 1) effective process technology scaling; and 2) multi-level (e.g., MLC, TLC) cell data coding. Unfortunately, the reliability of raw data stored in flash memory has also continued to become more difficult to ensure, because these two trends lead to 1) fewer electrons in the flash memory cell floating gate to represent the data; and 2) larger cell-to-cell interference and disturbance effects. Without mitigation, worsening reliability can reduce the lifetime of NAND flash memory. As a result, flash memory controllers in solid-state drives (SSDs) have become much more sophisticated: they incorporate many effective techniques to ensure the correct interpretation of noisy data stored in flash memory cells. In this article, we review recent advances in SSD error characterization, mitigation, and data recovery techniques for reliability and lifetime improvement. We provide rigorous experimental data from state-of-the-art MLC and TLC NAND flash devices on various types of flash memory errors, to motivate the need for such techniques. Based on the understanding developed by the experimental characterization, we describe several mitigation and recovery techniques, including 1) cell-to-cell interference mitigation; 2) optimal multi-level cell sensing; 3) error correction using state-of-the-art algorithms and methods; and 4) data recovery when error correction fails. We quantify the reliability improvement provided by each of these techniques. Looking forward, we briefly discuss how flash memory and these techniques could evolve into the future.
KW - Data storage systems
KW - error recovery
KW - fault tolerance
KW - flash memory
KW - reliability
KW - solid-state drives
UR - http://www.scopus.com/inward/record.url?scp=85029587112&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85029587112&partnerID=8YFLogxK
U2 - 10.1109/JPROC.2017.2713127
DO - 10.1109/JPROC.2017.2713127
M3 - Article
AN - SCOPUS:85029587112
VL - 105
SP - 1666
EP - 1704
JO - Proceedings of the IEEE
JF - Proceedings of the IEEE
SN - 0018-9219
IS - 9
M1 - 8013174
ER -