TY - CHAP
T1 - Reliability issues in flash-memory-based solid-state drives
T2 - Experimental analysis, mitigation, recovery
AU - Cai, Yu
AU - Ghose, Saugata
AU - Haratsch, Erich F.
AU - Luo, Yixin
AU - Mutlu, Onur
N1 - Funding Information:
Acknowledgements: The authors would like to thank Rino Micheloni for his helpful feedback on earlier drafts of the chapter. They would also like to thank Seagate for their continued dedicated support. Special thanks also go to our research group SAFARI’s industrial sponsors over the past six years, especially Facebook, Google, Huawei, Intel, Samsung, Seagate, and VMware. This work was also partially supported by ETH Zürich, the Intel Science and Technology Center for Cloud Computing, the Data Storage Systems Center at Carnegie Mellon University, and NSF grants 1212962 and 1320531. An earlier, shorter version of this book chapter appears on arxiv.org [15] and in the Proceedings of the IEEE [16].
Publisher Copyright:
© Springer Nature Singapore Pte Ltd 2018.
PY - 2018
Y1 - 2018
AB - NAND flash memory is ubiquitous in everyday life today because its capacity has continuously increased and cost has continuously decreased over decades. This positive growth is a result of two key trends: (1) effective process technology scaling; and (2) multi-level (e.g., MLC, TLC) cell data coding. Unfortunately, the reliability of raw data stored in flash memory has also continued to become more difficult to ensure, because these two trends lead to (1) fewer electrons in the flash memory cell floating gate to represent the data; and (2) larger cell-to-cell interference and disturbance effects. Without mitigation, worsening reliability can reduce the lifetime of NAND flash memory. As a result, flash memory controllers in solid-state drives (SSDs) have become much more sophisticated: they incorporate many effective techniques to ensure the correct interpretation of noisy data stored in flash memory cells. In this chapter, we review recent advances in SSD error characterization, mitigation, and data recovery techniques for reliability and lifetime improvement. We provide rigorous experimental data from state-of-the-art MLC and TLC NAND flash devices on various types of flash memory errors, to motivate the need for such techniques. Based on the understanding developed by the experimental characterization, we describe several mitigation and recovery techniques, including (1) cell-to-cell interference mitigation; (2) optimal multi-level cell sensing; (3) error correction using state-of-the-art algorithms and methods; and (4) data recovery when error correction fails. We quantify the reliability improvement provided by each of these techniques. Looking forward, we briefly discuss how flash memory and these techniques could evolve into the future.
UR - http://www.scopus.com/inward/record.url?scp=85049984434&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85049984434&partnerID=8YFLogxK
U2 - 10.1007/978-981-13-0599-3_9
DO - 10.1007/978-981-13-0599-3_9
M3 - Chapter
AN - SCOPUS:85049984434
T3 - Springer Series in Advanced Microelectronics
SP - 233
EP - 341
BT - Springer Series in Advanced Microelectronics
PB - Springer
ER -