TY - JOUR
T1 - Optimization for Deep Learning
T2 - An Overview
AU - Sun, Ruo-Yu
N1 - Publisher Copyright:
© 2020, Operations Research Society of China, Periodicals Agency of Shanghai University, Science Press, and Springer-Verlag GmbH Germany, part of Springer Nature.
PY - 2020/6/1
Y1 - 2020/6/1
N2 - Optimization is a critical component in deep learning. We think optimization for neural networks is an interesting topic for theoretical research for several reasons. First, its tractability despite non-convexity is an intriguing question and may greatly expand our understanding of tractable problems. Second, classical optimization theory is far from enough to explain many phenomena. Therefore, we would like to understand the challenges and opportunities from a theoretical perspective and review the existing research in this field. First, we discuss the issue of gradient explosion/vanishing and the more general issue of an undesirable spectrum, and then discuss practical solutions, including careful initialization, normalization methods, and skip connections. Second, we review generic optimization methods used in training neural networks, such as stochastic gradient descent and adaptive gradient methods, and existing theoretical results. Third, we review existing research on the global issues of neural network training, including results on the global landscape, mode connectivity, the lottery ticket hypothesis, and the neural tangent kernel.
AB - Optimization is a critical component in deep learning. We think optimization for neural networks is an interesting topic for theoretical research for several reasons. First, its tractability despite non-convexity is an intriguing question and may greatly expand our understanding of tractable problems. Second, classical optimization theory is far from enough to explain many phenomena. Therefore, we would like to understand the challenges and opportunities from a theoretical perspective and review the existing research in this field. First, we discuss the issue of gradient explosion/vanishing and the more general issue of an undesirable spectrum, and then discuss practical solutions, including careful initialization, normalization methods, and skip connections. Second, we review generic optimization methods used in training neural networks, such as stochastic gradient descent and adaptive gradient methods, and existing theoretical results. Third, we review existing research on the global issues of neural network training, including results on the global landscape, mode connectivity, the lottery ticket hypothesis, and the neural tangent kernel.
KW - Convergence
KW - Deep learning
KW - Landscape
KW - Neural networks
KW - Non-convex optimization
UR - http://www.scopus.com/inward/record.url?scp=85086476832&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85086476832&partnerID=8YFLogxK
U2 - 10.1007/s40305-020-00309-6
DO - 10.1007/s40305-020-00309-6
M3 - Article
AN - SCOPUS:85086476832
SN - 2194-668X
VL - 8
SP - 249
EP - 294
JO - Journal of the Operations Research Society of China
JF - Journal of the Operations Research Society of China
IS - 2
ER -