TY - GEN
T1 - Self-Regularity of Output Weights for Overparameterized Two-Layer Neural Networks
AU - Gamarnik, David
AU - Kizildag, Eren C.
AU - Zadik, Ilias
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2021/7/12
Y1 - 2021/7/12
N2 - We consider the problem of finding a two-layer neural network with sigmoid, rectified linear unit, or binary step activation functions that 'fits' a training data set as accurately as possible, as quantified by the training error, and study the following question: does a low training error guarantee that the norm of the output layer (outer norm) itself is small? We address this question for the case of non-negative output weights. Using a simple covering number argument, we establish that under quite mild distributional assumptions on the input/label pairs, any such network achieving a small training error on polynomially many data points necessarily has a well-controlled outer norm. Notably, our results (a) have a good sample complexity, (b) are independent of the number of hidden units, (c) are oblivious to the training algorithm, and (d) require quite mild assumptions on the data (in particular, the input vector X ∈ ℝ^d need not have independent coordinates). We then show how our bounds can be leveraged to yield generalization guarantees for such networks.
AB - We consider the problem of finding a two-layer neural network with sigmoid, rectified linear unit, or binary step activation functions that 'fits' a training data set as accurately as possible, as quantified by the training error, and study the following question: does a low training error guarantee that the norm of the output layer (outer norm) itself is small? We address this question for the case of non-negative output weights. Using a simple covering number argument, we establish that under quite mild distributional assumptions on the input/label pairs, any such network achieving a small training error on polynomially many data points necessarily has a well-controlled outer norm. Notably, our results (a) have a good sample complexity, (b) are independent of the number of hidden units, (c) are oblivious to the training algorithm, and (d) require quite mild assumptions on the data (in particular, the input vector X ∈ ℝ^d need not have independent coordinates). We then show how our bounds can be leveraged to yield generalization guarantees for such networks.
UR - http://www.scopus.com/inward/record.url?scp=85115053018&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85115053018&partnerID=8YFLogxK
U2 - 10.1109/ISIT45174.2021.9517811
DO - 10.1109/ISIT45174.2021.9517811
M3 - Conference contribution
AN - SCOPUS:85115053018
T3 - IEEE International Symposium on Information Theory - Proceedings
SP - 819
EP - 824
BT - 2021 IEEE International Symposium on Information Theory, ISIT 2021 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 IEEE International Symposium on Information Theory, ISIT 2021
Y2 - 12 July 2021 through 20 July 2021
ER -