TY - JOUR
T1 - Practical Accuracy Estimation for Efficient Deep Neural Network Testing
AU - Chen, Junjie
AU - Wu, Zhuo
AU - Wang, Zan
AU - You, Hanmo
AU - Zhang, Lingming
AU - Yan, Ming
N1 - Funding Information:
This work was partially supported by the National Natural Science Foundation of China under Grant No. 61872263, Intelligent Manufacturing Special Fund of Tianjin under Grant No. 20191012, Innovation Research Project of Tianjin University under Grant No. 2020XZC-0042. This work was also partially supported by National Science Foundation under Grant No. CCF-1763906 and Alibaba. Authors’ addresses: J. Chen, Z. Wang (corresponding author), H. You, and M. Yan, College of Intelligence and Computing, Tianjin University, Tianjin, 300350, China; emails: {junjiechen, wangzan, youhanmo, yanming}@tju.edu.cn; Z. Wu, Tianjin International Engineering Institute, Tianjin University, Tianjin, 300350, China; email: [email protected]; L. Zhang, Department of Computer Science, The University of Texas at Dallas, Richardson, TX, 75080, USA; email: [email protected]. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. © 2020 Association for Computing Machinery. 1049-331X/2020/10-ART30 $15.00 https://doi.org/10.1145/3394112
Publisher Copyright:
© 2020 ACM.
PY - 2020/10
Y1 - 2020/10
N2 - Deep neural network (DNN) has become increasingly popular and DNN testing is very critical to guarantee the correctness of DNN, i.e., the accuracy of DNN in this work. However, DNN testing suffers from a serious efficiency problem, i.e., it is costly to label each test input to know the DNN accuracy for the testing set, since labeling each test input involves multiple persons (even with domain-specific knowledge) in a manual way and the testing set is large-scale. To relieve this problem, we propose a novel and practical approach, called PACE (which is short for Practical ACcuracy Estimation), which selects a small set of test inputs that can precisely estimate the accuracy of the whole testing set. In this way, the labeling costs can be largely reduced by just labeling this small set of selected test inputs. Besides achieving a precise accuracy estimation, to make PACE more practical it is also required that it is interpretable, deterministic, and as efficient as possible. Therefore, PACE first incorporates clustering to interpretably divide test inputs with different testing capabilities (i.e., testing different functionalities of a DNN model) into different groups. Then, PACE utilizes the MMD-critic algorithm, a state-of-the-art example-based explanation algorithm, to select prototypes (i.e., the most representative test inputs) from each group, according to the group sizes, which can reduce the impact of noise due to clustering. Meanwhile, PACE also borrows the idea of adaptive random testing to select test inputs from the minority space (i.e., the test inputs that are not clustered into any group) to achieve great diversity under the required number of test inputs. The two parallel selection processes (i.e., selection from both groups and the minority space) compose the final small set of selected test inputs. We conducted an extensive study to evaluate the performance of PACE based on a comprehensive benchmark (i.e., 24 pairs of DNN models and testing sets) by considering different types of models (i.e., classification and regression models, high-accuracy and low-accuracy models, and CNN and RNN models) and different types of test inputs (i.e., original, mutated, and automatically generated test inputs). The results demonstrate that PACE is able to precisely estimate the accuracy of the whole testing set with only 1.181%∼2.302% deviations, on average, significantly outperforming the state-of-the-art approaches.
AB - Deep neural network (DNN) has become increasingly popular and DNN testing is very critical to guarantee the correctness of DNN, i.e., the accuracy of DNN in this work. However, DNN testing suffers from a serious efficiency problem, i.e., it is costly to label each test input to know the DNN accuracy for the testing set, since labeling each test input involves multiple persons (even with domain-specific knowledge) in a manual way and the testing set is large-scale. To relieve this problem, we propose a novel and practical approach, called PACE (which is short for Practical ACcuracy Estimation), which selects a small set of test inputs that can precisely estimate the accuracy of the whole testing set. In this way, the labeling costs can be largely reduced by just labeling this small set of selected test inputs. Besides achieving a precise accuracy estimation, to make PACE more practical it is also required that it is interpretable, deterministic, and as efficient as possible. Therefore, PACE first incorporates clustering to interpretably divide test inputs with different testing capabilities (i.e., testing different functionalities of a DNN model) into different groups. Then, PACE utilizes the MMD-critic algorithm, a state-of-the-art example-based explanation algorithm, to select prototypes (i.e., the most representative test inputs) from each group, according to the group sizes, which can reduce the impact of noise due to clustering. Meanwhile, PACE also borrows the idea of adaptive random testing to select test inputs from the minority space (i.e., the test inputs that are not clustered into any group) to achieve great diversity under the required number of test inputs. The two parallel selection processes (i.e., selection from both groups and the minority space) compose the final small set of selected test inputs. We conducted an extensive study to evaluate the performance of PACE based on a comprehensive benchmark (i.e., 24 pairs of DNN models and testing sets) by considering different types of models (i.e., classification and regression models, high-accuracy and low-accuracy models, and CNN and RNN models) and different types of test inputs (i.e., original, mutated, and automatically generated test inputs). The results demonstrate that PACE is able to precisely estimate the accuracy of the whole testing set with only 1.181%∼2.302% deviations, on average, significantly outperforming the state-of-the-art approaches.
KW - Deep neural network testing
KW - labeling
KW - test input selection
KW - test optimization
UR - http://www.scopus.com/inward/record.url?scp=85092739576&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85092739576&partnerID=8YFLogxK
U2 - 10.1145/3394112
DO - 10.1145/3394112
M3 - Article
AN - SCOPUS:85092739576
SN - 1049-331X
VL - 29
JO - ACM Transactions on Software Engineering and Methodology
JF - ACM Transactions on Software Engineering and Methodology
IS - 4
M1 - 3394112
ER -