TY - JOUR
T1 - Combining Group Contribution Method and Semisupervised Learning to Build Machine Learning Models for Predicting Hydroxyl Radical Rate Constants of Water Contaminants
AU - Liu, Zhao
AU - Shang, Lanyu
AU - Huang, Kuan
AU - Yue, Zhenrui
AU - Han, Alan Y.
AU - Wang, Dong
AU - Zhang, Huichun
N1 - This work was funded by the US National Science Foundation Grant # CHE-2105005.
PY - 2025/1/14
Y1 - 2025/1/14
N2 - Machine learning is an effective tool for predicting reaction rate constants for many organic compounds with the hydroxyl radical (HO•). Previously reported models have achieved relatively good performance, but due to scarce data (<1400 records), the applicability domain (AD) has been significantly limited. To address this limitation, we curated a much larger experimental data set (Primary data set), which contains 2358 kinetic records. We then employed both the group contribution method (GCM) and a semisupervised learning (SSL) strategy to add new data points, aiming to effectively expand the model’s AD while improving model performance. The results indicated that GCM improved the model’s performance for chemicals outside the AD, while SSL expanded the model’s AD. The final model, after incorporating 147,168 new data points, achieved an R² = 0.77, root-mean-square-error = 0.32, and mean-absolute-error = 0.24 on the test set. Importantly, the AD was expanded by 117% compared to the model developed solely based on the Primary data set, and the final model can be reliably applied to more than 560,000 chemicals from the DSSTox database. Further model interpretation results indicated that the model made predictions based on a correct “understanding” of the impact of key substituents and reactive sites toward HO•. This research provides an effective method for augmenting data sets, which is important in improving ML model performance and expanding AD. The final model has been made widely accessible through a free online predictor.
AB - Machine learning is an effective tool for predicting reaction rate constants for many organic compounds with the hydroxyl radical (HO•). Previously reported models have achieved relatively good performance, but due to scarce data (<1400 records), the applicability domain (AD) has been significantly limited. To address this limitation, we curated a much larger experimental data set (Primary data set), which contains 2358 kinetic records. We then employed both the group contribution method (GCM) and a semisupervised learning (SSL) strategy to add new data points, aiming to effectively expand the model’s AD while improving model performance. The results indicated that GCM improved the model’s performance for chemicals outside the AD, while SSL expanded the model’s AD. The final model, after incorporating 147,168 new data points, achieved an R² = 0.77, root-mean-square-error = 0.32, and mean-absolute-error = 0.24 on the test set. Importantly, the AD was expanded by 117% compared to the model developed solely based on the Primary data set, and the final model can be reliably applied to more than 560,000 chemicals from the DSSTox database. Further model interpretation results indicated that the model made predictions based on a correct “understanding” of the impact of key substituents and reactive sites toward HO•. This research provides an effective method for augmenting data sets, which is important in improving ML model performance and expanding AD. The final model has been made widely accessible through a free online predictor.
KW - applicability domain
KW - group contribution method
KW - hydroxyl radical
KW - machine learning
KW - reaction rate constant
KW - semisupervised learning
UR - http://www.scopus.com/inward/record.url?scp=85213225963&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85213225963&partnerID=8YFLogxK
U2 - 10.1021/acs.est.4c11950
DO - 10.1021/acs.est.4c11950
M3 - Article
C2 - 39723902
AN - SCOPUS:85213225963
SN - 0013-936X
VL - 59
SP - 857
EP - 868
JO - Environmental Science and Technology
JF - Environmental Science and Technology
IS - 1
ER -