TY - CONF
T1 - How Much Space Has Been Explored? Measuring the Chemical Space Covered by Databases and Machine-Generated Molecules
AU - Xie, Yutong
AU - Xu, Ziqiao
AU - Ma, Jiaqi
AU - Mei, Qiaozhu
N1 - We would like to thank Professor Kevyn Collins-Thompson, Professor Paramveer Dhillon, and Professor Aaron Frank for their helpful feedback and suggestions. We also thank the anonymous reviewers for their constructive comments, and in particular for pointing out the relation between #Circles and packing number. This work was in part supported by the National Science Foundation under grant number 1633370.
PY - 2023
Y1 - 2023
N2 - Forming a molecular candidate set that contains a wide range of potentially effective compounds is crucial to the success of drug discovery. While most databases and machine-learning-based generation models aim to optimize particular chemical properties, there is limited literature on how to properly measure the coverage of the chemical space by those candidates included or generated. This problem is challenging due to the lack of formal criteria to select good measures of the chemical space. In this paper, we propose a novel evaluation framework for measures of the chemical space based on two analyses: an axiomatic analysis with three intuitive axioms that a good measure should obey, and an empirical analysis on the correlation between a measure and a proxy gold standard. Using this framework, we are able to identify #Circles, a new measure of chemical space coverage, which is superior to existing measures both analytically and empirically. We further evaluate how well the existing databases and generation models cover the chemical space in terms of #Circles. The results suggest that many generation models fail to explore a larger space than existing databases, which leads to new opportunities for improving generation models by encouraging exploration.
UR - http://www.scopus.com/inward/record.url?scp=85159367788&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85159367788&partnerID=8YFLogxK
M3 - Paper
AN - SCOPUS:85159367788
T2 - 11th International Conference on Learning Representations, ICLR 2023
Y2 - 1 May 2023 through 5 May 2023
ER -