TY - JOUR
T1 - Machine Learning May Sometimes Simply Capture Literature Popularity Trends
T2 - A Case Study of Heterocyclic Suzuki-Miyaura Coupling
AU - Beker, Wiktor
AU - Roszak, Rafal
AU - Wolos, Agnieszka
AU - Angello, Nicholas H.
AU - Rathore, Vandana
AU - Burke, Martin D.
AU - Grzybowski, Bartosz A.
N1 - Publisher Copyright: © 2022 American Chemical Society. All rights reserved.
PY - 2022/3/23
Y1 - 2022/3/23
AB - Applications of machine learning (ML) to synthetic chemistry rely on the assumption that large numbers of literature-reported examples should enable construction of accurate and predictive models of chemical reactivity. This paper demonstrates that abundance of carefully curated literature data may be insufficient for this purpose. Using an example of Suzuki-Miyaura coupling with heterocyclic building blocks, and a carefully selected database of >10,000 literature examples, we show that ML models cannot offer any meaningful predictions of optimum reaction conditions, even if the search space is restricted to only solvents and bases. This result holds irrespective of the ML model applied (from simple feed-forward to state-of-the-art graph-convolution neural networks) or the representation to describe the reaction partners (various fingerprints, chemical descriptors, latent representations, etc.). In all cases, the ML methods fail to perform significantly better than naive assignments based on the sheer frequency of certain reaction conditions reported in the literature. These unsatisfactory results likely reflect subjective preferences of various chemists to use certain protocols, other biasing factors as mundane as availability of certain solvents/reagents, and/or a lack of negative data. These findings highlight the likely importance of systematically generating reliable and standardized data sets for algorithm training.
UR - http://www.scopus.com/inward/record.url?scp=85126567677&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85126567677&partnerID=8YFLogxK
U2 - 10.1021/jacs.1c12005
DO - 10.1021/jacs.1c12005
M3 - Article
C2 - 35258973
AN - SCOPUS:85126567677
SN - 0002-7863
VL - 144
SP - 4819
EP - 4827
JO - Journal of the American Chemical Society
JF - Journal of the American Chemical Society
IS - 11
ER -