Cautionary Guidelines for Machine Learning Studies with Combinatorial Datasets

Andrew F. Zahrt, Jeremy J. Henle, Scott E. Denmark

Research output: Contribution to journal › Article › peer-review

Abstract

Regression modeling is becoming increasingly prevalent in organic chemistry as a tool for reaction outcome prediction and mechanistic interrogation. Frequently, to acquire the requisite amount of data for such studies, researchers employ combinatorial datasets to maximize the number of data points while limiting the number of discrete chemical entities required. An often-overlooked problem in modeling studies using combinatorial datasets is the tendency to fit on patterns in the datasets (i.e., the presence or absence of a reactant or catalyst) rather than to identify meaningful trends between descriptors and the response variable. Consequently, the generality and interpretability of such models suffer. This report illustrates these well-known pitfalls in a case study, demonstrates the necessary control experiments to identify when this property will be problematic, and suggests how to perform further validation to assess general applicability and interpretability of models trained using combinatorial datasets.
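The abstract does not spell out the specific controls. As a rough illustration only, one common out-of-sample check for a combinatorial dataset is to hold out every reaction involving a given catalyst, so the model cannot succeed merely by recognizing that catalyst. The sketch below is not the authors' protocol. The catalyst and substrate labels and all function names are hypothetical, and it covers only the data split, not model fitting.

```python
import itertools
import random


def combinatorial_dataset(catalysts, substrates):
    """Pair every catalyst with every substrate (one row per reaction)."""
    return list(itertools.product(catalysts, substrates))


def random_split(rows, test_fraction, seed=0):
    """Shuffle the rows and cut off a test set, ignoring which entity each row uses."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def leave_catalysts_out(rows, held_out):
    """Put every reaction that uses a held-out catalyst into the test set."""
    held_out = set(held_out)
    train = [row for row in rows if row[0] not in held_out]
    test = [row for row in rows if row[0] in held_out]
    return train, test


def unseen_catalyst_fraction(train, test):
    """Fraction of test rows whose catalyst never appears in the training rows."""
    seen = {catalyst for catalyst, _ in train}
    if not test:
        return 0.0
    return sum(catalyst not in seen for catalyst, _ in test) / len(test)


# Hypothetical labels: 6 catalysts x 5 substrates = 30 reactions.
catalysts = [f"cat{i}" for i in range(6)]
substrates = [f"sub{j}" for j in range(5)]
rows = combinatorial_dataset(catalysts, substrates)

# The random test set holds fewer rows than any one catalyst contributes,
# so every test catalyst is guaranteed to appear in training as well.
rand_train, rand_test = random_split(rows, test_fraction=0.1)
group_train, group_test = leave_catalysts_out(rows, held_out=["cat5"])

print("random split, unseen-catalyst fraction:",
      unseen_catalyst_fraction(rand_train, rand_test))
print("catalyst held out, unseen-catalyst fraction:",
      unseen_catalyst_fraction(group_train, group_test))
```

Under a random split, every test catalyst has already been seen in training, so a good score may reflect memorized identities. The grouped split scores the model only on a catalyst it has never encountered, which is closer to a test of general applicability.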

Original language: English (US)
Pages (from-to): 586-591
Number of pages: 6
Journal: ACS Combinatorial Science
Volume: 22
Issue number: 11
State: Published - Nov 9 2020

Keywords

  • enantioselective catalysis
  • machine learning

ASJC Scopus subject areas

  • Chemistry(all)
