TY - JOUR
T1 - Evaluation of crowdsourced mortality prediction models as a framework for assessing artificial intelligence in medicine
AU - the Patient Mortality Prediction DREAM Challenge Consortium
AU - Bergquist, Timothy
AU - Schaffter, Thomas
AU - Yan, Yao
AU - Yu, Thomas
AU - Prosser, Justin
AU - Gao, Jifan
AU - Chen, Guanhua
AU - Charzewski, Łukasz
AU - Nawalany, Zofia
AU - Brugere, Ivan
AU - Retkute, Renata
AU - Prusokas, Alidivinas
AU - Prusokas, Augustinas
AU - Choi, Yonghwa
AU - Lee, Sanghoon
AU - Choe, Junseok
AU - Lee, Inggeol
AU - Kim, Sunkyu
AU - Kang, Jaewoo
AU - Mooney, Sean D.
AU - Guinney, Justin
AU - Lee, Aaron
AU - Salehzadeh-Yazdi, Ali
AU - Basu, Anand
AU - Belouali, Anas
AU - Becker, Ann Kristin
AU - Israel, Ariel
AU - Winter, B.
AU - Moreno, Carlos Vega
AU - Kurz, Christoph
AU - Waltemath, Dagmar
AU - Schweinoch, Darius
AU - Glaab, Enrico
AU - Luo, Gang
AU - Zacharias, Helena U.
AU - Qiao, Hezhe
AU - Truthmann, Julia
AU - Stephens, Kari A.
AU - Kaderali, Lars
AU - Varshney, Lav R.
AU - Vollmer, Marcus
AU - Pandi, Maria Theodora
AU - Gunn, Martin L.
AU - Yetisgen, Meliha
AU - Nath, Neetika
AU - Hammarlund, Noah
AU - Müller-Stricker, Oliver
AU - Togias, Panagiotis
AU - Heagerty, Patrick J.
AU - Muir, Peter
N1 - Publisher Copyright: © The Author(s) 2023.
PY - 2024/1/1
Y1 - 2024/1/1
AB - Objective: Applications of machine learning in healthcare are of high interest and have the potential to improve patient care. Yet, the real-world accuracy of these models in clinical practice and on different patient subpopulations remains unclear. To address these important questions, we hosted a community challenge to evaluate methods that predict healthcare outcomes. We focused on the prediction of all-cause mortality as the community challenge question. Materials and methods: Using a Model-to-Data framework, 345 registered participants, coalescing into 25 independent teams, spread over 3 continents and 10 countries, generated 25 accurate models all trained on a dataset of over 1.1 million patients and evaluated on patients prospectively collected over a 1-year observation of a large health system. Results: The top performing team achieved a final area under the receiver operator curve of 0.947 (95% CI, 0.942-0.951) and an area under the precision-recall curve of 0.487 (95% CI, 0.458-0.499) on a prospectively collected patient cohort. Discussion: Post hoc analysis after the challenge revealed that models differ in accuracy on subpopulations, delineated by race or gender, even when they are trained on the same data. Conclusion: This is the largest community challenge focused on the evaluation of state-of-the-art machine learning methods in a healthcare system performed to date, revealing both opportunities and pitfalls of clinical AI.
KW - evaluation
KW - health informatics
KW - machine learning
UR - http://www.scopus.com/inward/record.url?scp=85181176180&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85181176180&partnerID=8YFLogxK
U2 - 10.1093/jamia/ocad159
DO - 10.1093/jamia/ocad159
M3 - Article
C2 - 37604111
AN - SCOPUS:85181176180
SN - 1067-5027
VL - 31
SP - 35
EP - 44
JO - Journal of the American Medical Informatics Association
JF - Journal of the American Medical Informatics Association
IS - 1
ER -