TY - GEN
T1 - MediTab
T2 - 33rd International Joint Conference on Artificial Intelligence, IJCAI 2024
AU - Wang, Zifeng
AU - Gao, Chufan
AU - Xiao, Cao
AU - Sun, Jimeng
N1 - This work was supported by NSF awards SCH-2205289, SCH-2014438, and IIS-2034479.
PY - 2024
Y1 - 2024
N2 - Tabular data prediction has been employed in medical applications such as patient health risk prediction. However, existing methods usually revolve around the algorithm design while overlooking the significance of data engineering. Medical tabular datasets frequently exhibit significant heterogeneity across different sources, with limited sample sizes per source. As such, previous predictors are often trained on manually curated small datasets that struggle to generalize across different tabular datasets during inference. This paper proposes to scale medical tabular data predictors (MediTab) to various tabular inputs with varying features. The method uses a data engine that leverages large language models (LLMs) to consolidate tabular samples to overcome the barrier across tables with distinct schema. It also aligns out-domain data with the target task using a “learn, annotate, and refinement” pipeline. The expanded training data then enables the pre-trained MediTab to infer for arbitrary tabular input in the domain without finetuning, resulting in significant improvements over supervised baselines: it reaches an average ranking of 1.57 and 1.00 on 7 patient outcome prediction datasets and 3 trial outcome prediction datasets, respectively. In addition, MediTab exhibits impressive zero-shot performances: it outperforms supervised XGBoost models by 8.9% and 17.2% on average in two prediction tasks, respectively.
UR - http://www.scopus.com/inward/record.url?scp=85204312817&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85204312817&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85204312817
T3 - IJCAI International Joint Conference on Artificial Intelligence
SP - 6062
EP - 6070
BT - Proceedings of the 33rd International Joint Conference on Artificial Intelligence, IJCAI 2024
A2 - Larson, Kate
PB - International Joint Conferences on Artificial Intelligence
Y2 - 3 August 2024 through 9 August 2024
ER -