TY - JOUR
T1 - Synthesize high-dimensional longitudinal electronic health records via hierarchical autoregressive language model
AU - Theodorou, Brandon
AU - Xiao, Cao
AU - Sun, Jimeng
N1 - Publisher Copyright: © 2023, Springer Nature Limited.
PY - 2023/12
Y1 - 2023/12
N2 - Synthetic electronic health records (EHRs) that are both realistic and privacy-preserving offer alternatives to real EHRs for machine learning (ML) and statistical analysis. However, generating high-fidelity EHR data in its original, high-dimensional form poses challenges for existing methods. We propose Hierarchical Autoregressive Language mOdel (HALO) for generating longitudinal, high-dimensional EHR, which preserve the statistical properties of real EHRs and can train accurate ML models without privacy concerns. HALO generates a probability density function over medical codes, clinical visits, and patient records, allowing for generating realistic EHR data without requiring variable selection or aggregation. Extensive experiments demonstrated that HALO can generate high-fidelity data with high-dimensional disease code probabilities closely mirroring (above 0.9 R² correlation) real EHR data. HALO also enhances the accuracy of predictive modeling and enables downstream ML models to attain similar accuracy as models trained on genuine data.
AB - Synthetic electronic health records (EHRs) that are both realistic and privacy-preserving offer alternatives to real EHRs for machine learning (ML) and statistical analysis. However, generating high-fidelity EHR data in its original, high-dimensional form poses challenges for existing methods. We propose Hierarchical Autoregressive Language mOdel (HALO) for generating longitudinal, high-dimensional EHR, which preserve the statistical properties of real EHRs and can train accurate ML models without privacy concerns. HALO generates a probability density function over medical codes, clinical visits, and patient records, allowing for generating realistic EHR data without requiring variable selection or aggregation. Extensive experiments demonstrated that HALO can generate high-fidelity data with high-dimensional disease code probabilities closely mirroring (above 0.9 R² correlation) real EHR data. HALO also enhances the accuracy of predictive modeling and enables downstream ML models to attain similar accuracy as models trained on genuine data.
UR - http://www.scopus.com/inward/record.url?scp=85169348718&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85169348718&partnerID=8YFLogxK
U2 - 10.1038/s41467-023-41093-0
DO - 10.1038/s41467-023-41093-0
M3 - Article
C2 - 37652934
AN - SCOPUS:85169348718
SN - 2041-1723
VL - 14
JO - Nature Communications
JF - Nature Communications
IS - 1
M1 - 5305
ER -