Machine Learning Based Prediction of Incident Cases of Crohn’s Disease Using Electronic Health Records from a Large Integrated Health System

  • Conference paper
Artificial Intelligence in Medicine (AIME 2023)

Abstract

Early diagnosis and treatment of Crohn’s Disease (CD) are associated with a decreased risk of surgery and complications. However, diagnostic delay is common in clinical practice. To better understand CD risk factors and disease indicators, we identified incident CD patients and controls within the Mount Sinai Data Warehouse (MSDW) and developed machine learning (ML) models for disease prediction.

Incident CD cases were defined based on CD diagnosis codes, medication prescriptions, healthcare utilization before the first CD diagnosis, and clinical text, using structured Electronic Health Records (EHR) and clinical notes from MSDW. Cases were matched to controls on sex, age, and healthcare utilization. This yielded 249 incident CD cases and 1,242 matched controls in MSDW. For cohort characterization and predictive modeling, we excluded data from the 180 days before the first CD diagnosis. Clinical text was encoded by term frequency-inverse document frequency (TF-IDF), and structured EHR features were aggregated. We compared three ML models: Logistic Regression, Random Forest, and XGBoost.
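The exclusion window and the text encoding described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the note data, the function names `notes_before_window` and `tfidf`, the inclusive cutoff, the whitespace tokenizer, and the unsmoothed weight tf × log(N / df) are all assumptions chosen for illustration. Library implementations often use smoothed or normalized variants.

```python
import math
from collections import Counter
from datetime import date, timedelta

EXCLUSION_DAYS = 180  # window before the first CD diagnosis, per the abstract


def notes_before_window(notes, index_date, exclusion_days=EXCLUSION_DAYS):
    """Keep note texts dated at least `exclusion_days` before `index_date`.

    The inclusive boundary is an assumption; the abstract does not specify it.
    """
    cutoff = index_date - timedelta(days=exclusion_days)
    return [text for note_date, text in notes if note_date <= cutoff]


def tfidf(documents):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical patient: (note date, note text) pairs and a first-diagnosis date.
index_date = date(2020, 6, 1)
notes = [
    (date(2019, 3, 10), "abdominal pain and diarrhea"),
    (date(2019, 11, 20), "anemia follow up abdominal pain"),
    (date(2020, 4, 15), "colonoscopy scheduled"),  # inside the 180-day window
]
kept = notes_before_window(notes, index_date)
vectors = tfidf(kept)
```

In this toy example the third note is dropped because it falls inside the exclusion window, and terms shared by both remaining notes (such as "abdominal") receive a weight of zero, since they do not help distinguish one document from the other.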

Gastrointestinal symptoms, for instance anal fistula and irritable bowel syndrome, are significantly overrepresented in cases at least 180 days before the first CD code (prevalence of 33% in cases compared to 12% in controls). XGBoost is the best-performing model for predicting CD, with an area under the receiver operating characteristic curve (AUROC) of 0.72 based on structured EHR data only. The structured EHR features with the highest predictive importance include anemia lab values and race (white). The results suggest that ML algorithms could enable earlier diagnosis of CD and reduce the diagnostic delay.
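For readers unfamiliar with the AUROC metric reported above, the following sketch computes it from its pairwise definition: the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. The function name `auroc` and the labels and scores are hypothetical toy values, not data from the study.

```python
def auroc(labels, scores):
    """Fraction of (case, control) pairs where the case scores higher; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one case and one control")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 2 cases (label 1) and 3 controls (label 0) with model scores.
toy_labels = [1, 1, 0, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.3, 0.2]
toy_auc = auroc(toy_labels, toy_scores)  # 5 of 6 pairs are ranked correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of cases from controls. The pairwise loop is quadratic in cohort size, so a rank-based formulation would be preferable for large cohorts.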

J. Hugo and S. Ibing contributed equally to this work.

Acknowledgements

This work is supported in part through the use of the research platform AI-Ready Mount Sinai (AIR.MS), and through the MSDW resources and staff expertise provided by Scientific Computing and Data at the Icahn School of Medicine at Mount Sinai. The research leading to these results has received funding from the Horizon 2020 Programme of the European Commission under Grant Agreement No. 826117 and by the Joachim-Herz foundation (to SI).

Author information

Corresponding author

Correspondence to Susanne Ibing.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Hugo, J. et al. (2023). Machine Learning Based Prediction of Incident Cases of Crohn’s Disease Using Electronic Health Records from a Large Integrated Health System. In: Juarez, J.M., Marcos, M., Stiglic, G., Tucker, A. (eds) Artificial Intelligence in Medicine. AIME 2023. Lecture Notes in Computer Science, vol 13897. Springer, Cham. https://doi.org/10.1007/978-3-031-34344-5_35

  • DOI: https://doi.org/10.1007/978-3-031-34344-5_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-34343-8

  • Online ISBN: 978-3-031-34344-5

  • eBook Packages: Computer Science, Computer Science (R0)
