Natural language processing pipeline to extract prostate cancer-related information from clinical notes

  • Imaging Informatics and Artificial Intelligence
  • European Radiology

Abstract

Objectives

To develop an automated pipeline for extracting prostate cancer-related information from clinical notes.

Materials and methods

This retrospective study included 23,225 patients who underwent prostate MRI between 2017 and 2022. Cancer risk factors (family history of cancer and digital rectal exam findings), pre-MRI prostate pathology, and treatment history of prostate cancer were extracted from free-text clinical notes in English as binary or multi-class classification tasks. Any sentence containing pre-defined keywords was extracted from clinical notes written within one year before the MRI. After sentence-level datasets with ground truth were created manually, Bidirectional Encoder Representations from Transformers (BERT)-based sentence-level models were fine-tuned using the extracted sentence as input and the category as output. The patient-level output was determined by aggregating the multiple sentence-level outputs with tree-based models. Sentence-level classification performance was evaluated using the area under the receiver operating characteristic curve (AUC) on 15% of the sentence-level dataset (sentence-level test set). Patient-level classification performance was evaluated on a patient-level test set that radiologists created by reviewing the clinical notes of 603 patients. Accuracy and sensitivity were compared between the pipeline and radiologists.
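The keyword-based extraction step described above can be illustrated with a minimal sketch. The keyword list, the naive sentence splitter, the inclusive 365-day window, and the function names are all assumptions made for illustration; the abstract does not specify the study's actual keywords or tooling.

```python
import re
from datetime import date, timedelta

# Hypothetical keywords; the study's pre-defined keyword list is not given in the abstract.
KEYWORDS = ("family history", "digital rectal", "DRE", "biopsy", "prostatectomy")
# Word boundaries keep short keywords such as "DRE" from matching inside other words.
KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_candidate_sentences(notes, mri_date, window_days=365):
    """Collect keyword-containing sentences from notes dated within the window before the MRI.

    notes: iterable of (note_date, note_text) pairs.
    """
    start = mri_date - timedelta(days=window_days)
    hits = []
    for note_date, text in notes:
        # Skip notes outside the look-back window (bounds treated as inclusive here).
        if not (start <= note_date <= mri_date):
            continue
        for sentence in split_sentences(text):
            if KEYWORD_RE.search(sentence):
                hits.append(sentence)
    return hits


# Toy notes, invented for illustration only.
notes = [
    (date(2021, 3, 1), "Father had prostate cancer. Family history reviewed today."),
    (date(2019, 1, 5), "Prior biopsy was benign."),
    (date(2021, 6, 10), "DRE revealed a firm nodule. Patient feels well."),
]
sentences = extract_candidate_sentences(notes, mri_date=date(2021, 7, 1))
print(sentences)
```

In a pipeline of this kind, each extracted sentence would then be passed to a fine-tuned sentence-level classifier, and the per-sentence outputs would be combined into one patient-level label.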

Results

Sentence-level AUCs were ≥ 0.94. Compared with radiologists, the pipeline showed higher patient-level sensitivity for extracting cancer risk factors (e.g., family history of prostate cancer, 96.5% vs. 77.9%, p < 0.001), but lower accuracy in classifying pre-MRI prostate pathology (92.5% vs. 95.9%, p = 0.002) and treatment history of prostate cancer (95.5% vs. 97.7%, p = 0.03).
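To make the reported comparison concrete, the sketch below computes sensitivity and accuracy for two readers on the same patients and compares them with an exact McNemar test. The choice of McNemar's test is an assumption (the abstract does not name the statistical test used), and the labels are toy values, not study data.

```python
from math import comb


def sensitivity(y_true, y_pred):
    """Fraction of true positives (label 1) that the reader also labeled 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p != 1)
    return tp / (tp + fn)


def accuracy(y_true, y_pred):
    """Fraction of cases where the reader's label equals the ground truth."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value for paired correctness of readers A and B.

    Only discordant cases (one reader right, the other wrong) carry information.
    To compare sensitivities, pass only the ground-truth-positive cases.
    """
    b = sum(1 for t, a, r in zip(y_true, pred_a, pred_b) if a == t and r != t)
    c = sum(1 for t, a, r in zip(y_true, pred_a, pred_b) if a != t and r == t)
    n = b + c
    if n == 0:
        return 1.0
    # Binomial(n, 0.5) tail probability of the smaller discordant count, doubled.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy labels, invented for illustration only (1 = finding present).
y_true = [1, 1, 1, 1, 0, 0, 1, 1]
pipeline = [1, 1, 1, 1, 0, 1, 1, 0]
radiologist = [1, 0, 0, 1, 0, 0, 0, 1]

print("sensitivity:", sensitivity(y_true, pipeline), sensitivity(y_true, radiologist))
print("accuracy:", accuracy(y_true, pipeline), accuracy(y_true, radiologist))
print("McNemar exact p:", mcnemar_exact(y_true, pipeline, radiologist))
```

A paired test suits this design because both readers label the same patients, so their errors are not independent samples.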

Conclusion

The proposed pipeline showed promising performance, especially for extracting cancer risk factors from patients’ clinical notes.

Clinical relevance statement

The natural language processing pipeline showed a higher sensitivity for extracting prostate cancer risk factors than radiologists and may help efficiently gather relevant text information when interpreting prostate MRI.

Key Points

  • When interpreting prostate MRI, it is necessary to extract prostate cancer-related information from clinical notes.

  • This pipeline extracted the presence of prostate cancer risk factors with higher sensitivity than radiologists.

  • Natural language processing may help radiologists efficiently gather relevant prostate cancer-related text information.

Abbreviations

AUC: Area under the receiver operating characteristic curve

BERT: Bidirectional Encoder Representations from Transformers

CI: Confidence interval

LLM: Large language model

NLP: Natural language processing

ROC: Receiver operating characteristic

Acknowledgements

The authors wish to thank Desiree Lanzino, PhD, for her assistance in editing the manuscript.

Funding

The authors state that this work has not received any funding.

Author information

Corresponding author

Correspondence to Naoki Takahashi.

Ethics declarations

Guarantor

The scientific guarantor of this publication is Naoki Takahashi.

Conflict of interest

The authors of this manuscript declare no relationships with any companies whose products or services may be related to the subject matter of the article.

Statistics and biometry

No complex statistical methods were necessary for this paper.

Informed consent

Written informed consent was waived by the Institutional Review Board. “The Reviewer approved waiver of the requirement to obtain informed consent in accordance with 45 CFR 46.116 as justified by the Investigator, and waiver of HIPAA authorization in accordance with applicable HIPAA regulations”.

Ethical approval

Institutional Review Board approval was obtained (#23-008038). “IRB Application #: 23-008038. Title: Development of Machine Learning Model of Prostate Cancer Using Prostate MRI and Clinical Data. IRB Approval Date: 8/30/2023. IRB Expiration Date: The above referenced application was reviewed by expedited review procedures and is determined to be exempt from the requirement for IRB approval (45 CFR 46.104d, category 4). Continued IRB review of this study is not required as it is currently written. However, requests for modifications to the study design or procedures must be submitted to the IRB to determine whether the study continues to be exempt. The Reviewer approved waiver of HIPAA authorization in accordance with applicable HIPAA regulations. As the principal investigator of this project, you are responsible for the following relating to this study. (1) When applicable, use only IRB approved materials which are located under the documents tab of the IRBe workspace. Materials include consent forms, HIPAA, questionnaires, contact letters, advertisements, etc. (2) Submission to the IRB of any modifications to approved research along with any supporting documents for review and approval prior to initiation of the changes. (3) Submission to the IRB of all Unanticipated Problems Involving Risks to Subjects or Others (UPIRTSO) and major protocol violations/deviations within five working days of becoming aware of the occurrence. (4) Compliance with applicable regulations for the protection of human subjects and with Mayo Clinic Institutional Policies. Mayo Clinic Institutional Reviewer”.

Study subjects or cohorts overlap

Thousands of patients included in this study overlapped with previously published works that evaluated cancer detection rates of prostate MRI in various populations and with a study that developed deep learning models for detecting clinically significant prostate cancer.

  1. Nagayama H, Nakai H, Takahashi H, et al. Cancer detection rate and abnormal interpretation rate of prostate MRI performed for clinical suspicion of prostate cancer. J Am Coll Radiol. 2023; https://doi.org/10.1016/j.jacr.2023.07.031.

  2. Nakai H, Nagayama H, Takahashi H, et al. Cancer detection rate and abnormal interpretation rate of prostate MRI in patients with low-grade cancer. J Am Coll Radiol. 2023; https://doi.org/10.1016/j.jacr.2023.07.030.

  3. Nakai H, Takahashi H, Adamo DA, et al. Decreased cancer detection rate of the prostate MRI in patients with moderate to severe susceptibility artifacts from hip prosthesis. Eur Radiol. 2023; https://doi.org/10.1007/s00330-023-10345-4.

  4. Cai JC, Nakai H, Kuanar S, et al. A fully automated deep learning model to detect clinically significant prostate cancer on multiparametric MRI. (Manuscript under review).

Methodology

  • Retrospective

  • Diagnostic or prognostic study

  • Performed at one institution

Additional information

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Nakai, H., Suman, G., Adamo, D.A. et al. Natural language processing pipeline to extract prostate cancer-related information from clinical notes. Eur Radiol (2024). https://doi.org/10.1007/s00330-024-10812-6

  • DOI: https://doi.org/10.1007/s00330-024-10812-6