Rule-based and machine learning algorithms identify patients with systemic sclerosis accurately in the electronic health record
- PMID: 31888720
- PMCID: PMC6937803
- DOI: 10.1186/s13075-019-2092-7
Abstract
Background: Systemic sclerosis (SSc) is a rare disease with studies limited by small sample sizes. Electronic health records (EHRs) represent a powerful tool to study patients with rare diseases such as SSc, but validated methods are needed. We developed and validated EHR-based algorithms that incorporate billing codes and clinical data to identify SSc patients in the EHR.
Methods: We used a de-identified EHR with over 3 million subjects and identified 1899 potential SSc subjects with at least 1 count of the SSc ICD-9 (710.1) or ICD-10-CM (M34*) codes. We randomly selected 200 as a training set for chart review. A subject was a case if diagnosed with SSc by a rheumatologist, dermatologist, or pulmonologist. We selected the following algorithm components based on clinical knowledge and available data: SSc ICD-9 and ICD-10-CM codes, positive antinuclear antibody (ANA) (titer ≥ 1:80), and a keyword of Raynaud's phenomenon (RP). We used both rule-based and machine learning techniques for algorithm development. Positive predictive values (PPVs), sensitivities, and F-scores (which account for PPVs and sensitivities) were calculated for the algorithms.
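The rule-based approach described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the `Subject` fields, the `rule` and `evaluate` helpers, and the toy records are all hypothetical, and the F-score is assumed to be the harmonic mean of PPV and sensitivity, which the abstract does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class Subject:
    """Hypothetical per-subject record; field names are illustrative only."""
    icd_count: int      # counts of SSc ICD-9 (710.1) or ICD-10-CM (M34*) codes
    ana_positive: bool  # ANA titer >= 1:80
    rp_keyword: bool    # Raynaud's phenomenon keyword present
    is_case: bool       # chart-review label (the reference standard)


def rule(subject, min_codes, require_ana=False, require_rp=False):
    """Return True if the subject satisfies the rule-based algorithm."""
    if subject.icd_count < min_codes:
        return False
    if require_ana and not subject.ana_positive:
        return False
    if require_rp and not subject.rp_keyword:
        return False
    return True


def evaluate(subjects, **rule_kwargs):
    """Compute (PPV, sensitivity, F-score) of a rule against chart review."""
    tp = fp = fn = 0
    for s in subjects:
        flagged = rule(s, **rule_kwargs)
        if flagged and s.is_case:
            tp += 1
        elif flagged and not s.is_case:
            fp += 1
        elif not flagged and s.is_case:
            fn += 1
    ppv = tp / (tp + fp) if tp + fp else 0.0
    sens = tp / (tp + fn) if tp + fn else 0.0
    # Assumed definition: harmonic mean of PPV and sensitivity.
    f = 2 * ppv * sens / (ppv + sens) if ppv + sens else 0.0
    return ppv, sens, f


# Invented toy data, not taken from the study.
toy = [
    Subject(5, True, True, True),
    Subject(4, True, False, True),
    Subject(1, False, True, True),
    Subject(1, False, False, False),
    Subject(4, False, False, False),
    Subject(2, True, True, False),
]

codes_only = evaluate(toy, min_codes=4)
codes_plus_ana = evaluate(toy, min_codes=4, require_ana=True)
```

On this toy data, requiring ANA positivity removes the one false positive, so PPV rises from 2/3 to 1.0 while sensitivity stays at 2/3. The numbers are artifacts of the invented records and say nothing about the study's results.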
Results: PPVs were low for algorithms using only 1 count of the SSc ICD-9 code. As code counts increased, the PPVs increased. PPVs were higher for algorithms using ICD-10-CM codes versus the ICD-9 code. Adding a positive ANA and the RP keyword to algorithms using only ICD billing codes increased their PPVs. Algorithms using ≥ 3 or ≥ 4 counts of the SSc ICD-9 or ICD-10-CM codes and ANA positivity had the highest PPV at 100% but a low sensitivity at 50%. The algorithm with the highest F-score (91%) used ≥ 4 counts of the ICD-9 or ICD-10-CM codes and had an internally validated PPV of 90%. A machine learning method using random forests yielded an algorithm with a PPV of 84%, sensitivity of 92%, and F-score of 88%. The most important feature was the RP keyword.
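As a quick arithmetic check on the trade-off reported in the Results, the sketch below assumes the F-score is the harmonic mean of PPV and sensitivity (the abstract says only that it accounts for both). The `f_score` helper is hypothetical. Under that assumption, the random forest's reported PPV and sensitivity reproduce its reported F-score of 88%. The value for the PPV 100% / sensitivity 50% algorithms is derived here and is not reported in the abstract.

```python
def f_score(ppv, sensitivity):
    """Harmonic mean of PPV and sensitivity (assumed definition)."""
    if ppv + sensitivity == 0:
        return 0.0
    return 2 * ppv * sensitivity / (ppv + sensitivity)


# Random forest algorithm: PPV 84%, sensitivity 92% (reported F-score 88%).
rf_f = f_score(0.84, 0.92)
print(round(rf_f * 100))  # 88

# Codes plus ANA positivity: PPV 100%, sensitivity 50%.
# Derived value under the assumed formula; not reported in the abstract.
strict_f = f_score(1.00, 0.50)
print(round(strict_f * 100))  # 67
```

The check shows why a perfect PPV alone does not give the best F-score: halving sensitivity pulls the harmonic mean well below that of a more balanced algorithm.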
Conclusions: Algorithms using only ICD-9 codes did not perform well in identifying SSc patients. The highest performing algorithms incorporated clinical data with billing codes. EHR-based algorithms can identify SSc patients across a healthcare system, enabling researchers to examine important outcomes.
Keywords: Algorithms; Bioinformatics; Electronic health records; Systemic sclerosis.
Conflict of interest statement
The authors declare that they have no competing interests.
Figures
![Fig. 1](https://cdn.statically.io/img/www.ncbi.nlm.nih.gov/pmc/articles/instance/6937803/bin/13075_2019_2092_Fig1_HTML.gif)