This is a preprint.
The Diagnostic and Triage Accuracy of the GPT-3 Artificial Intelligence Model
- PMID: 36778449
- PMCID: PMC9915829
- DOI: 10.1101/2023.01.30.23285067
Abstract
Importance: Artificial intelligence (AI) applications in health care have been effective in many areas of medicine, but they are often trained for a single task using labeled data, making deployment and generalizability challenging. Whether a general-purpose AI language model can perform diagnosis and triage is unknown.
Objective: Compare the general-purpose Generative Pre-trained Transformer 3 (GPT-3) AI model's diagnostic and triage performance to attending physicians and lay adults who use the Internet.
Design: We compared the accuracy of GPT-3's diagnosis and triage for 48 validated case vignettes of both common (e.g., viral illness) and severe (e.g., heart attack) conditions with that of lay people and practicing physicians. We also examined how well calibrated GPT-3's confidence was for diagnosis and triage.
Setting and participants: The GPT-3 model, a nationally representative sample of lay people, and practicing physicians.
Exposure: Validated case vignettes (<60 words; <6th grade reading level).
Main outcomes and measures: Correct diagnosis, correct triage.
Results: Among all cases, GPT-3 replied with the correct diagnosis in its top 3 for 88% (95% CI, 75% to 94%) of cases, compared to 54% (95% CI, 53% to 55%) for lay individuals (p<0.001) and 96% (95% CI, 94% to 97%) for physicians (p=0.0354). GPT-3 triaged (71% correct; 95% CI, 57% to 82%) similarly to lay individuals (74%; 95% CI, 73% to 75%; p=0.73); both were significantly worse than physicians (91%; 95% CI, 89% to 93%; p<0.001). As measured by the Brier score, GPT-3 confidence in its top prediction was reasonably well-calibrated for diagnosis (Brier score = 0.18) and triage (Brier score = 0.22).
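The Brier scores reported above measure calibration as the mean squared difference between a stated confidence and the binary outcome (1 if the prediction was correct, 0 otherwise); lower is better. A minimal sketch, using illustrative confidence values rather than the study's data:

```python
def brier_score(confidences, outcomes):
    """Mean squared difference between predicted probability and the
    binary outcome (1 = correct, 0 = incorrect). Lower is better;
    0.0 is perfect calibration."""
    assert len(confidences) == len(outcomes)
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)

# Hypothetical example: four predictions with stated confidence and correctness
confs = [0.9, 0.7, 0.8, 0.6]
hits = [1, 1, 0, 1]
print(round(brier_score(confs, hits), 3))  # 0.225
```

By this yardstick, the reported scores of 0.18 (diagnosis) and 0.22 (triage) indicate moderately well-calibrated confidence, since a constant 50% guess on balanced outcomes would score 0.25.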
Conclusions and relevance: A general-purpose AI language model without any content-specific training could perform diagnosis at a level close to, but below, that of physicians, and better than lay individuals. The model performed less well on triage, where its accuracy was closer to that of lay individuals.
Keywords: artificial intelligence; diagnosis; machine learning; triage.
Similar articles
- The Accuracy and Potential Racial and Ethnic Biases of GPT-4 in the Diagnosis and Triage of Health Conditions: Evaluation Study. JMIR Med Educ. 2023 Nov 2;9:e47532. doi: 10.2196/47532. PMID: 37917120.
- Exploring the effectiveness of artificial intelligence, machine learning and deep learning in trauma triage: A systematic review and meta-analysis. Digit Health. 2023 Oct 9;9:20552076231205736. doi: 10.1177/20552076231205736. PMID: 37822960.
- Artificial intelligence chatbot performance in triage of ophthalmic conditions. Can J Ophthalmol. 2023 Aug 9:S0008-4182(23)00234-X. doi: 10.1016/j.jcjo.2023.07.016. Online ahead of print. PMID: 37572695.
- Triage and Diagnostic Accuracy of Online Symptom Checkers: Systematic Review. J Med Internet Res. 2023 Jun 2;25:e43803. doi: 10.2196/43803. PMID: 37266983.
- Assessment of Diagnosis and Triage in Validated Case Vignettes Among Nonphysicians Before and After Internet Search. JAMA Netw Open. 2021 Mar 1;4(3):e213287. doi: 10.1001/jamanetworkopen.2021.3287. PMID: 33779741.