This is a preprint.
The Diagnostic and Triage Accuracy of the GPT-3 Artificial Intelligence Model
- PMID: 36778449
- PMCID: PMC9915829
- DOI: 10.1101/2023.01.30.23285067
Abstract
Importance: Artificial intelligence (AI) applications in health care have been effective in many areas of medicine, but they are often trained for a single task using labeled data, making deployment and generalizability challenging. Whether a general-purpose AI language model can perform diagnosis and triage is unknown.
Objective: Compare the general-purpose Generative Pre-trained Transformer 3 (GPT-3) AI model's diagnostic and triage performance to attending physicians and lay adults who use the Internet.
Design: We compared the accuracy of GPT-3's diagnosis and triage for 48 validated case vignettes of both common (e.g., viral illness) and severe (e.g., heart attack) conditions with that of lay people and practicing physicians. We also examined how well calibrated GPT-3's confidence was for diagnosis and triage.
Setting and participants: The GPT-3 model, a nationally representative sample of lay people, and practicing physicians.
Exposure: Validated case vignettes (<60 words; <6th grade reading level).
Main outcomes and measures: Correct diagnosis, correct triage.
Results: Among all cases, GPT-3 replied with the correct diagnosis in its top 3 for 88% (95% CI, 75% to 94%) of cases, compared to 54% (95% CI, 53% to 55%) for lay individuals (p<0.001) and 96% (95% CI, 94% to 97%) for physicians (p=0.0354). GPT-3 triaged (71% correct; 95% CI, 57% to 82%) similarly to lay individuals (74%; 95% CI, 73% to 75%; p=0.73); both were significantly worse than physicians (91%; 95% CI, 89% to 93%; p<0.001). As measured by the Brier score, GPT-3 confidence in its top prediction was reasonably well-calibrated for diagnosis (Brier score = 0.18) and triage (Brier score = 0.22).
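The Brier scores reported above measure calibration as the mean squared difference between a stated confidence and the binary outcome (1 if the prediction was correct, 0 otherwise); lower is better. A minimal sketch, using illustrative confidence values rather than the study's data:

```python
def brier_score(confidences, outcomes):
    """Mean squared difference between predicted probability and the
    binary outcome (1 = correct, 0 = incorrect). Lower is better;
    0.0 is perfect calibration."""
    assert len(confidences) == len(outcomes)
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)

# Hypothetical example: four predictions with stated confidence and correctness
confs = [0.9, 0.7, 0.8, 0.6]
hits = [1, 1, 0, 1]
print(round(brier_score(confs, hits), 3))  # 0.225
```

By this yardstick, the reported scores of 0.18 (diagnosis) and 0.22 (triage) indicate moderately well-calibrated confidence, since a constant 50% guess on balanced outcomes would score 0.25.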
Conclusions and relevance: A general-purpose AI language model without any content-specific training could perform diagnosis at a level close to, but below, that of physicians, and better than lay individuals. The model performed less well on triage, where its accuracy was closer to that of lay individuals.
Keywords: artificial intelligence; diagnosis; machine learning; triage.
Similar articles
- The Accuracy and Potential Racial and Ethnic Biases of GPT-4 in the Diagnosis and Triage of Health Conditions: Evaluation Study. JMIR Med Educ. 2023 Nov 2;9:e47532. doi: 10.2196/47532. PMID: 37917120.
- Exploring the effectiveness of artificial intelligence, machine learning and deep learning in trauma triage: A systematic review and meta-analysis. Digit Health. 2023 Oct 9;9:20552076231205736. doi: 10.1177/20552076231205736. PMID: 37822960.
- Artificial intelligence chatbot performance in triage of ophthalmic conditions. Can J Ophthalmol. 2023 Aug 9:S0008-4182(23)00234-X. doi: 10.1016/j.jcjo.2023.07.016. Online ahead of print. PMID: 37572695.
- Triage and Diagnostic Accuracy of Online Symptom Checkers: Systematic Review. J Med Internet Res. 2023 Jun 2;25:e43803. doi: 10.2196/43803. PMID: 37266983.
- Assessment of Diagnosis and Triage in Validated Case Vignettes Among Nonphysicians Before and After Internet Search. JAMA Netw Open. 2021 Mar 1;4(3):e213287. doi: 10.1001/jamanetworkopen.2021.3287. PMID: 33779741.