This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2023 Feb 28:rs.3.rs-2566942.

doi: 10.21203/rs.3.rs-2566942/v1.

Assessing the Accuracy and Reliability of AI-Generated Medical Responses: An Evaluation of the Chat-GPT Model

Douglas Johnson¹, Rachel Goodman², J Patrinely¹, Cosby Stone³, Eli Zimmerman¹, Rebecca Donald¹, Sam Chang¹, Sean Berkowitz¹, Avni Finn¹, Eiman Jahangir¹, Elizabeth Scoville¹, Tyler Reese¹, Debra Friedman¹, Julie Bastarache¹, Yuri van der Heijden¹, Jordan Wright¹, Nicholas Carter¹, Matthew Alexander¹, Jennifer Choe¹, Cody Chastain¹, John Zic¹, Sara Horst¹, Isik Turker¹, Rajiv Agarwal¹, Evan Osmundson¹, Kamran Idrees¹, Colleen Kieman¹, Chandrasekhar Padmanabhan¹, Christina Bailey¹, Cameron Schlegel¹, Lola Chambless⁴, Mike Gibson¹, Travis Osterman¹, Lee Wheless¹

Affiliations

¹ Vanderbilt University Medical Center.
² Vanderbilt University School of Medicine.
³ Vanderbilt University Medical Center, Nashville, Tennessee.
⁴ Vanderbilt University.

PMID: 36909565
PMCID: PMC10002821
DOI: 10.21203/rs.3.rs-2566942/v1

Douglas Johnson et al. Res Sq. 2023.

Abstract

Background: Natural language processing models such as ChatGPT can generate text-based content and are poised to become a major information source in medicine and beyond. The accuracy and completeness of ChatGPT's answers to medical queries are not known.

Methods: Thirty-three physicians across 17 specialties generated 284 medical questions, each with either a binary (yes/no) or a descriptive answer, and subjectively classified them as easy, medium, or hard. The physicians then graded ChatGPT-generated answers to these questions for accuracy (6-point Likert scale; 1 = completely incorrect to 6 = completely correct) and completeness (3-point Likert scale; 1 = incomplete to 3 = complete plus additional context). Scores were summarized with descriptive statistics and compared using Mann-Whitney U or Kruskal-Wallis testing.

Results: Across all questions (n=284), median accuracy score was 5.5 (between almost completely and completely correct) with mean score of 4.8 (between mostly and almost completely correct). Median completeness score was 3 (complete and comprehensive) with mean score of 2.5. For questions rated easy, medium, and hard, median accuracy scores were 6, 5.5, and 5 (mean 5.0, 4.7, and 4.6; p=0.05). Accuracy scores for binary and descriptive questions were similar (median 6 vs. 5; mean 4.9 vs. 4.7; p=0.07). Of 36 questions with scores of 1-2, 34 were re-queried/re-graded 8-17 days later with substantial improvement (median 2 vs. 4; p<0.01).

Conclusions: ChatGPT generated largely accurate information in response to diverse medical queries, as judged by academic physician specialists, although with important limitations. Further research and model development are needed to correct inaccuracies and for validation.

Keywords: ChatGPT; artificial intelligence; clinical decision making; deep learning; knowledge dissemination; large language model; medical education; natural language processing.
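
The comparison named in the Methods (descriptive statistics plus a Mann-Whitney U test between two groups of Likert scores) can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the score lists are hypothetical placeholders rather than study data, the helper names `midranks` and `mann_whitney_u` are invented for this sketch, and the p-value uses the tie-corrected normal approximation without continuity correction. In practice a statistics package would normally be used (for example `scipy.stats.mannwhitneyu` for two groups and `scipy.stats.kruskal` for three or more).

```python
from statistics import NormalDist, mean, median


def midranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic for x, approximate p-value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = list(x) + list(y)
    ranks = midranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2

    # Tie correction: sum of (t^3 - t) over each group of t tied values.
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:  # every value identical: no evidence of a difference
        return u1, 1.0

    z = (u1 - n1 * n2 / 2) / variance ** 0.5
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u1, p


if __name__ == "__main__":
    # Hypothetical 6-point accuracy scores, for illustration only.
    binary_scores = [6, 6, 5, 6, 4, 6, 2, 5]
    descriptive_scores = [5, 4, 6, 3, 5, 2, 5, 4]

    for name, scores in (("binary", binary_scores), ("descriptive", descriptive_scores)):
        print(f"{name}: median={median(scores)}, mean={mean(scores):.2f}")

    u, p = mann_whitney_u(binary_scores, descriptive_scores)
    print(f"U={u}, approximate two-sided p={p:.3f}")
```

Rank-based tests suit Likert scores because the scale is ordinal and heavily tied. The same property explains how a median can sit well above the mean, as in the Results, when a minority of answers receive very low scores.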

Figures

**Figure 1**
Methodology. *DBJ and LEW scored two separate datasets of melanoma and immunotherapy and common conditions questions.

**Figure 2**
Accuracy of Chat-GPT-Generated Answers. Accuracy of AI answers from multispecialty questions (A-C) or all questions (multispecialty, melanoma and immunotherapy, and common medical conditions; D-F). *p < 0.01, **p = 0.03

Cited by

Is ChatGPT a trusted source of information for total hip and knee arthroplasty patients?
Wright BM, Bodnar MS, Moore AD, Maseda MC, Kucharik MP, Diaz CC, Schmidt CM, Mir HR. Bone Jt Open. 2024 Feb 15;5(2):139-146. doi: 10.1302/2633-1462.52.BJO-2023-0113.R1. PMID: 38354748.

ChatGPT is not ready yet for use in providing mental health assessment and interventions.
Dergaa I, Fekih-Romdhane F, Hallit S, Loch AA, Glenn JM, Fessi MS, Ben Aissa M, Souissi N, Guelmami N, Swed S, El Omri A, Bragazzi NL, Ben Saad H. Front Psychiatry. 2024 Jan 4;14:1277756. doi: 10.3389/fpsyt.2023.1277756. eCollection 2023. PMID: 38239905.

The opportunities and challenges of adopting ChatGPT in medical research.
Alsadhan A, Al-Anezi F, Almohanna A, Alnaim N, Alzahrani H, Shinawi R, AboAlsamh H, Bakhshwain A, Alenazy M, Arif W, Alyousef S, Alhamidi S, Alghamdi A, AlShrayfi N, Rubaian NB, Alanzi T, AlSahli A, Alturki R, Herzallah N. Front Med (Lausanne). 2023 Dec 22;10:1259640. doi: 10.3389/fmed.2023.1259640. eCollection 2023. PMID: 38188345.

Applications of machine learning in familial hypercholesterolemia.
Luo RF, Wang JH, Hu LJ, Fu QA, Zhang SY, Jiang L. Front Cardiovasc Med. 2023 Sep 26;10:1237258. doi: 10.3389/fcvm.2023.1237258. eCollection 2023. PMID: 37823179. Review.
