This is a preprint.
Assessing the Accuracy and Reliability of AI-Generated Medical Responses: An Evaluation of the Chat-GPT Model
- PMID: 36909565
- PMCID: PMC10002821
- DOI: 10.21203/rs.3.rs-2566942/v1
Abstract
Background: Natural language processing models such as ChatGPT can generate text-based content and are poised to become a major information source in medicine and beyond. The accuracy and completeness of ChatGPT's responses to medical queries are not known.
Methods: Thirty-three physicians across 17 specialties generated 284 medical questions that they subjectively classified as easy, medium, or hard with either binary (yes/no) or descriptive answers. The physicians then graded ChatGPT-generated answers to these questions for accuracy (6-point Likert scale; range 1 - completely incorrect to 6 - completely correct) and completeness (3-point Likert scale; range 1 - incomplete to 3 - complete plus additional context). Scores were summarized with descriptive statistics and compared using Mann-Whitney U or Kruskal-Wallis testing.
Results: Across all questions (n=284), median accuracy score was 5.5 (between almost completely and completely correct) with mean score of 4.8 (between mostly and almost completely correct). Median completeness score was 3 (complete and comprehensive) with mean score of 2.5. For questions rated easy, medium, and hard, median accuracy scores were 6, 5.5, and 5 (mean 5.0, 4.7, and 4.6; p=0.05). Accuracy scores for binary and descriptive questions were similar (median 6 vs. 5; mean 4.9 vs. 4.7; p=0.07). Of 36 questions with scores of 1-2, 34 were re-queried/re-graded 8-17 days later with substantial improvement (median 2 vs. 4; p<0.01).
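The Methods describe comparing Likert-scale scores between question groups with Mann-Whitney U testing. As an illustration only (the scores below are synthetic, not the study's data), the U statistic for two groups of ordinal scores can be computed by pooling the samples, assigning midranks to ties, and subtracting the minimum possible rank sum:

```python
def mann_whitney_u(a, b):
    """Return (U1, U2) for samples a and b, using midranks for ties."""
    # Pool both samples, remembering each value's original position.
    pooled = sorted((v, i) for i, v in enumerate(list(a) + list(b)))
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at i.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + j + 1) / 2.0  # average of 1-based ranks i+1..j
        for k in range(i, j):
            ranks[pooled[k][1]] = midrank
        i = j
    r1 = sum(ranks[:len(a)])                      # rank sum of sample a
    u1 = r1 - len(a) * (len(a) + 1) / 2.0         # U statistic for a
    u2 = len(a) * len(b) - u1                     # complementary U for b
    return u1, u2

# Hypothetical accuracy scores for two question types (not study data):
binary_scores = [6, 6, 5, 6, 4]
descriptive_scores = [5, 5, 4, 6, 3]
u1, u2 = mann_whitney_u(binary_scores, descriptive_scores)
```

In practice a library routine (e.g. SciPy's `scipy.stats.mannwhitneyu`) would also supply the p-value; this sketch only shows the rank-based statistic that underlies the comparison reported above.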
Conclusions: ChatGPT generated largely accurate information in response to diverse medical queries as judged by academic physician specialists, although with important limitations. Further research and model development are needed to correct inaccuracies and to validate performance.
Keywords: ChatGPT; artificial intelligence; clinical decision making; deep learning; knowledge dissemination; large language model; medical education; natural language processing.
Similar articles
- Accuracy of Information given by ChatGPT for patients with Inflammatory Bowel Disease in relation to ECCO Guidelines. J Crohns Colitis. 2024 Mar 23:jjae040. doi: 10.1093/ecco-jcc/jjae040. Online ahead of print. PMID: 38520394
- Investigating the Accuracy and Completeness of an Artificial Intelligence Large Language Model About Uveitis: An Evaluation of ChatGPT. Ocul Immunol Inflamm. 2024 Feb 23:1-4. doi: 10.1080/09273948.2024.2317417. Online ahead of print. PMID: 38394625
- Assessing the Accuracy of Information on Medication Abortion: A Comparative Analysis of ChatGPT and Google Bard AI. Cureus. 2024 Jan 2;16(1):e51544. doi: 10.7759/cureus.51544. eCollection 2024 Jan. PMID: 38318564. Free PMC article.
- Is ChatGPT accurate and reliable in answering questions regarding head and neck cancer? Front Oncol. 2023 Dec 1;13:1256459. doi: 10.3389/fonc.2023.1256459. eCollection 2023. PMID: 38107064. Free PMC article.
- Accuracy and Reliability of Chatbot Responses to Physician Questions. JAMA Netw Open. 2023 Oct 2;6(10):e2336483. doi: 10.1001/jamanetworkopen.2023.36483. PMID: 37782499. Free PMC article.
Cited by
- Is ChatGPT a trusted source of information for total hip and knee arthroplasty patients? Bone Jt Open. 2024 Feb 15;5(2):139-146. doi: 10.1302/2633-1462.52.BJO-2023-0113.R1. PMID: 38354748. Free PMC article.
- ChatGPT is not ready yet for use in providing mental health assessment and interventions. Front Psychiatry. 2024 Jan 4;14:1277756. doi: 10.3389/fpsyt.2023.1277756. eCollection 2023. PMID: 38239905. Free PMC article.
- The opportunities and challenges of adopting ChatGPT in medical research. Front Med (Lausanne). 2023 Dec 22;10:1259640. doi: 10.3389/fmed.2023.1259640. eCollection 2023. PMID: 38188345. Free PMC article.
- Applications of machine learning in familial hypercholesterolemia. Front Cardiovasc Med. 2023 Sep 26;10:1237258. doi: 10.3389/fcvm.2023.1237258. eCollection 2023. PMID: 37823179. Free PMC article. Review.