Evaluating large language models on a highly-specialized topic, radiation oncology physics
- PMID: 37529688
- PMCID: PMC10388568
- DOI: 10.3389/fonc.2023.1219326
Abstract
Purpose: We present the first study to investigate Large Language Models (LLMs) in answering radiation oncology physics questions. Because popular exams like AP Physics, LSAT, and GRE have large test-taker populations and ample test preparation resources in circulation, they may not allow for accurately assessing the true potential of LLMs. This paper proposes evaluating LLMs on a highly-specialized topic, radiation oncology physics, which may be more pertinent to scientific and medical communities in addition to being a valuable benchmark of LLMs.
Methods: We developed an exam consisting of 100 radiation oncology physics questions based on our expertise. Four LLMs, ChatGPT (GPT-3.5), ChatGPT (GPT-4), Bard (LaMDA), and BLOOMZ, were evaluated against medical physicists and non-experts. The performance of ChatGPT (GPT-4) was further explored by being asked to explain first, then answer. The deductive reasoning capability of ChatGPT (GPT-4) was evaluated using a novel approach (substituting the correct answer with "None of the above choices is the correct answer."). A majority vote analysis was used to approximate how well each group could score when working together.
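The majority-vote analysis described above can be sketched as follows. This is a minimal illustration only, not the authors' code: the function name, the tie-breaking behavior, and the example exam data are all hypothetical assumptions.

```python
from collections import Counter

def majority_vote_score(trials, answer_key):
    """Score a group by taking, for each question, the answer chosen
    most often across that group's trials (ties broken arbitrarily)."""
    n_questions = len(answer_key)
    correct = 0
    for q in range(n_questions):
        votes = Counter(trial[q] for trial in trials)
        consensus, _ = votes.most_common(1)[0]
        if consensus == answer_key[q]:
            correct += 1
    return correct / n_questions

# Hypothetical example: three trials on a 4-question multiple-choice exam
trials = [list("ABCD"), list("ABDD"), list("ABCA")]
key = list("ABCD")
print(majority_vote_score(trials, key))  # 1.0 - the consensus answer matches the key on all 4
```

The intuition is that independent errors tend to be voted out, which is why a team of human experts can exceed its best individual member, whereas a model that repeats the same wrong answer across trials gains nothing from voting.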
Results: ChatGPT (GPT-4) outperformed all other LLMs and, on average, the medical physicists, with improved accuracy when prompted to explain before answering. ChatGPT (GPT-3.5 and GPT-4) showed a high level of consistency in its answer choices across repeated trials, whether correct or incorrect, a characteristic not observed in the human test groups or Bard (LaMDA). In the deductive reasoning evaluation, ChatGPT (GPT-4) demonstrated surprising accuracy, suggesting a possible emergent ability. Finally, although ChatGPT (GPT-4) performed well overall, its intrinsic properties did not allow for further improvement when scoring was based on a majority vote across trials. In contrast, a team of medical physicists was able to greatly outperform ChatGPT (GPT-4) using a majority vote.
Conclusion: This study suggests a great potential for LLMs to work alongside radiation oncology experts as highly knowledgeable assistants.
Keywords: ChatGPT; artificial intelligence; large language model; medical physics; natural language processing.
Copyright © 2023 Holmes, Liu, Zhang, Ding, Sio, McGee, Ashman, Li, Liu, Shen and Liu.
Conflict of interest statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Similar articles
- Performance of artificial intelligence in bariatric surgery: comparative analysis of ChatGPT-4, Bing, and Bard in the American Society for Metabolic and Bariatric Surgery textbook of bariatric surgery questions. Surg Obes Relat Dis. 2024 Jul;20(7):609-613. doi: 10.1016/j.soard.2024.04.014. Epub 2024 May 8. PMID: 38782611. Review.
- Artificial Intelligence for Anesthesiology Board-Style Examination Questions: Role of Large Language Models. J Cardiothorac Vasc Anesth. 2024 May;38(5):1251-1259. doi: 10.1053/j.jvca.2024.01.032. Epub 2024 Feb 1. PMID: 38423884. Review.
- Evaluating Large Language Models for the National Premedical Exam in India: Comparative Analysis of GPT-3.5, GPT-4, and Bard. JMIR Med Educ. 2024 Feb 21;10:e51523. doi: 10.2196/51523. PMID: 38381486. Free PMC article.
- Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models. JMIR Med Educ. 2024 Feb 13;10:e51391. doi: 10.2196/51391. PMID: 38349725. Free PMC article.
- Performance evaluation of ChatGPT, GPT-4, and Bard on the official board examination of the Japan Radiology Society. Jpn J Radiol. 2024 Feb;42(2):201-207. doi: 10.1007/s11604-023-01491-2. Epub 2023 Oct 4. PMID: 37792149. Free PMC article.
Cited by
- ChatGPT-based Biological and Psychological Data Imputation. Meta Radiol. 2023 Nov;1(3):100034. doi: 10.1016/j.metrad.2023.100034. Epub 2023 Nov 11. PMID: 38784385. Free PMC article.
- Evaluating Peer Review of Palliative Radiation Plans at a Canadian Tertiary Care Cancer Center. Cureus. 2024 Apr 8;16(4):e57839. doi: 10.7759/cureus.57839. eCollection 2024 Apr. PMID: 38721176. Free PMC article.
- A Systematic Review of ChatGPT and Other Conversational Large Language Models in Healthcare. medRxiv [Preprint]. 2024 Apr 27:2024.04.26.24306390. doi: 10.1101/2024.04.26.24306390. PMID: 38712148. Free PMC article. Preprint.
- Leveraging Large Language Models for Improved Patient Access and Self-Management: Assessor-Blinded Comparison Between Expert- and AI-Generated Content. J Med Internet Res. 2024 Apr 25;26:e55847. doi: 10.2196/55847. PMID: 38663010. Free PMC article.
- Surviving ChatGPT in healthcare. Front Radiol. 2024 Feb 23;3:1224682. doi: 10.3389/fradi.2023.1224682. eCollection 2023. PMID: 38464946. Free PMC article. Review.