Evaluating large language models on a highly-specialized topic, radiation oncology physics

Jason Holmes et al.

Front Oncol. 2023 Jul 17;13:1219326. doi: 10.3389/fonc.2023.1219326. eCollection 2023.

Abstract

Purpose: We present the first study to investigate Large Language Models (LLMs) in answering radiation oncology physics questions. Because popular exams like AP Physics, the LSAT, and the GRE have large test-taker populations and ample test-preparation resources in circulation, they may not accurately reflect the true potential of LLMs. This paper proposes evaluating LLMs on a highly-specialized topic, radiation oncology physics, which may be more pertinent to scientific and medical communities while also serving as a valuable benchmark for LLMs.

Methods: We developed an exam consisting of 100 radiation oncology physics questions based on our expertise. Four LLMs, ChatGPT (GPT-3.5), ChatGPT (GPT-4), Bard (LaMDA), and BLOOMZ, were evaluated against medical physicists and non-experts. The performance of ChatGPT (GPT-4) was further explored by prompting it to explain first, then answer. The deductive reasoning capability of ChatGPT (GPT-4) was evaluated using a novel approach (substituting the correct answer with "None of the above choices is the correct answer."). A majority vote analysis was used to approximate how well each group could score when working together.
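
To illustrate the majority vote analysis described above, the following is a minimal sketch of how a group's consensus score could be computed. The function name and data layout are assumptions made for illustration and do not reflect the authors' actual analysis code.

    from collections import Counter

    def majority_vote_score(answer_sets, answer_key):
        # answer_sets: one list of answers per trial (for an LLM) or per member
        # (for a human group), e.g. [["A", "C", ...], ["A", "B", ...], ...]
        # answer_key: the correct answer for each question.
        n_correct = 0
        for q_idx, correct in enumerate(answer_key):
            votes = Counter(answers[q_idx] for answers in answer_sets)
            consensus, _ = votes.most_common(1)[0]  # ties broken arbitrarily
            if consensus == correct:
                n_correct += 1
        return n_correct

    # Illustrative usage: three trials on a four-question exam score 4/4 by majority vote.
    key = ["A", "B", "C", "D"]
    trials = [["A", "B", "C", "A"],
              ["A", "D", "C", "D"],
              ["B", "B", "C", "D"]]
    print(majority_vote_score(trials, key))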

Results: ChatGPT (GPT-4) outperformed all other LLMs and medical physicists, on average, with improved accuracy when prompted to explain before answering. ChatGPT (GPT-3.5 and GPT-4) showed a high level of consistency in their answer choices across multiple trials, whether correct or incorrect, a characteristic that was not observed in the human test groups or Bard (LaMDA). In evaluating deductive reasoning ability, ChatGPT (GPT-4) demonstrated surprising accuracy, suggesting the potential presence of an emergent ability. Finally, although ChatGPT (GPT-4) performed well overall, its intrinsic properties did not allow for further improvement when scoring based on a majority vote across trials. In contrast, a team of medical physicists was able to greatly outperform ChatGPT (GPT-4) using a majority vote.

Conclusion: This study suggests great potential for LLMs to work alongside radiation oncology experts as highly knowledgeable assistants.

Keywords: ChatGPT; artificial intelligence; large language model; medical physics; natural language processing.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1
Raw marks for each test, where the rows are separate tests and the columns are the test questions. Dark shaded squares represent correct answers.

Figure 2
Overall performance and uncertainty in test results. (A) Mean test scores for each LLM by category. (B) Standard deviation in total scores. (C) Average correlation between trials.

Figure 3
Confidence in answers. The number of correct-answer occurrences per question for each LLM and human group. The dashed red curve indicates the expected distribution if the answers were selected at random, based on the Poisson distribution (an illustrative calculation of this baseline appears after the figure list).

Figure 4
Scores by category, tabulated by majority vote among trials for LLMs and within the group for humans.

Figure 5
The improvement for Trial 1 due to using the explain first, then answer method.

Figure 6
The scores for Trial 1 after replacing the correct answer with "None of the above choices is the correct answer.", a method for testing deductive reasoning, and the subsequent improvement due to using the explain first, then answer method.
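
As a rough sketch of the random-guessing baseline shown as the dashed curve in Figure 3: with a fixed number of answer choices per question and several independent trials, the number of trials in which a given question is answered correctly by chance follows a binomial distribution, which for a small per-question success probability is well approximated by a Poisson distribution. The parameters below (5 answer choices, 5 trials, 100 questions) are assumptions for illustration, not the paper's reported settings.

    from math import comb

    # Illustrative assumptions: 100 questions, 5 answer choices per question,
    # and 5 independent trials answering uniformly at random.
    n_questions, n_choices, k_trials = 100, 5, 5
    p = 1 / n_choices  # chance of guessing a single question correctly

    # Expected number of questions answered correctly in exactly m of the k trials.
    # The exact model is binomial; it is approximated by Poisson(lambda = k * p).
    for m in range(k_trials + 1):
        prob = comb(k_trials, m) * p**m * (1 - p) ** (k_trials - m)
        print(f"questions correct in exactly {m} trials: ~{n_questions * prob:.1f}")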
