S1320
Clinical - Urology
ESTRO 2026
Subsequent stepwise exclusion analyses showed that the significance disappeared when Deepseek and Copilot were removed from the comparison, while it persisted when any other model was excluded. These findings indicate that the overall difference was mainly driven by the lower performance of these two LLMs.
Conclusion: Among the evaluated large language models, Deepseek and Copilot demonstrated limited accuracy specifically in survival prediction, whereas their recommendations on treatment decision-making and overall management guidance were largely appropriate. The other LLMs demonstrated high consistency and clinical reliability across all evaluated treatment decision scenarios. Although our results are encouraging, they should be interpreted with caution: to safeguard patient safety and evidence-based decision-making, the use of LLMs should remain under expert supervision.
Keywords: AI, Prostate Cancer, Clinical Decision Evaluation
radiotherapy. Keywords: prostate cancer, dose accumulation
Digital Poster 5158
Can Large Language Models Think Like Radiation Oncologists? An Evaluation in Prostate Cancer Cases
Berna Akkus Yildirim 1 , Emre Uysal 1 , Baver Tutun 1 , Emre Batuhan Yildirim 2 , Gorkem Durak 3 , Halil Akboru 1
1 Department of Radiation Oncology, University of Health Science Prof. Dr. Cemil Tascioglu City Hospital, Istanbul, Turkey. 2 Computer Science, Ozyegin University, Istanbul, Turkey. 3 Machine & Hybrid Intelligence Lab, Department of Radiology, Northwestern University, Chicago, USA
Purpose/Objective: Recent advances in large language models (LLMs) have enabled their potential application in clinical oncology decision-making. However, the reliability and clinical alignment of LLM-generated recommendations in patient-specific contexts remain largely unexplored. We aimed to evaluate the concordance of LLM recommendations with expert radiation oncologists' opinions on treatment decisions, using synthetic data of prostate cancer patients.
Material/Methods: Twelve synthetic clinical cases covering different oncological scenarios were used to assess the reasoning performance of six LLMs (GPT-4o, Gemini 2.5 Flash, Copilot, Deepseek v3, Claude 4.5 Sonnet, GPT-5). Each model was prompted with clinical data of prostate cancer patients and asked about five question categories: treatment prioritization, treatment planning, possible side effects, expected treatment response, and survival prediction. Responses were independently generated by each LLM and rated by two radiation oncologists on a three-point ordinal scale (1 = incorrect, 2 = partially correct, 3 = fully correct). The mean of the two expert ratings was used as the final accuracy score for each LLM. Comparative performance among the six models was evaluated using the Friedman test for related ordinal data. When the overall test reached statistical significance, stepwise exclusion analyses were performed to identify which models primarily contributed to the observed difference. Statistical analyses were conducted in SPSS (version 26), and a p-value < 0.05 was considered significant.
Results: Across the first four question categories, no significant differences were observed among the six LLMs (all p > 0.05). In contrast, for survival prediction, the Friedman test demonstrated a statistically significant difference in accuracy scores (χ²(5) = 12.684, p = 0.026).
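The Friedman test with a model-exclusion re-test described in the methods can be sketched as follows. This is a minimal illustration in Python with SciPy, not the authors' SPSS workflow: the helper name `friedman_excluding`, the model labels, and the score matrix are all hypothetical, and the scores are constructed for illustration only and are not the study's data.

```python
import numpy as np
from scipy.stats import friedmanchisquare


def friedman_excluding(scores, names, exclude=()):
    """Friedman test across models (columns), optionally dropping some.

    scores:  (n_cases, n_models) array of per-case accuracy scores.
    names:   model labels, one per column.
    exclude: labels of models to leave out of the comparison.
    Returns (chi-square statistic, p-value, degrees of freedom).
    """
    scores = np.asarray(scores, dtype=float)
    keep = [j for j, name in enumerate(names) if name not in exclude]
    if len(keep) < 3:
        raise ValueError("the Friedman test needs at least three models")
    # Each column (one model's scores over all cases) is one related sample.
    stat, p = friedmanchisquare(*(scores[:, j] for j in keep))
    return float(stat), float(p), len(keep) - 1


# Illustrative scores only (NOT the study's data): 12 cases x 6 models.
# Five hypothetical models alternate between 2.5 and 3.0; the sixth,
# "model_F", is rated 1.0 in every case.
names = [f"model_{c}" for c in "ABCDEF"]
pattern = [3.0, 2.5, 3.0, 2.5, 3.0]
scores = np.array(
    [[pattern[(i + j) % 5] for j in range(5)] + [1.0] for i in range(12)]
)

full_stat, full_p, full_df = friedman_excluding(scores, names)
red_stat, red_p, red_df = friedman_excluding(scores, names, exclude={"model_F"})

print(f"all models:      chi2({full_df}) = {full_stat:.3f}, p = {full_p:.4f}")
print(f"without model_F: chi2({red_df}) = {red_stat:.3f}, p = {red_p:.4f}")
```

With six models the test has 5 degrees of freedom, which matches the χ²(5) reported in the results. Repeating the call with different `exclude` sets shows which exclusions make the overall difference lose significance, which is the logic behind the stepwise exclusion analysis.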