S1327
Interdisciplinary - Education in radiation oncology
ESTRO 2026
and Enhancing Collaboration with ESTRO: A Report of the National Societies Survey 2024. Radiother Oncol. 2025;210:111006. doi:10.1016/j.radonc.2025.111006
Keywords: RT education, inequalities, South-East Europe

Digital Poster
1655
Assessing Large Language Models Through Stepwise Case Vignettes of Non-Small Cell Lung Cancer for Cost Efficiency Clinical Accuracy and Explainability
Mehmet Halici 1, Burak Ertan 2, Kimia Cepni 1, Tanju Kapagan 3, Cumhur Yildirim 4, Gokmen Umut Erdem 5, Serkan Salturk 6, Ibrahim Cem Balci 7, Huriye Senay Kiziltan 1, Huseyin Uvet 7
1 Radiation Oncology, Başakşehir Çam and Sakura City Hospital, Istanbul, Turkey. 2 Computer Engineering, Yıldız Technical University, Istanbul, Turkey. 3 Medical Oncology, Çorlu State Hospital, Tekirdag, Turkey. 4 Radiation Oncology, Istanbul University-Cerrahpaşa, Faculty of Medicine, Istanbul, Turkey. 5 Medical Oncology, Başakşehir Çam and Sakura City Hospital, Istanbul, Turkey. 6 Electronics and Communication Engineering, Yıldız Technical University, Istanbul, Turkey. 7 Mechatronics Engineering, Yıldız Technical University, Istanbul, Turkey
Purpose/Objective: This study aimed to evaluate the clinical decision-making performance and cost-efficiency of three widely used large language models (LLMs) using comprehensive, stepwise clinical case vignettes designed for non-small cell lung cancer (NSCLC). The objective was to determine their reliability, interpretability, and economic feasibility as potential clinical decision-support tools in oncology.
Material/Methods: Five fictional NSCLC case vignettes encompassing diagnostic, therapeutic, and follow-up stages were developed by a radiation oncologist and a senior resident. Each stage included two sub-steps, generating 30 open-ended questions in total that guided the clinical decision-making process. Evidence-based reference answers were produced according to international clinical guidelines and validated by independent radiation and medical oncology specialists. The finalized vignettes were submitted via API to Google Gemini 2.5, ChatGPT-5, and Claude Opus 4.1, using a waterfall sequential prompting approach. For each response, input–output token counts, response time, and token-based cost (USD) were recorded. Model outputs were evaluated by two independent experts using a five-point Likert scale based on adherence to evidence-based answers and clinical interpretability (Figure 1). Descriptive analyses were performed; data distribution was assessed using the Kolmogorov–Smirnov test. Depending on normality, comparisons used the Kruskal–Wallis test or one-way ANOVA, followed by post-hoc tests, with p < 0.05 considered significant. Analyses were conducted with IBM SPSS Statistics.
Results: Thirty open-ended, evidence-based questions from five structured vignettes were analyzed, totaling 98,543 tokens (mean 6,569 per case). The mean total cost per complete patient simulation was 0.38 USD. Significant differences were observed among models in token-based cost, reasoning time, and interpretability, while clinical accuracy was comparable. Gemini 2.5 showed the lowest mean token-based cost (0.028 ± 0.005 USD/question), followed by ChatGPT-5 (0.045 ± 0.006) and Claude Opus 4.1 (0.063 ± 0.007; p < 0.001). ChatGPT-5 achieved the highest interpretability score (p = 0.033), whereas reasoning time was shortest for Gemini 2.5 (32.6 s) and longest for ChatGPT-5 (52.6 s; p < 0.001). Stage-specific analyses across diagnostic, therapeutic, and follow-up phases showed consistent trends.
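The abstract does not specify how the waterfall sequential prompting or the per-response logging was implemented. The following is only a minimal sketch of one plausible reading, in which each question is sent together with the vignette and all earlier questions and answers, while token counts, response time, and token-based cost are logged per response. The prices, the whitespace token counter, the `stub_model` function, and all other names are hypothetical placeholders, not the authors' code or any vendor's API.

```python
import time
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical prices in USD per one million tokens (NOT from the abstract).
PRICE_PER_M_INPUT = 1.00
PRICE_PER_M_OUTPUT = 10.00


@dataclass
class StepLog:
    question: str
    answer: str
    input_tokens: int
    output_tokens: int
    elapsed_s: float
    cost_usd: float


def count_tokens(text: str) -> int:
    """Crude whitespace count; a real API would report exact token usage."""
    return len(text.split())


def token_cost(input_tokens: int, output_tokens: int) -> float:
    """Token-based cost in USD under the placeholder prices above."""
    total = input_tokens * PRICE_PER_M_INPUT + output_tokens * PRICE_PER_M_OUTPUT
    return total / 1_000_000


def run_waterfall(vignette: str, questions: List[str],
                  ask: Callable[[str], str]) -> List[StepLog]:
    """Ask questions in order; each prompt carries all earlier Q&A pairs."""
    context = vignette
    logs: List[StepLog] = []
    for question in questions:
        prompt = f"{context}\n\nQuestion: {question}"
        start = time.perf_counter()
        answer = ask(prompt)
        elapsed = time.perf_counter() - start
        n_in, n_out = count_tokens(prompt), count_tokens(answer)
        logs.append(StepLog(question, answer, n_in, n_out, elapsed,
                            token_cost(n_in, n_out)))
        # Carry the answered step forward into the next prompt.
        context = f"{prompt}\nAnswer: {answer}"
    return logs


def stub_model(prompt: str) -> str:
    """Stand-in for a real LLM API call; returns a fixed string."""
    return "Stub answer for illustration only."


logs = run_waterfall(
    "Fictional NSCLC vignette text.",
    ["What is the next diagnostic step?", "Which staging work-up is indicated?"],
    stub_model,
)
total_cost = sum(log.cost_usd for log in logs)
```

Because every prompt repeats the accumulated context, input tokens (and therefore cost) grow with each step under this reading, which is one reason per-question cost can differ between stages.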