2Department of Radiology, Ankara Mamak State Hospital, Ankara-Türkiye
3Department of Radiology, Kırıkkale High Specialty Hospital, Kırıkkale-Türkiye
DOI: 10.5505/tjo.2026.4733
Summary
OBJECTIVE
To evaluate the diagnostic performance of eight current large language models (LLMs) in applying the RECIST 1.1 guideline for oncologic treatment response imaging and to compare their performance with that of board-certified radiologists. This study explores the potential of LLMs as supportive adjuncts in cancer follow-up imaging.
METHODS
In this observational cross-sectional study, 50 text-based and 30 case-based multiple-choice questions derived from RECIST 1.1 were administered to eight LLMs (each queried with three different prompts) and to two junior radiologists with seven years of experience. Responses were independently scored as correct or incorrect, and non-parametric statistical analyses were performed to compare performance across groups.
RESULTS
LLMs demonstrated promising performance in text-based interpretation of RECIST, with only minor performance variations. Claude 3.5 Sonnet performed best, achieving 83.3% accuracy on case-based and 90% on text-based questions. The other models also exhibited robust performance, with no significant differences between LLMs and radiologists in case-based assessments. LLMs achieved similar results across the three different prompts, with minor variations.
CONCLUSION
LLMs have great potential for response evaluation in oncological imaging; they may not only support radiologists but could soon reshape clinical workflows for standardized follow-up reporting in radiology.
Introduction
Large language models (LLMs) represent a remarkable breakthrough in natural language processing, capable of performing specific tasks in radiology without additional training.[1-4] This positions LLMs as transformative forces poised to significantly reshape radiology practice. They have the potential to usher in a new era of efficiency and excellence, both as supportive diagnostic tools and in facilitating the reporting process. Consequently, there has been a rapid increase in studies investigating the radiological knowledge of LLMs and their potential applications and contributions to radiology.[3-7] Although many studies have evaluated the radiological knowledge of LLMs in different fields, the lack of studies evaluating their knowledge in oncologic radiology is an important gap.
The radiology report is vital in guiding patient management in oncology, requiring meticulous comparison with prior studies and standardized assessment. The Response Evaluation Criteria in Solid Tumors (RECIST) guideline, revised in 2009 as RECIST 1.1, was developed to address this need. The RECIST guideline comprises criteria such as defining measurable lesions (e.g., which measurement defines a measurable lymph node), identifying target lesions (e.g., which criteria a target lesion must meet), and categorizing response types (regression, stable disease, or progression). It provides a standardized approach to reporting solid tumor measurements and defines objective criteria for assessing changes in tumor size, ensuring a consistent and reliable approach to reporting.[8]
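To make these response categories concrete, the following is a minimal illustrative sketch (not part of the study materials; the function and variable names are hypothetical) that applies the published RECIST 1.1 target-lesion thresholds:[8]

```python
def recist_target_response(baseline_sum_mm: float, nadir_sum_mm: float,
                           current_sum_mm: float,
                           all_targets_disappeared: bool = False) -> str:
    """Classify target-lesion response per RECIST 1.1 thresholds (illustrative only)."""
    if all_targets_disappeared:
        # CR also requires any pathological lymph nodes to regress to <10 mm short axis
        return "CR"
    increase_from_nadir = current_sum_mm - nadir_sum_mm
    # PD: >=20% increase over the smallest sum on study AND an absolute increase >=5 mm
    if increase_from_nadir >= 0.20 * nadir_sum_mm and increase_from_nadir >= 5:
        return "PD"
    # PR: >=30% decrease from the baseline sum of diameters
    if (baseline_sum_mm - current_sum_mm) >= 0.30 * baseline_sum_mm:
        return "PR"
    return "SD"  # neither PR nor PD

# Example: baseline 100 mm, nadir 60 mm, current 75 mm -> 25% above nadir -> "PD"
print(recist_target_response(100, 60, 75))
```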
Previous studies have evaluated the proficiency and knowledge of various LLMs in different specific types of cancer.[2,9-12] Güneş et al.[13] tested the performance of current LLMs, in particular Claude 3.5 Sonnet, in interpreting BI-RADS categories via text-based questions and found that these models achieved remarkable accuracy (up to 90%), approaching the level of expertise of breast radiologists. In another study, Kaba et al.[14] demonstrated that advanced LLMs, especially ChatGPT-4, showed high accuracy (93%) in text-based questions in interpreting thyroid imaging guidelines based on the K-TIRADS classification system and emphasized the competence of LLMs in this field.
To the best of our knowledge, no study has compared the performance of LLMs in relation to RECIST 1.1, a critical guideline in the radiological reporting of follow-up imaging in cancer patients. We aimed to fill this gap by evaluating the knowledge of various LLMs of the RECIST 1.1 guideline and comparing it with that of radiologists.
Methods
Study Design
This experimental study utilized a cross-sectional design to assess the accuracy of eight different LLMs compared with radiologists in answering text-based and case-based multiple-choice questions (MCQs) pertaining to RECIST 1.1. The LLMs' answers were benchmarked against responses from two board-certified radiologists (European Diploma in Radiology, EDiR): Radiologist 1 (Y.C.G.) (R1) and Radiologist 2 (T.C.) (R2), both with seven years of experience in general radiology. The text-based and case-based MCQs were designed based on the RECIST 1.1 guideline by a third board-certified radiologist (EDiR), Radiologist 3 (E.Ç.) (R3), also with seven years of experience in radiology.
The MCQs did not include any authentic patient data or images; therefore, ethical committee approval was neither required nor applicable for this study. Methodological transparency and reproducibility were ensured by adhering to the Standards for Reporting Diagnostic Accuracy Studies (STARD) guideline.[15] An overview of the flowchart is presented in Figure 1.
Fig. 1. The workflow of the study.
Data Collection for Text-based and Case-based Multiple-choice Questions
A total of 50 text-based MCQs and 30 case-based MCQs were utilized in the study. These questions comprehensively covered all sections of RECIST 1.1 and tested the application of the information therein. Each question was carefully constructed to focus on a single, specific, and critical concept relevant to radiological practice under this guideline. Each MCQ had five choices, and only one choice was correct. A complete list of the text-based and case-based MCQs and the dataset of the study are available in the Appendix.
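For readers who wish to reuse the Appendix questions programmatically, a minimal sketch of how one MCQ record could be represented is given below; the field names and the example question are illustrative assumptions, not the format of the published Appendix, although the RECIST 1.1 lymph node threshold shown is taken from the guideline.[8]

```python
from dataclasses import dataclass, field

@dataclass
class MCQ:
    """One five-choice question with a single correct answer (illustrative structure)."""
    question_id: str
    question_type: str          # "text-based" or "case-based"
    stem: str                   # the question text or case vignette
    choices: dict = field(default_factory=dict)  # {"A": "...", ..., "E": "..."}
    correct_choice: str = ""    # one of "A".."E"

example = MCQ(
    question_id="T01",
    question_type="text-based",
    stem="What is the minimum short-axis diameter for a lymph node to be "
         "considered measurable under RECIST 1.1?",
    choices={"A": "5 mm", "B": "10 mm", "C": "15 mm", "D": "20 mm", "E": "25 mm"},
    correct_choice="C",  # RECIST 1.1: measurable lymph node short axis >= 15 mm
)
```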
Design of Input-output Procedures for LLMs
The three different input prompts provided to the LLMs were as follows. Prompt 1: "Act like a professor of radiology who has 30 years of experience in oncological imaging, especially with studies on RECIST 1.1. Give just the letter of the most correct choice of the multiple-choice questions that I will ask you. Each question has only one correct answer." Prompt 2: "You are a senior academic radiologist. I have some questions about RECIST 1.1. I will ask you multiple-choice questions with a single correct answer. Provide only the letter of the most accurate choice for each." Prompt 3: "I have a few questions about RECIST 1.1 criteria. Some of them are text-based, and some of them are case-based questions. I will present you with multiple-choice questions, and each has only one correct answer. Please reply with the letter corresponding to the best choice only, without any explanation." These prompts were consistently employed across eight distinct platforms with default hyperparameters by R3 in February 2025: Claude 3 Opus and Claude 3.5 Sonnet (https://claude.ai.com), ChatGPT-o1 and ChatGPT-4o (https://chat.openai.com), Gemini 1.5 Pro (https://gemini.google.com), Mistral Large 2 (https://mistral.ai), Llama 3.1 405B (https://metaai.com), and Perplexity Pro (https://perplexity.ai). To assess the consistency of responses across the three prompts within each model, the responses for each prompt and model were evaluated carefully, and the prompt yielding the most successful responses across all models was recorded by R3.
The MCQs were administered sequentially within a single conversation session per LLM to maintain uniformity. None of the LLMs underwent additional pre-training or fine-tuning by the study authors, and no supplementary details that could potentially affect the study results were provided (Fig. 2). R3 reviewed the LLM responses and categorized them as correct (1) or incorrect (0).
Fig. 2. An example chat session with ChatGPT-o1.
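Although the models in this study were queried through their public chat interfaces, a minimal sketch of how the binary scoring step could be reproduced programmatically is shown below; the answer-key file, column names, and example answers are illustrative assumptions rather than the study's actual tooling.

```python
import csv

def score_responses(answer_key_csv: str, model_answers: dict) -> list:
    """Score letter-choice answers against the key as 1 (correct) or 0 (incorrect)."""
    scores = []
    with open(answer_key_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):  # expected columns: question_id, correct_choice
            given = model_answers.get(row["question_id"], "").strip().upper()
            scores.append(1 if given == row["correct_choice"].strip().upper() else 0)
    return scores

# Hypothetical usage: accuracy of one model on the case-based MCQs
model_answers = {"C01": "B", "C02": "D"}  # letters recorded from the chat session
scores = score_responses("case_based_key.csv", model_answers)
accuracy = 100 * sum(scores) / len(scores) if scores else 0.0
print(f"Accuracy: {accuracy:.1f}%")
```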
Radiologist Performance Evaluation
R1 and R2 independently answered the MCQs in a blinded manner in January 2025 using their personal computers. They completed the text-based MCQs first, immediately followed by the case-based MCQs without any interval. R3 separately evaluated their answers and categorized them as correct (1) or incorrect (0).
Statistical Analysis
The Kolmogorov-Smirnov test was used to assess data distribution. Descriptive statistics (minimum, maximum, median, interquartile range, percentages) were calculated. As the data were non-normally distributed, non-parametric tests were used. Consistency and performance across the three prompts were evaluated with the Friedman and McNemar tests; the McNemar test was also used to compare correct response rates between LLMs and radiologists. Chi-square tests assessed differences by question type. Bonferroni correction was applied for pairwise comparisons (p≤0.028 considered significant), while p≤0.05 indicated significance for the consistency and prompt-related analyses.
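As an illustration of the analysis pipeline described above, the sketch below shows how the Friedman, McNemar, and chi-square tests could be run on binary (1/0) score vectors using SciPy and statsmodels; the variable names and example arrays are hypothetical, not the study data.

```python
import numpy as np
from scipy.stats import friedmanchisquare, chi2_contingency
from statsmodels.stats.contingency_tables import mcnemar

# Binary scores (1 = correct, 0 = incorrect) for one model under the three prompts
prompt1 = np.array([1, 1, 0, 1, 1, 0, 1, 1])
prompt2 = np.array([1, 0, 0, 1, 1, 1, 1, 1])
prompt3 = np.array([1, 1, 0, 1, 0, 1, 1, 1])

# Consistency across the three prompts (Friedman test on paired responses)
stat, p_friedman = friedmanchisquare(prompt1, prompt2, prompt3)

# Paired comparison of correct-response rates between a model and a radiologist (McNemar)
model = prompt1
radiologist = np.array([1, 1, 1, 0, 1, 0, 1, 0])
table = [[np.sum((model == 1) & (radiologist == 1)), np.sum((model == 1) & (radiologist == 0))],
         [np.sum((model == 0) & (radiologist == 1)), np.sum((model == 0) & (radiologist == 0))]]
p_mcnemar = mcnemar(table, exact=True).pvalue

# Difference in accuracy by question type (text-based vs. case-based) with chi-square
counts = [[45, 5],   # text-based: correct, incorrect (hypothetical counts)
          [25, 5]]   # case-based: correct, incorrect (hypothetical counts)
chi2, p_chi2, _, _ = chi2_contingency(counts)

print(p_friedman, p_mcnemar, p_chi2)
```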
Results
Case-Based MCQs
Among the three different prompts (Prompt 1, 2, and 3), all models achieved their highest performance on case-based MCQs with "Prompt 1". However, the variation in performance across the different prompts did not reach statistical significance, and the responses generated by all models were consistent across the three prompts (p>0.05) (Table 1).
Table 1. The consistency of LLM responses with the different prompts (Prompt 1, Prompt 2, and Prompt 3)
With "Prompt 1", Claude 3.5 Sonnet demonstrated the highest accuracy at 83.3%, followed by R2 and Gemini 1.5 Pro, both of which achieved 80.0% (p>0.028). R1 closely followed with 76.7%, while ChatGPT-4o, Llama 3.1 405B, and Mistral Large 2 each recorded 73.3% (p>0.028). Claude 3 Opus and ChatGPT-o1 shared an accuracy of 66.7%. Perplexity Pro exhibited the lowest accuracy among LLMs and radiologists, with 60.0% (p>0.028) (Fig. 3).
Fig. 3. The accuracy of LLMs and radiologists on multiple-choice questions.
There was no significant difference in accuracy on case-based MCQs among LLMs and between LLMs and radiologists (p>0.028) (Table 2).
Text-Based MCQs
Similar to the case-based MCQs, all models reached their highest performance with "Prompt 1" among the three different prompts. The answers given by all models were consistent across the three prompts, and there were no significant performance differences among the prompts (p>0.05) (Table 1).
With "Prompt 1", Claude 3.5 Sonnet achieved the highest accuracy at 90.0%, followed by Claude 3 Opus and ChatGPT-o1, both of which scored 84.0% (p>0.028). ChatGPT-4o recorded an accuracy of 82.0%. Gemini 1.5 Pro attained 74.0%, while Mistral Large 2 and Llama 3.1 405B each scored 72.0%. R2 (T.C.) had a slightly lower accuracy of 70.0% (p>0.028). Perplexity Pro demonstrated the lowest performance among all models, with an accuracy of 68.0% (Fig. 3).
Claude 3.5 Sonnet outperformed Mistral Large 2, Llama 3.1 405B, and Perplexity Pro, achieving the highest scores on text-based questions (p=0.012, p=0.022, and p=0.007, respectively). It also performed significantly better than R1 and R2 (p=0.021 for both).
When the other LLMs were compared among themselves and with the radiologists, there was no significant difference in performance (p>0.028) (Table 3).
Discussion
The most striking result of our study is that the LLMs included in the study demonstrated promising performance in text-based interpretation of RECIST. Our study uniquely examines the performance of several LLMs regarding the RECIST guideline, comparing their performance with that of radiologists. This approach not only identifies which LLM demonstrates a more comprehensive grasp of RECIST 1.1 but also offers insights into how radiologists' performance stacks up against that of LLMs.
Coskun et al.[16] evaluated ChatGPT using 59 prostate cancer questions from the European Urology Patient Information Society and reported suboptimal accuracy (mean: 3.62±0.49) requiring improvement. Similarly, Lombardo et al.[17] tested ChatGPT (August 2023) with 195 questions from the EAU 2023 prostate cancer guidelines; expert review showed only 26% completely correct answers, with accuracy varying by section (best in follow-up/quality of life, poorest in diagnosis/treatment). Similar to our results, a recent study assessing LLMs in breast cancer care evaluated three models (GPT-3.5, GPT-4, and Gemini, formerly Bard) using 60 MCQs covering treatment, diagnostic techniques, imaging interpretation, and pathology in breast cancer. GPT-4 achieved a 95% accuracy rate, outperforming GPT-3.5 (90%) and Gemini (80%), with statistically significant differences observed among the models (p=0.010). Furthermore, the models performed consistently across questions sourced from public databases and those formulated by radiologists.[18] In addition, Cao et al.[19] evaluated LLMs on hepatocellular carcinoma diagnosis and management questions and found that ChatGPT-3.5, Gemini, and Bing answered only 45%, 60%, and 30% of basic clinical questions accurately, respectively, with even fewer responses deemed both accurate and reliable. Although there are still conflicting results in the literature about the competence of LLMs in radiology, our results indicate that LLMs have considerable theoretical knowledge of RECIST 1.1.
Another important result of our study is that the LLMs responded as well as the radiologists on text-based and case-based MCQs that require analysis of the presented findings and data. This result suggests that LLMs are successful in analyzing and reasoning over texts such as radiology reports and providing the status of the disease (progression, stable, or regression) according to the RECIST guideline, which is the most critical output for clinicians. To the best of our knowledge, there are no studies evaluating the performance of LLMs on case-based questions about cancer; previous studies have evaluated LLMs' knowledge of cancer and cancer-related guidelines using only text-based questions. Çıtır reported that ChatGPT-3.5 gave largely correct answers to questions about oral cancer, with 51.25% of answers rated "very good" and 46.25% rated "good", and an overall reliability of 97.5%.[20] Similarly, Yurtcu et al.[21] demonstrated that ChatGPT has strong accuracy in answering frequently asked questions about cervical cancer. Beyond these studies, our study uniquely demonstrates that LLMs perform adequately on case-based MCQs requiring correct analysis of the findings. With this finding, we believe that our study may be a starting point for further multicenter studies evaluating the performance of LLMs in real-patient scenarios.
In our study, all LLMs showed great consistency across the three different prompts, with only minor differences in responses. Given the nature of LLMs, which largely determine their answers according to the given prompt, it is a surprising result that these models performed similarly with different prompts on questions about RECIST 1.1.[21,22] In contrast to our results, Russe et al.[22] demonstrated the prompt effect: transforming a generic request into a precision prompt increased ChatGPT-4's factual correctness and decreased hallucinations, while a zero-shot chain-of-thought format further improved explainability and user trust. Nguyen et al.[23] tested ChatGPT on mixed text-and-image multiple-choice questions from the 2022 ACR in-training examination; an "encouraging" prompt boosted overall accuracy to 61%, whereas a "threatening" prompt reduced accuracy to 48% and tripled the non-response rate. These studies suggest that when a task is open-ended or cognitively complex, the model's probabilistic reasoning is highly sensitive to contextual cues embedded in the prompt. RECIST 1.1 evaluation, however, is a narrowly defined, rule-based exercise, so the knowledge retrieved is limited and the model's decision space is tightly constrained; once the required rules are invoked, rewording the prompt affords little additional leverage. Therefore, the prompting effect may diminish as the task approaches deterministic guideline application rather than broad clinical reasoning.
The impressive performance of Claude 3.5 Sonnet, which achieved 83.3% accuracy on case-based MCQs and 90% on text-based MCQs, indicates the great potential of this model in the field. The observed minor variations in accuracy among the LLMs can be largely attributed to differences in their underlying architectures. Models with real-time web access, such as Gemini 1.5 Pro and Perplexity Pro, frequently derive their responses from non-scientific sources, which may account for the comparatively lower performance of web-enabled LLMs relative to those without internet access. In contrast, some of the ChatGPT and Claude models are trained on closed datasets, potentially contributing to their enhanced reliability.
Limitations of the Study
Our study has a few limitations. First, the number of questions was limited, and the assessment relied solely on MCQs. The performance of LLMs on open-ended questions was not evaluated, which may have led to exaggerated LLM performance.
Second, we compared the accuracy of LLMs against two general radiologists with seven years of experience. It is likely that more experienced senior radiologists, particularly those with more specialized knowledge of oncological imaging, would achieve higher performance. Senior radiologists who are specialized and/or subspecialized in this field may perform even better than LLMs; however, because follow-up images of cancer patients are often evaluated by general radiologists in daily practice, we included general radiologists to better reflect real-life practice.
Lastly, this study assessed the performance of LLMs on RECIST 1.1 only textually, whereas visual evaluation remains an integral component of radiological assessment. As such, the results of our study may not fully reflect the real-world applicability of LLMs in this field. It is important to emphasize that although LLMs performed well on structured MCQs, this does not directly reflect their ability in actual imaging interpretation for RECIST.
Conclusion
Radiologists can benefit from understanding how LLMs interpret RECIST 1.1, as these models may soon assist in standardized follow-up reporting. Incorporating LLM-based educational modules and decision-support tools can enhance consistency and reduce interpretive variability. Ongoing evaluation of model accuracy and bias is essential before clinical deployment.
Informed Consent: The authors declare that this study was conducted without any commercial or financial relationships that could be construed as a potential conflict of interest.
Conflict of Interest Statement: The authors have no conflicts of interest to declare.
Funding: No funding was received for this study.
Use of AI for Writing Assistance: No AI technologies were utilized.
Author Contributions: Concept - E.Ç.; Design - E.Ç., T.C.; Supervision - Y.C.G.; Materials - E.Ç.; Data collection and/ or processing - E.Ç.; Data analysis and/or interpretation - E.Ç., T.C.; Literature search - E.Ç., T.C.; Writing - E.Ç.; Critical review - E.Ç., T.C., Y.C.G.
Peer-review: Externally peer-reviewed.
References
1) Nakaura T, Ito R, Ueda D, Nozaki T, Fushimi Y, Matsui Y, et al. The impact of large language models on radiology: A guide for radiologists on the latest innovations in AI. Jpn J Radiol 2024;42(3):1-12.
2) Chung EM, Zhang SC, Nguyen AT, Atkins KM, Sandler HM, Kamrava M. Feasibility and acceptability of ChatGPT-generated radiology report summaries for cancer patients. Digit Health 2023;9:1-7.
3) Keshavarz P, Bagherieh S, Nabipoorashrafi SA, Chalian H, Rahsepar AA, Kim GHJ, et al. ChatGPT in radiology: A systematic review of performance, pitfalls, and future perspectives. Diagn Interv Imaging 2024;105(7-8):251-65.
4) Bhayana R. Chatbots and large language models in radiology: A practical primer for clinical and research applications. Radiology 2024;310(1):1-8.
5) Bera K, O'Connor G, Jiang S, Tirumani SH, Ramaiya N. Analysis of ChatGPT publications in radiology: Literature so far. Curr Probl Diagn Radiol 2024;53(2):215-25.
6) Akinci D'Antonoli T, Stanzione A, Bluethgen C, Vernuccio F, Ugga L, Klontzas ME, et al. Large language models in radiology: Fundamentals, applications, ethical considerations, risks, and future directions. Diagn Interv Radiol 2024;30(2):80-90.
7) Srivastav S, Chandrakar R, Gupta S, Babhulkar V, Agrawal S, Jaiswal A, et al. ChatGPT in radiology: The advantages and limitations of artificial intelligence for medical imaging diagnosis. Cureus 2023;15(7):e41435.
8) Eisenhauer EA, Therasse P, Bogaerts J, Schwartz LH, Sargent D, Ford R, et al. New response evaluation criteria in solid tumours: Revised RECIST guideline (version 1.1). Eur J Cancer 2009;45(2):228-47.
9) Liu X, Duan C, Kim MK, Zhang L, Jee E, Maharjan B, et al. Claude 3 Opus and ChatGPT with GPT-4 in dermoscopic image analysis for melanoma diagnosis: Comparative performance analysis. JMIR Med Inform 2024;12:e59273.
10) Chiarelli G, Stephens A, Finati M, Cirulli GO, Beatrici E, Filipas DK, et al. Adequacy of prostate cancer prevention and screening recommendations provided by an artificial intelligence-powered large language model. Int Urol Nephrol 2024;56(4):1-7.
11) Aghamaliyev U, Karimbayli J, Giessen-Jung C, Matthias I, Unger K, Andrade D, et al. ChatGPT's gastrointestinal tumor board tango: A limping dance partner? Eur J Cancer 2024;205:114100.
12) Sorin V, Barash Y, Konen E, Klang E. Large language models for oncological applications. J Cancer Res Clin Oncol 2023;149(11):9505-8.
13) Güneş YC, Cesur T, Çamur E, Günbey Karabekmez L. Evaluating text and visual diagnostic capabilities of large language models on questions related to the Breast Imaging Reporting and Data System Atlas 5th edition. Diagn Interv Radiol 2025;31(2):1-8.
14) Kaba E, Hürsoy N, Solak M, Çeliker FB. Accuracy of large language models in thyroid nodule-related questions based on the Korean Thyroid Imaging Reporting and Data System (K-TIRADS). Korean J Radiol 2024;25(5):499-500.
15) Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig L, et al. STARD 2015: An updated list of essential items for reporting diagnostic accuracy studies. Radiology 2015;277(3):826-32.
16) Coskun B, Ocakoglu G, Yetemen M, Kaygisiz O. Can ChatGPT, an artificial intelligence language model, provide accurate and high-quality patient information on prostate cancer? Urology 2023;180:35-58.
17) Lombardo R, Gallo G, Stira J, Turchi B, Santoro G, Riolo S, et al. Quality of information and appropriateness of OpenAI outputs for prostate cancer. Prostate Cancer Prostatic Dis 2024;27(2):1-7.
18) Irmici G, Cozzi A, Della Pepa G, Berardinis CD, D'Ascoli E, Cellina M, et al. How do large language models answer breast cancer quiz questions? A comparative study of GPT-3.5, GPT-4 and Google Gemini. Radiol Med 2024;129(10):1-8.
19) Cao JJ, Kwon DH, Ghaziani TT, Kwo P, Tse G, Kesslman A, et al. Large language models' responses to liver cancer surveillance, diagnosis, and management questions: Accuracy, reliability, readability. Abdom Radiol 2024;49(12):4286-94.
20) Çıtır M. ChatGPT and oral cancer: A study on informational reliability. BMC Oral Health 2025;25(1):86.
21) Yurtcu E, Ozvural S, Keyif B. Analyzing the performance of ChatGPT in answering inquiries about cervical cancer. Int J Gynaecol Obstet 2025;168(2):502-7.