Letter 2 regarding “Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma”
Dear Editor,
We read with great interest the recently published research analyzing the performance of ChatGPT with respect to the management of cirrhosis and hepatocellular carcinoma [1]. In addition to these advanced liver diseases, steatotic liver disease (SLD) also represents a considerable burden on global health, as it affects one-third of the worldwide population [2]. SLD requires long-term self-management and continuous support. This stems from its slow progression, the emphasis on lifestyle changes, and the need for regular patient-physician interactions. Therefore, for patients diagnosed with SLD, education plays a pivotal role in understanding, managing, and possibly reversing their condition. In our evolving digital era, large language models (LLMs), which are sophisticated generative AI systems trained on vast volumes of data and capable of producing human-like textual responses, have emerged as promising aids for patient education [3], particularly in facilitating interactions through natural language dialogues [4]. However, given that the efficacy of LLMs in advancing SLD patient education might vary, it is imperative to compare their performance. We therefore conducted a comparative evaluation study to assess the performance of five leading LLMs in responding to SLD-related queries.
Our study was performed between September 8 and 28, 2023. We curated 30 common SLD-related queries spanning domains such as risk factors, clinical testing and diagnosis, treatment, follow-up, and prognosis, based on guideline-based topics and our clinical experience (Table 1) [5,6]. Each query was posed as a separate and independent prompt to five LLMs: ChatGPT-3.5, ChatGPT-4, Google Bard, Meta Llama2 and Anthropic Claude2, which yielded a total of 30 responses per LLM chatbot. The generated responses were then randomly ordered within each set of questions and stripped of revealing information (e.g., statements such as “I’m not a doctor” from ChatGPT) to blind the reviewers to the identity of the LLM behind each response. Three seasoned attending-level physicians independently graded the responses as either “appropriate” or “inappropriate” over five rounds, each on a separate day, with an overnight washout interval in between to mitigate memory bias (Supplementary Fig. 1). Specifically, the responses were graded as “appropriate” when they were free from errors and “inappropriate” when they contained potential factual errors that could harm or mislead the average patient. The final grade for each chatbot response was determined using a majority consensus approach, based on the grade most often assigned by the three expert graders.
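The majority consensus step described above can be illustrated with a minimal Python sketch. The function name `majority_grade` and the example grades are hypothetical and are not study data; the sketch covers only the vote among three graders and does not model how the five grading rounds were handled, which the letter does not detail.

```python
from collections import Counter


def majority_grade(grades):
    """Return the grade assigned most often by an odd number of graders."""
    if len(grades) % 2 == 0:
        # With two possible grades, an odd panel size guarantees no ties.
        raise ValueError("an odd number of graders is needed to avoid ties")
    grade, _count = Counter(grades).most_common(1)[0]
    return grade


# Hypothetical grades from three graders for two responses (not study data).
example = {
    "response_A": ["appropriate", "appropriate", "inappropriate"],
    "response_B": ["inappropriate", "inappropriate", "appropriate"],
}

final = {response_id: majority_grade(g) for response_id, g in example.items()}
print(final)
```

Because each response receives one of only two grades from three graders, one grade is always held by at least two graders, so no tie-breaking rule is needed.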
We assessed the performance of the five LLMs in responding to SLD-related queries. As shown in Table 1, ChatGPT-4 provided 29 of 30 (96.7%) appropriate responses, followed by Bard and Llama2 with 27 of 30 (90.0%) each, and ChatGPT-3.5 and Claude2 with 24 of 30 (80.0%) each (chi-square test, χ2=6.17, P=0.18). A notable area of concern was that responses frequently presented fatty liver disease as synonymous with nonalcoholic fatty liver disease (NAFLD). This oversimplification can lead to inaccuracies. For example, ChatGPT-3.5 replied to the question “Are there different stages of fatty liver disease, and how do they differ?” with the following response: “Yes, there are different stages of fatty liver disease, which is also known as nonalcoholic fatty liver disease (NAFLD). …. The stages of NAFLD are typically categorized as follows: 1. Simple Steatosis (Fatty Liver): ….2. Nonalcoholic Steatohepatitis (NASH): .... 3. Fibrosis: …. 4. Cirrhosis: ….”
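As an illustration of the arithmetic behind these figures, the following Python sketch recomputes the appropriateness rates from the reported counts and applies a plain Pearson chi-square to the resulting 5×2 table. The helper `pearson_chi_square` is hypothetical, and the letter does not state which variant of the test or which software was used, so the statistic printed here is an assumption-laden recomputation that may differ from the reported value.

```python
# Appropriate-response counts out of 30 queries, as reported in the text.
appropriate = {
    "ChatGPT-4": 29,
    "Bard": 27,
    "Llama2": 27,
    "ChatGPT-3.5": 24,
    "Claude2": 24,
}
N_QUERIES = 30

# Appropriateness rate per model, in percent, rounded to one decimal place.
rates = {model: round(100 * k / N_QUERIES, 1) for model, k in appropriate.items()}


def pearson_chi_square(counts, n):
    """Pearson chi-square for a (models x 2) appropriate/inappropriate table.

    Every model answered the same number of queries (n), so the expected
    counts are identical across models.
    """
    total_ok = sum(counts.values())
    total = n * len(counts)
    expected_ok = n * total_ok / total
    expected_bad = n - expected_ok
    stat = 0.0
    for k in counts.values():
        stat += (k - expected_ok) ** 2 / expected_ok
        stat += ((n - k) - expected_bad) ** 2 / expected_bad
    dof = len(counts) - 1  # (rows - 1) * (columns - 1) with two columns
    return stat, dof


stat, dof = pearson_chi_square(appropriate, N_QUERIES)
print(rates)
print(f"chi-square = {stat:.2f}, df = {dof}")
```

Several expected counts for inappropriate responses fall below 5 in this table, so an exact test might be preferred in practice; the sketch is meant only to make the comparison concrete.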
This evaluation revealed that, among five state-of-the-art LLMs, ChatGPT-4 generated largely appropriate responses to patient queries regarding SLD, with an appropriateness rate of 96.7%. The other LLMs provided 80% to 90% appropriate responses. Health literacy, commonly defined as the degree to which individuals have the skills and abilities to obtain, process, and utilize health-related information, has emerged as a critical priority in reducing inequities among patients, including those with SLD [7,8]. Our findings underscore the varied potential of LLM chatbots to provide professional yet patient-friendly health literacy guidance to SLD patients [3]. Whereas prior investigations predominantly focused on ChatGPT-3.5 [1], our study assessed five popular LLMs, namely ChatGPT-3.5, ChatGPT-4, Bard, Llama2 and Claude2, and specifically evaluated their proficiency in addressing typical SLD-related patient queries. Notably, one in five responses from ChatGPT-3.5 and Claude2 was inappropriate, highlighting the need for further iterations and probably domain-specific fine-tuning. Although the exact parameters of ChatGPT-4 remain undisclosed, its performance may result from its large parameter set, extensive user feedback, advanced reasoning abilities, and the integration of insights from previous models into the system [9]. Strengths of this study include its design, with randomization, washout periods and a majority consensus grading process. However, there are also limitations. The sample queries may represent only a small proportion of real-world scenarios. In addition, as the field of LLMs evolves at an unprecedented speed, future research is needed to confirm whether LLMs are adapting to new nomenclature, such as metabolic dysfunction-associated steatotic liver disease (MASLD).
Generative AI with LLMs, especially ChatGPT-4, may offer valuable opportunities for patient education about SLD.
Notes
Authors’ contribution
Acquisition of data: Yiwen Zhang, Hongwei Ji, Liwei Wu, Zepeng Mu. Analysis and interpretation of data: Hongwei Ji. Drafting of the manuscript: All authors. Critical revision of the manuscript for important intellectual content: All authors. Statistical analysis: Yiwen Zhang, Hongwei Ji. Obtained funding: Hongwei Ji. Study supervision: Hongwei Ji.
Conflicts of Interest
The authors have no conflicts to disclose.
Acknowledgements
This study was funded in part by the National Key R & D Program of China (2022YFC2502800), National Natural Science Foundation of China (82103908), the Shandong Provincial Natural Science Foundation (ZR2021QH014), the Shuimu Scholar Program of Tsinghua University, and National Postdoctoral Innovative Talent Support Program (BX20230189). The funding sources had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
Abbreviations
SLD: steatotic liver disease
LLMs: large language models
NAFLD: nonalcoholic fatty liver disease
MASLD: metabolic dysfunction-associated steatotic liver disease
SUPPLEMENTAL MATERIAL
Supplementary material is available at the Clinical and Molecular Hepatology website (http://www.e-cmh.org).