Evaluation of generative AI assistance in clinical nephrology: Assessing GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 in patient interaction and renal biopsy interpretation

Shih Yi Lin, Chang Cheng Jiang, Kin Man Law, Pei Chun Yeh, Min Kuang Tsai, Chu Lin Chou, I. Kuan Wang, I. Wen Ting, Yu Wei Chen, Che Yi Chou, Ming Han Hsieh, Heng Chih Pan, Sung Lin Hsieh, Chien Hua Chiu, Pei Wen Lee, Yu Cyuan Hong, Ying Yu Hsu, Huey Liang Kuo, Shu Woei Ju, Chia Hung Kao

Research output: Contribution to journal › Article › peer-review

Abstract

Importance: This study compares the responses of four AI models to common nephrology-related questions encountered in clinical settings.
Objective: To evaluate generative AI models for enhancing nephrology patient communication and education.
Design: In a study conducted December 8–12, 2023, and October 21–23, 2024, IT engineers queried GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 with 21 nephrology questions and three renal biopsy reports (24 items in total), repeating each query to check consistency.
Intervention(s) or Exposure(s): None.
Main Outcome(s) and Measure(s): Fifteen nephrologists and one nephrology researcher rated responses for Appropriateness, Helpfulness, Consistency, and human-like empathy on a 1–4 scale. The analysis used Shapiro–Wilk and Mann–Whitney U tests with Holm correction, along with TF-IDF, BERTScore, and ROUGE.
Results: GPT-4o consistently achieved high scores in Appropriateness (3.39 ± 0.7) and Helpfulness (3.24 ± 0.73), while PaLM 2 demonstrated the highest Consistency score (3.0 ± 0.86). In empathy, GPT-4 achieved the highest overall score (80.73%), excelling in patient-centric scenarios, followed by GPT-4o (76.56%). PaLM 2 showed competitive empathy in specific cases, despite scoring lower in consistency and Appropriateness. For kidney-related queries, GPT-4o excelled in relevance metrics, achieving the highest BERTScore (0.57) and the highest ROUGE score for one-word overlap (0.54). Gemini 1.0 Ultra led in generating coherent responses for renal biopsy reports, with the highest TF-IDF score (0.56) and the highest ROUGE score for longest similar sentences (0.47). All 101 references provided by GPT-4 were accurate.
Conclusions and Relevance: GPT-4o emerged as the most accurate and consistent model across most evaluation categories, while GPT-4 demonstrated superior empathy and balanced performance. PaLM 2 and Gemini 1.0 Ultra showed strengths in specific areas, highlighting the potential for tailored applications of generative AI in nephrology clinical practice.
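
The abstract reports Mann–Whitney U tests with Holm correction but does not describe the computation. As a rough illustration only, the Holm step-down adjustment can be sketched as below. The function name `holm_adjust` and the p-values are hypothetical and are not taken from the study; this is not the authors' analysis code.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the original order.

    Hypothetical helper for illustration; not the study's analysis code.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is multiplied by (m - k + 1).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    # Made-up p-values standing in for several pairwise model comparisons.
    example = [0.01, 0.04, 0.03, 0.005]
    print(holm_adjust(example))
```

A comparison would then be called significant when its adjusted p-value falls below the chosen threshold (commonly 0.05, which is an assumption here; the abstract does not state the threshold).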

Original language: English
Article number: 20552076251342067
Journal: Digital Health
Volume: 11
Publication status: Published - 2025

Keywords

  • Gemini 1.0 Ultra
  • Generative AI
  • GPT-4
  • GPT-4o
  • nephrology
  • PaLM 2

ASJC Scopus subject areas

  • Health Policy
  • Health Informatics
  • Computer Science Applications
  • Health Information Management
