TY - JOUR
T1 - Performance Comparison of Junior Residents and ChatGPT in the Objective Structured Clinical Examination (OSCE) for Medical History Taking and Documentation of Medical Records
T2 - Development and Usability Study
AU - Huang, Ting-Yun
AU - Hsieh, Pei Hsing
AU - Chang, Yung-Chun
N1 - Publisher Copyright:
© Ting-Yun Huang, Pei Hsing Hsieh, Yung-Chun Chang.
PY - 2024
Y1 - 2024
AB - Background: This study explores the cutting-edge abilities of large language models (LLMs) such as ChatGPT in medical history taking and medical record documentation, with a focus on their practical effectiveness in clinical settings, an area vital for the progress of medical artificial intelligence. Objective: Our aim was to assess the capability of ChatGPT versions 3.5 and 4.0 to perform medical history taking and medical record documentation in simulated clinical environments. The study compared the performance of nonmedical individuals using ChatGPT with that of junior medical residents. Methods: A simulation involving standardized patients was designed to mimic authentic medical history–taking interactions. Five nonmedical participants used ChatGPT versions 3.5 and 4.0 to take medical histories and document medical records, mirroring the tasks performed by 5 junior residents in identical scenarios. A total of 10 diverse scenarios were examined. Results: Two senior emergency physicians evaluated the medical documentation created by laypersons with ChatGPT assistance and by junior residents, using the audio recordings and the final medical records. The assessment used the Objective Structured Clinical Examination benchmarks in Taiwan as a reference. ChatGPT-4.0 exhibited substantial improvements over its predecessor and met or exceeded the performance of its human counterparts on both checklist and global assessment scores. Although the overall quality of human consultations remained higher, ChatGPT-4.0’s proficiency in medical documentation was notably promising. Conclusions: ChatGPT-4.0 performed on par with human participants in Objective Structured Clinical Examination evaluations, signifying its potential in medical history taking and medical record documentation. Even so, the superiority of human consultations in terms of quality was evident. The study underscores both the promise and the current limitations of LLMs in clinical practice.
KW - clinical documentation
KW - large language model
KW - LLM
KW - medical history taking
KW - OSCE standards
KW - simulation-based evaluation
UR - http://www.scopus.com/inward/record.url?scp=85211741456&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85211741456&partnerID=8YFLogxK
U2 - 10.2196/59902
DO - 10.2196/59902
M3 - Article
AN - SCOPUS:85211741456
SN - 2369-3762
VL - 10
JO - JMIR Medical Education
JF - JMIR Medical Education
M1 - e59902
ER -