Leveraging large language models for the deidentification and temporal normalization of sensitive health information in electronic health records

Hong Jie Dai, Tatheer Hussain Mir, Ching Tai Chen, Chien Chang Chen, Hao Ping Yang, Chung Hong Lee, Yi Yun Chou, Yu Chin Teng, Shalini Gupta, Omkar Panchal, Divyabharathy Ramesh Nadar, Wei Hsiang Liao, Yu Chuan Lin, Zi Rui Zhao, Richard Tzong Han Tsai, Yung Chun Chang, Jitendra Jonnagaddala

Research output: Contribution to journal › Article › peer-review

Abstract

Secondary use of electronic health record notes enhances clinical outcomes and personalized medicine, but risks sensitive health information (SHI) exposure. Inconsistent time formats hinder interpretation, necessitating deidentification and temporal normalization. The SREDH/AI CUP 2023 competition explored large language models (LLMs) for these tasks using 3,244 pathology reports with surrogated SHIs and normalized dates. The competition drew 291 teams; the top teams achieved macro-F1 scores >0.8. Results were presented at the IW-DMRN workshop in 2024. Notably, 77.2% used LLMs, highlighting their growing role in healthcare. This study compares competition results with in-context learning and fine-tuned LLMs. Findings show that fine-tuning, especially with lower-rank adaptation, boosts performance but plateaus or degrades in models over 6 B parameters due to overfitting. Our findings highlight the value of data augmentation, training strategies, and hybrid approaches. Effective LLM-based deidentification requires balancing performance with legal and ethical demands, ensuring privacy and interpretability in regulated healthcare settings.

Original language: English
Article number: 517
Journal: npj Digital Medicine
Volume: 8
Issue number: 1
Publication status: Published - Dec 2025

ASJC Scopus subject areas

  • Medicine (miscellaneous)
  • Health Informatics
  • Computer Science Applications
  • Health Information Management
