Clinical Text Augmentation and Generation Using RAG for Large Language Models

Fathima, Nasreen; Ganesh, Kavitha

doi:10.33166/AETiC.2025.05.005

Paper #5

Nasreen Fathima and Kavitha Ganesh

Abstract: Large Language Models (LLMs) are becoming essential in clinical text generation, where synthetic medical data must be contextually accurate and applicable to real-world healthcare applications. Existing LLMs often lack domain-specific optimization and clarity, leading to incorrect outputs. These limitations can make their outputs unreliable, particularly for sensitive clinical data. To overcome these problems, this work proposes integrating Generative Adversarial Networks (GANs) with LLMs to improve clinical data accuracy and reduce hallucinations. LLMs such as LLaMA, BERT, and GPT are broadly used in clinical settings for tasks such as summarizing patient notes and answering medical queries. GANs are used to generate realistic synthetic clinical data, aiding privacy and data augmentation. A Latent Dirichlet Allocation (LDA) topic model is combined with the GAN to identify the underlying topics in clinical documents, ensuring that the synthetic text is coherent and thematically relevant. Retrieval Augmented Generation (RAG) dynamically retrieves current medical knowledge, grounding responses in real-time evidence and minimizing outdated information. The first phase focuses on generating and validating synthetic clinical data using GANs and LDA to ensure high quality and domain alignment; the second phase focuses on user interaction, where RAG retrieves relevant information in real time to answer queries and an interactive interface enables seamless engagement and feedback. Continuous evaluation with NLP metrics shows that the proposed Clinical Augmentation Generation and Retrieval Augmented Generation (CAG-RAG) framework outperforms the existing DALL-M approach in generating synthetic clinical text. For diagnosis-related data, the proposed CAG-RAG method achieves improvements of 15.7% in BLEU, 17% in ROUGE-1, and 17% in ROUGE-L scores. For medication-related data, the improvements are 20.8% in BLEU, 17.1% in ROUGE-1, and 17.25% in ROUGE-L. These results highlight the reliability, adaptability, and contextual accuracy of the proposed framework for clinical applications.

Keywords: Large Language Models; Retrieval Modules; Synthetic Clinical Data; Data Augmentation; Synthetic Text; NLP Metrics; Generative Adversarial Network; Retrieval Augmented Generation.
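
The abstract reports results in terms of ROUGE-1 and ROUGE-L. For readers unfamiliar with these metrics, the following is a minimal sketch of how they can be computed for a single candidate/reference pair. It is an illustration only, not the evaluation code used in this work: the whitespace tokenizer, the use of a balanced F1 score, the function names, and the example sentences are all assumptions made here.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real evaluations
    # typically use a more careful tokenizer.
    return text.lower().split()


def f1(overlap, n_candidate, n_reference):
    # Balanced F1 over precision (overlap / candidate length)
    # and recall (overlap / reference length).
    if overlap == 0 or n_candidate == 0 or n_reference == 0:
        return 0.0
    precision = overlap / n_candidate
    recall = overlap / n_reference
    return 2 * precision * recall / (precision + recall)


def rouge_1(candidate, reference):
    # ROUGE-1: clipped unigram overlap between candidate and reference.
    cand, ref = tokenize(candidate), tokenize(reference)
    overlap = sum((Counter(cand) & Counter(ref)).values())
    return f1(overlap, len(cand), len(ref))


def lcs_length(a, b):
    # Longest common subsequence length via row-by-row dynamic programming.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    # ROUGE-L: F1 computed from the longest common subsequence,
    # so word order matters.
    cand, ref = tokenize(candidate), tokenize(reference)
    return f1(lcs_length(cand, ref), len(cand), len(ref))


if __name__ == "__main__":
    # Hypothetical sentences, not taken from any dataset.
    reference = "patient was prescribed metformin for type 2 diabetes"
    candidate = "patient prescribed metformin for diabetes"
    print(f"ROUGE-1 F1: {rouge_1(candidate, reference):.3f}")
    print(f"ROUGE-L F1: {rouge_l(candidate, reference):.3f}")
```

ROUGE-1 ignores word order, while ROUGE-L rewards candidates that preserve the reference's word sequence, which is why the two are usually reported together when judging generated clinical text.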