Combining large language models with traditional methods improves accuracy in identifying early signs of cognitive decline, offering new hope for early diagnosis.
Study: Improving early detection of cognitive decline in the elderly: a comparative study using large language models in clinical notes.
A recent study in eBioMedicine evaluated the effectiveness of large language models (LLMs) in identifying signs of cognitive decline in electronic health records (EHRs).
Background
Alzheimer’s disease and related dementias affect millions of people, reducing their quality of life and incurring financial and emotional costs. Early identification of cognitive decline can lead to more effective therapy and a higher level of care.
LLMs have shown encouraging results in several healthcare domains and clinical language-processing tasks, including information extraction, entity recognition, and question answering. However, their effectiveness in detecting specific clinical conditions, such as cognitive decline, from electronic health record data has not been well established.
Few studies have evaluated LLMs on EHR data within Health Insurance Portability and Accountability Act (HIPAA)-compliant cloud computing environments, and little research has compared large language models with traditional artificial intelligence (AI) approaches such as machine learning and deep learning. Such comparisons can inform strategies for improving these models.
About the study
In the current study, researchers examined the early detection of progressive cognitive decline using large language models and EHR data. They also compared the performance of large language models with conventional models trained with domain-specific data.
The researchers analyzed proprietary and open-source LLMs at Mass General Brigham in Boston. They studied clinical notes from the four years preceding a 2019 diagnosis of mild cognitive impairment (MCI) in individuals aged ≥50 years.
MCI was defined using International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) codes. The team excluded cases of transient or reversible cognitive decline and those that subsequently recovered.
GPT-4 (proprietary) and Llama 2 (open-source) were deployed on HIPAA-compliant cloud computing systems.
Prompt augmentation methods, including error-analysis-based instructions, retrieval-augmented generation (RAG), and hard prompting, were used to develop the LLM-based approaches. Hard prompt examples were selected randomly, in a targeted manner, or with the aid of K-means clustering, as sketched below.
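For illustration, here is a minimal sketch of what K-means-assisted example selection might look like; the embedding inputs, the function name select_examples, and the parameter choices are assumptions made for this sketch, not details taken from the study.

```python
# Illustrative sketch: K-means-assisted selection of diverse few-shot examples
# for hard prompting. Inputs and names are assumptions, not the study's pipeline.
import numpy as np
from sklearn.cluster import KMeans

def select_examples(note_embeddings, notes, k=5):
    """Pick k diverse notes by choosing the note closest to each K-means centroid."""
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(note_embeddings)
    selected = []
    for centroid in km.cluster_centers_:
        idx = int(np.argmin(np.linalg.norm(note_embeddings - centroid, axis=1)))
        selected.append(notes[idx])
    return selected

# The selected notes, paired with their labels, would then serve as the
# few-shot demonstrations placed ahead of the note to be classified.
```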
Baseline models included XGBoost and an attention-based deep neural network (DNN); the DNN framework incorporated bidirectional long short-term memory (LSTM) networks (see the sketch below). Based on performance, the researchers selected the best LLM-based approach.
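As a rough illustration of this family of baseline, the sketch below defines an attention-based bidirectional LSTM classifier in PyTorch; the layer sizes, the class name AttnBiLSTM, and the binary output are illustrative assumptions rather than the study's actual architecture.

```python
# Minimal sketch of an attention-based bidirectional LSTM note classifier.
# Dimensions and names are illustrative assumptions, not the study's model.
import torch
import torch.nn as nn

class AttnBiLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)   # scores each token position
        self.out = nn.Linear(2 * hidden, 1)    # binary logit: cognitive decline or not

    def forward(self, token_ids):
        h, _ = self.lstm(self.embed(token_ids))        # (batch, seq, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)   # attention weights over tokens
        context = (weights * h).sum(dim=1)             # weighted sum of token states
        return self.out(context).squeeze(-1)           # one logit per note
```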
They then constructed an ensemble of three models based on majority voting and evaluated performance using confusion-matrix-derived metrics. The team used an intuitive manual template-design method to refine the task descriptions, and additional task guidance improved LLM reasoning.
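The majority-voting step can be illustrated with a short sketch; the function majority_vote and the binary label encoding are assumptions made for illustration, not code from the study.

```python
# Sketch of a three-model majority-vote ensemble over binary predictions
# (1 = cognitive decline, 0 = no decline); the inputs are assumed model outputs.
import numpy as np

def majority_vote(preds_a, preds_b, preds_c):
    """Return the label predicted by at least two of the three models."""
    votes = np.stack([preds_a, preds_b, preds_c])   # shape (3, n_samples)
    return (votes.sum(axis=0) >= 2).astype(int)

# Example: one model disagrees on the second note, but the other two outvote it.
print(majority_vote(np.array([1, 0, 1]), np.array([1, 1, 0]), np.array([1, 1, 1])))  # [1 1 1]
```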
Results
The study dataset consisted of 4,949 sections of clinical notes from 1,969 individuals, 53% of whom were women, with a mean age of 76 years. The notes were filtered with cognitive-function keywords to develop the study models. The test dataset, assembled without keyword filtering, included 1,996 note sections from 1,161 individuals, 53% of whom were women, with a mean age of 77 years.
The team found GPT-4 to be more accurate and efficient than Llama 2. However, GPT-4 could not outperform conventional models trained on domain-specific, local EHR data. The error profiles of the general-domain LLM and the machine learning and deep learning models differed considerably, and combining them into an ensemble improved performance dramatically.
The ensemble model achieved 90% accuracy, 94% recall, and a 92% F1 score, outperforming every individual model across all performance metrics, with statistically significant differences.
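For reference, these metrics are all derived from a confusion matrix; the sketch below shows how they are typically computed, using hypothetical counts rather than the study's actual figures.

```python
# Standard confusion-matrix metrics; the counts in the example call are
# hypothetical, not the study's actual confusion matrix.
def metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(metrics(tp=90, fp=10, fn=6, tn=94))  # illustrative counts only
```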
Strikingly, compared with the most accurate individual model, the ensemble raised accuracy from below 80% to above 90%. Error analysis showed that 63 samples were predicted incorrectly by at least one model, but only two of these (3.2%) were misclassified by all of the models. These findings highlight the diversity of the models' error profiles. Among the LLM configurations, dynamic RAG with five-shot prompts and error-analysis-based instructions produced the best results.
GPT-4 highlighted dementia treatment options such as Aricept (donepezil). It also detected diagnoses such as mild and major neurocognitive disorders and vascular dementia better than the other models, and it picked up on the emotional and psychological consequences of cognitive problems, such as anxiety, which the other models often missed.
Unlike the conventional models, GPT-4 could process ambiguous sentences and analyze sophisticated information without being confused by negations and contextual factors. However, it occasionally overinterpreted findings or was overly cautious, overlooking the underlying reasons for clinical events. Both GPT-4 and the attention-based DNN sometimes misinterpreted clinical test results.
Conclusions
Based on the study results, large language models and traditional AI models trained on electronic health records had distinct error profiles, and combining the three models into an ensemble improved diagnostic performance.
The results indicate that general-domain LLMs require further development to support clinical decision making. Future studies should combine LLMs with more localized models, using medical knowledge and domain expertise to improve performance on specific tasks, and should experiment with prompting and fine-tuning strategies.