Almost all leading large language models or ‘chatbots’ show signs of mild cognitive impairment in tests widely used to detect early signs of dementia, finds a study in the Christmas issue of The BMJ.
The results also show that ‘older’ versions of chatbots, like older patients, often perform worse on the tests. The authors say these findings “challenge the assumption that artificial intelligence will soon replace human doctors.”
Huge advances in artificial intelligence have led to a wave of excited and anxious speculation about whether chatbots can outperform human doctors.
Several studies have shown that large language models (LLMs) are remarkably adept at a range of medical diagnostic tasks, but their susceptibility to human impairments such as cognitive decline has not yet been explored.
To close this knowledge gap, researchers assessed the cognitive skills of the leading, publicly available LLMs – ChatGPT versions 4 and 4o (developed by OpenAI), Claude 3.5 “Sonnet” (developed by Anthropic), and Gemini versions 1 and 1.5 (developed by Alphabet) – using the Montreal Cognitive Assessment (MoCA) test.
The MoCA test is widely used to detect cognitive impairment and early signs of dementia, usually in older adults. Through a number of short tasks and questions, it assesses abilities including attention, memory, language, visuospatial skills, and executive function. The maximum score is 30 points, with a score of 26 or higher generally considered normal.
The instructions given to the LLMs for each task were the same as those given to human patients. Scoring followed official guidelines and was evaluated by a practicing neurologist.
ChatGPT 4o achieved the highest score on the MoCA test (26 out of 30), followed by ChatGPT 4 and Claude (25 out of 30), with Gemini 1.0 scoring the lowest (16 out of 30).
All chatbots showed poor performance on visuospatial skills and executive tasks, such as the trail making task (connecting encircled numbers and letters in ascending order) and the clock drawing test (drawing a clock face showing a specific time). Gemini models failed on the delayed recall task (remembering a five-word sequence).
All chatbots performed well on most other tasks, including naming, attention, language, and abstraction.
But in further testing, chatbots were unable to show empathy or accurately interpret complex visual scenes. Only ChatGPT 4o passed the incongruent phase of the Stroop test, which uses combinations of color names and font colors to measure how interference affects reaction time.
These are observational findings, and the authors acknowledge the essential differences between the human brain and large language models.
However, they point out that the uniform failure of all large language models on tasks requiring visual abstraction and executive function reveals a significant weakness that could hinder their use in clinical settings.
As such, they conclude: “Not only are neurologists unlikely to be replaced by large language models anytime soon, but our findings suggest that they will soon begin treating new, virtual patients – artificial intelligence models with cognitive impairment.”