AI systems unreliable judges of dental chatbot advice, study finds
Study shows AI cannot reliably assess AI dental advice quality, highlighting need for human expert oversight in practice.
Study compares AI language models with human clinicians
Researchers evaluated six large language models (LLMs) using nine oral health consultation questions covering topics such as infant oral care, pregnancy-related oral health, dry mouth in older adults, oral disease prevention and dental trauma. The questions were based on material from the FDI World Dental Federation. Two experienced dental clinicians scored the LLM responses, as did three additional LLMs acting as AI judges.
AI judges show poor consistency with human experts
DeepSeek-V3 and Doubao-1.8-Pro achieved the strongest overall performance on scientific accuracy, logical rigour, clinical practicality, terminology and completeness. However, the study revealed a critical flaw in AI-based evaluation: while agreement between the two human clinicians was high, indicating strong consistency in expert assessment, agreement among the AI judges was much lower, and agreement between the AI judges and the human clinicians was extremely poor.
The AI evaluators scored responses more harshly than human experts but still failed to reliably identify clinically important omissions, particularly in preventive advice and guidance for higher-risk patient groups. Researchers attributed this to how current LLMs evaluate clinical information, suggesting they may prioritise fluency and general completeness over the clinical importance of risks and patient-specific cautions.
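Inter-rater agreement of the kind the study reports is commonly quantified with chance-corrected statistics such as Cohen's kappa. The sketch below shows how the contrast between high human-human agreement and poor AI-human agreement can look numerically; the ratings are invented for illustration and are not the study's data:

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items both raters label identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the raters labelled independently at random,
    # each according to their own label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 1-5 quality scores for nine chatbot responses.
clinician_1 = [5, 4, 4, 3, 5, 2, 4, 3, 5]
clinician_2 = [5, 4, 4, 3, 5, 2, 4, 4, 5]
ai_judge    = [3, 5, 2, 4, 3, 4, 2, 5, 3]

print(round(cohen_kappa(clinician_1, clinician_2), 2))  # 0.84: strong agreement
print(round(cohen_kappa(clinician_1, ai_judge), 2))     # -0.33: worse than chance
```

A kappa near 1 indicates consistent assessment; values near or below 0 mean the raters agree no better than chance, which is the pattern the study describes for its AI judges against the human clinicians.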
Implications for clinical practice
The study did not conclude that AI systems are unsafe for providing general oral health information. Instead, it strongly cautions against relying on AI systems alone to evaluate the quality or safety of clinical advice: the researchers stated that current AI-as-a-judge frameworks are not reliable substitutes for expert human review in dentistry. The findings suggest that LLMs have potential as tools for delivering standardised oral health information and supporting patient education, particularly where immediate access to dental professionals is limited. That potential, however, depends on expert oversight; the models are not replacements for clinician judgement.
Frequently asked questions
Can AI chatbots safely provide general oral health advice?
The study found that LLMs have potential to deliver standardised oral health information and support patient education, particularly where immediate access to dental professionals is limited. However, all AI advice should be reviewed by qualified clinicians rather than trusted as standalone guidance.
Why are AI judges unreliable for evaluating dental advice?
AI judges showed poor agreement with each other and extremely poor agreement with human experts. They scored responses more harshly than human experts yet still failed to identify clinically important omissions, particularly regarding preventive advice for higher-risk patient groups. LLMs appear to prioritise fluency and general completeness over clinically important risks and patient-specific cautions.
Which large language models performed best in the study?
DeepSeek-V3 and Doubao-1.8-Pro achieved the strongest overall performance, scoring highly on scientific accuracy, logical rigour, clinical practicality, terminology and completeness. GPT-5, Gemini 3, Qwen3-Max and Kimi K2 also performed well but with greater variability.
How consistent were human clinicians versus AI judges?
Agreement between the two experienced dental clinicians was high, showing strong consistency in expert assessment. In contrast, consistency among AI judges was much lower, and their agreement with human clinicians was extremely poor.
What topics were included in the oral health consultation questions?
The nine questions covered infant oral care, pregnancy-related oral health, dry mouth in older adults, oral disease prevention and dental trauma. All questions were based on material from the FDI World Dental Federation.