AI chatbots score well on endodontic board-style exams in study
A study finds that AI chatbots score comparably to advanced trainees on simulated endodontic board exams. The authors see them as useful for supplementing clinical education, not replacing human instruction.
Two artificial intelligence chatbots, GPT-4o and Gemini 2.5 Pro, performed at a level comparable to advanced endodontic trainees when tested on simulated American Board of Endodontics (ABE) oral examinations, according to a study published in the Journal of Endodontics on 26 February 2026. Researchers at Texas A&M College of Dentistry designed the test to assess clinical reasoning and decision-making rather than simple recall, using three endodontic cases with 20 open-ended questions each.
How the chatbots performed
Both systems scored highly on a 0-3 scale. Gemini 2.5 Pro achieved a mean score of 2.83, while GPT-4o scored 2.73. Two board-certified endodontists independently assessed the responses and rated most of them as acceptable to excellent. There was no statistically significant difference between the two models in clinical validity or overall performance. Gemini 2.5 Pro was more consistent across the three scenarios, while GPT-4o varied more by case type.
Limitations and educational use
The study's lead author, Dr Poorya Jalali, stressed that these results should not be over-interpreted. The chatbots cannot perform a clinical examination, interpret radiographs in a real clinical setting, or diagnose independently. They performed well because they received written prompts and detailed radiographic descriptions. A real ABE examination, by contrast, involves live timed interaction with examiners and independent radiographic interpretation.
The findings suggest AI chatbots are best used as educational supplements rather than replacements for human instruction. They could help students and residents practise answering clinical questions, test their knowledge, and compare their reasoning with model answers. Future research will explore whether these tools can help design high-quality examination questions.
Frequently asked questions
How well did GPT-4o and Gemini 2.5 Pro score on endodontic board-style exams?
Gemini 2.5 Pro achieved a mean score of 2.83 on a 0-3 scale, while GPT-4o scored 2.73. Two board-certified endodontists rated most responses from both models as acceptable to excellent, with no statistically significant difference between the two models in clinical validity or overall performance.
Can AI chatbots diagnose dental conditions and plan treatment independently?
No. The chatbots cannot perform clinical examination, interpret radiographs in a real clinical setting, or diagnose independently. Their performance depends entirely on the information provided to them, such as written descriptions and radiographic findings.
How should dentists and educators use AI chatbots in endodontic education?
The findings suggest they work best as educational supplements alongside traditional teaching, helping students and residents practise answering clinical questions, test their knowledge, and compare their reasoning with model answers. They are not replacements for human instruction or clinical expertise.
Did one chatbot perform better than the other in the study?
Gemini 2.5 Pro showed more consistent performance across the three scenarios, while GPT-4o varied more by case type. However, there was no statistically significant overall difference between the two models.
Would these chatbots reliably pass a real American Board of Endodontics examination?
The study does not show that. Its lead author cautioned that the results should not be over-interpreted. A real ABE examination involves live timed interaction with examiners and independent radiographic interpretation, conditions that differ substantially from how the chatbots were tested.