Evaluation of Accuracy, Information Quality, and Readability of Artificial Intelligence Based Chatbots in Pediatric Oral Surgery: A Comparative Analysis Based on the AAPD Clinical Guideline

İlhan Kaya; Akif Demirel

Abstract

Objective: Chatbots powered by artificial intelligence (AI) are increasingly used as sources of medical and dental information. This study aimed to assess and compare the performance of four AI chatbots in providing evidence-based information on pediatric oral surgery topics, with reference to the American Academy of Pediatric Dentistry (AAPD) clinical guideline. Materials and Methods: This descriptive observational study evaluated four AI chatbots (ChatGPT-5, Gemini, Copilot, and DeepSeek) by posing 20 questions derived from the AAPD guideline Management Considerations for Pediatric Oral Surgery. Responses were assessed for accuracy using a validated grading system, for information quality using the 16-item DISCERN instrument, and for readability using the Flesch–Kincaid Grade Level (FKGL) formula. Statistical comparisons used the non-parametric Kruskal-Wallis and Mann-Whitney U tests with Holm-Bonferroni adjustment, with significance set at p<0.05. Results: Significant differences were observed among chatbots in all outcome measures. Gemini and ChatGPT-5 achieved the highest accuracy scores (1.30±0.47 and 1.40±0.60, respectively; p=0.001), whereas DeepSeek and Copilot showed lower accuracy. In terms of information quality, DeepSeek produced the highest DISCERN scores (52.90±3.73; p<0.001), followed by Copilot. ChatGPT-5 and Gemini yielded more readable outputs (10.73±1.98 and 11.68±1.91, respectively), though readability differences were not statistically significant (p>0.05). Conclusions: Of the models evaluated, Gemini and ChatGPT-5 produced the most accurate responses, while DeepSeek achieved the highest information quality. Although AI chatbots show promise as supplementary tools for patient education and clinical learning in pediatric oral surgery, their reliability varies considerably across platforms. Continuous validation and guideline-based evaluation are essential prior to clinical integration.
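As a rough illustration of two of the methods named in the abstract, the sketch below computes a Flesch–Kincaid Grade Level from raw counts, using the standard published form 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59, and applies a Holm-Bonferroni step-down adjustment to a set of pairwise p-values (four chatbots give six pairwise comparisons). The abstract does not state which software the authors used, so the function names `fkgl` and `holm_adjust` and all input numbers here are hypothetical and chosen only for demonstration.

```python
from typing import List, Sequence


def fkgl(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Flesch-Kincaid Grade Level from raw counts (standard coefficients)."""
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Holm-Bonferroni step-down adjusted p-values, returned in input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p-value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    # Hypothetical counts for a single chatbot response (not taken from the study).
    print(round(fkgl(100, 5, 150), 2))
    # Hypothetical raw p-values for the six pairwise chatbot comparisons.
    raw = [0.001, 0.004, 0.020, 0.030, 0.200, 0.600]
    print([round(p, 3) for p in holm_adjust(raw)])
```

Taking counts rather than raw text keeps the readability function free of any particular syllable-counting heuristic, which is where FKGL implementations tend to differ.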

Supporting Institution

Not applicable

Project Number

Not applicable

Ethical Statement

This study does not contain any biological material or demographic data from humans or animals. Therefore, ethics committee approval is not required for this study.

Acknowledgments

Not applicable

References

1. Brierley DJ, Chee CK, Speight PM. A review of paediatric oral and maxillofacial pathology. Int J Paediatr Dent 2013;23:319-329.
2. Kutcipal E. Pediatric oral and maxillofacial surgery. Dent Clin North Am 2013;57:83-98.
3. American Academy of Pediatric Dentistry. Management considerations for pediatric oral surgery. The Reference Manual of Pediatric Dentistry. Chicago, IL: American Academy of Pediatric Dentistry. 2025. Available at: https://www.aapd.org/globalassets/media/policies_guidelines/bp_oralsurgery.pdf
4. Cornelison BR, Erstad BL, Edwards C. Accuracy of a chatbot in answering questions that patients should ask before taking a new medication. J Am Pharm Assoc 2024;64:102110.
5. Akkoca F, Özdede M, İlhan G, Koyuncu E, Ellidokuz H. Assessing the success of ChatGPT-4o in oral radiology education and practice: a pioneering research. Cumhuriyet Dent J 2025;28:210-215.
6. Bayraktar Nahir C. Can ChatGPT be guide in pediatric dentistry? BMC Oral Health 2025;25:9.
7. Sezer B, Okutan AE. Evaluation of ChatGPT-4's performance on pediatric dentistry questions: accuracy and completeness analysis. BMC Oral Health 2025;25:1427.
8. Yıkıcı Çöl C, Nezir M, Özcan S. Evaluation of the performance of different chatbots’ responses to restorative dentistry-related questions. Cumhuriyet Dent J 2025;28:237-245.
9. Goodman RS, Patrinely JR, Stone CA Jr, Zimmerman E, Donald RR, Chang SS, et al. Accuracy and reliability of chatbot responses to physician questions. JAMA Netw Open 2023;6:e2336483.
10. Zhou Z, Wang H. Use of artificial intelligence chatbots in clinical management of immune-related adverse events. J Immunother Cancer 2024;12:e009999.
11. Mika AP, Martin JR, Engstrom SM, Polkowski GG, Wilson JM. Assessing ChatGPT responses to common patient questions regarding total hip arthroplasty. J Bone Joint Surg Am 2023;105:1519-1526.
12. Dursun D, Bilici Geçer R. Can artificial intelligence models serve as patient information consultants in orthodontics? BMC Med Inform Decis Mak 2024;24:211.
13. Gökcek Taraç M, Nale T. Artificial intelligence in pediatric dental trauma: do artificial intelligence chatbots address parental concerns effectively? BMC Oral Health 2025;25:736.
14. Badarudeen S, Sabharwal S. Assessing readability of patient education materials: current role in orthopaedics. Clin Orthop Relat Res 2010;468:2572-2580.
15. Wang LW, Miller MJ, Schmitt MR, Wen FK. Assessing readability formula differences with written health information materials: application, results, and recommendations. Res Social Adm Pharm 2013;9:503-516.
16. Yagcı F, Eraslan R, Albayrak H, İpekten F. Accuracy and reliability of artificial intelligence chatbots as public information sources in implant dentistry. Int J Oral Maxillofac Implants 2025;0:1-23.
17. Suárez A, Díaz-Flores García V, Algar J, Gómez Sánchez M, Llorente de Pedro M, Freire Y. Unveiling the ChatGPT phenomenon: evaluating the consistency and accuracy of endodontic question answers. Int Endod J 2024;57:108-113.
18. Özbay Y, Erdoğan D, Dinçer GA. Evaluation of the performance of large language models in clinical decision-making in endodontics. BMC Oral Health 2025;25:648.
19. Guven Y, Ozdemir OT, Kavan MY. Performance of artificial intelligence chatbots in responding to patient queries related to traumatic dental injuries: a comparative study. Dent Traumatol 2025;41:338-347.
20. Rokhshad R, Zhang P, Mohammad-Rahimi H, Pitchika V, Entezari N, Schwendicke F. Accuracy and consistency of chatbots versus clinicians for answering pediatric dentistry questions: a pilot study. J Dent 2024;144:104938.
21. Sezer B, Okutan AE. Evaluation of ChatGPT-4's performance on pediatric dentistry questions: accuracy and completeness analysis. BMC Oral Health 2025;25:1427.
22. Bhargava H, Salomon C, Suresh S, Chang A, Kilian R, Stijn DV, et al. Promises, pitfalls, and clinical applications of artificial intelligence in pediatrics. J Med Internet Res 2024;26:e49022.
23. Cross JL, Choma MA, Onofrey JA. Bias in medical AI: implications for clinical decision-making. PLOS Digit Health 2024;3:e0000651.
24. Chustecki M. Benefits and risks of AI in health care: narrative review. Interact J Med Res 2024;13:e53616.
25. Jamil Abusamra HN, Ali SHM, Khidir Elhussien WA, Ahmed Mirghani AM, Alameen Ahmed AA, Abdelrahman Ibrahim ME. Ethical and practical considerations of artificial intelligence in pediatric medicine: a systematic review. Cureus 2025;17:e79024.
26. Park T, Lee IH, Lee SW, Kong SW. Artificial intelligence in pediatric healthcare: current applications, potential, and implementation considerations. Clin Exp Pediatr 2025;68:641-651.

Details

Primary Language

English

Subjects

Oral and Maxillofacial Surgery, Paedodontics

Journal Section

Research Article

Authors

İlhan Kaya
0000-0002-6505-9658
Türkiye

Akif Demirel*
0000-0002-1433-0452
Türkiye

Publication Date

December 29, 2025

Submission Date

October 9, 2025

Acceptance Date

October 25, 2025

Published in Issue

Year 2025 Volume: 28 Number: 4

IZ

https://izlik.org/JA39PB67YZ

Cite

Kaya İ, Demirel A (December 1, 2025) Evaluation of Accuracy, Information Quality, and Readability of Artificial Intelligence Based Chatbots in Pediatric Oral Surgery: A Comparative Analysis Based on the AAPD Clinical Guideline. Cumhuriyet Dental Journal 28(4):586–593.