Trustworthiness and Explainability of Deep Learning for Diabetic Retinopathy Screening: Calibration and Clinical Utility Analysis

Authors

  • Ahmed Y. Abdelkafy, ElSewedy University of Technology (Polytechnic of Egypt)
    Competing Interests

    No Competing Interests

DOI:

https://doi.org/10.66279/z86vqm30

Keywords:

Diabetic retinopathy, Trustworthy AI, Probability calibration, Explainable AI, Deep learning

Abstract

Deploying deep learning models for diabetic retinopathy (DR) screening requires not only strong predictive accuracy but also reliable probability estimates, transparent decision processes, and demonstrable clinical utility. Although a growing body of work reports high classification accuracy, the trustworthiness of these models in practical screening settings remains underexplored. This paper introduces a comprehensive post-hoc framework for assessing the trustworthiness of a pretrained deep learning model used in diabetic retinopathy screening. A ResNet-50 model, trained on the APTOS 2019 dataset, was evaluated for binary classification of referable versus non-referable diabetic retinopathy without any retraining. Beyond standard performance metrics, probability calibration was assessed using the expected calibration error and reliability diagrams, together with post-hoc temperature scaling. Explainability was examined through Grad-CAM visualizations, and clinical utility was tested using decision curve analysis at different referral thresholds. The model showed strong discriminative performance, attaining an area under the receiver operating characteristic curve of 0.907, yet displayed considerable probability miscalibration. The explainability analysis revealed that correct predictions mostly attended to clinically relevant retinal regions, while high-confidence incorrect predictions highlighted potential risks of autonomous use. Decision curve analysis demonstrated a positive net clinical benefit across a wide range of referral thresholds. These findings indicate that accuracy alone is insufficient for clinical readiness and underscore the need for a comprehensive trustworthiness assessment before deep learning models are safely deployed in diabetic retinopathy screening.
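The abstract names three quantitative tools: expected calibration error with reliability binning, post-hoc temperature scaling, and decision-curve net benefit. The following is a minimal NumPy sketch of the standard formulas behind each, not the authors' implementation; the binary-probability inputs, the bin count, and the grid search over the temperature T are illustrative assumptions.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin predictions by confidence, then take the sample-weighted
    average gap between mean confidence and observed accuracy per bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs > lo) & (probs <= hi)
        if mask.sum() == 0:
            continue
        gap = abs(probs[mask].mean() - labels[mask].mean())
        ece += mask.mean() * gap  # weight by fraction of samples in the bin
    return ece

def temperature_scale(logits, labels, grid=None):
    """Post-hoc temperature scaling for a binary model: pick T > 0 that
    minimizes validation NLL of sigmoid(logits / T). Grid search keeps the
    sketch dependency-free; an optimizer would normally be used."""
    if grid is None:
        grid = np.linspace(0.05, 10.0, 400)
    def nll(T):
        p = 1.0 / (1.0 + np.exp(-logits / T))
        p = np.clip(p, 1e-12, 1.0 - 1e-12)
        return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return min(grid, key=nll)

def net_benefit(probs, labels, threshold):
    """Decision curve analysis: net benefit of referring every patient with
    predicted risk >= threshold, trading true vs. false positives at the
    odds implied by the threshold."""
    n = len(labels)
    refer = probs >= threshold
    tp = np.sum(refer & (labels == 1))
    fp = np.sum(refer & (labels == 0))
    return tp / n - (fp / n) * (threshold / (1.0 - threshold))
```

Temperature scaling only rescales logits, so it changes confidence without changing the ranking of predictions; this is why it can reduce ECE while leaving the AUC reported in the abstract unchanged.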




Published

26-04-2026

Data Availability Statement

The data used in this study are derived from the APTOS 2019 Diabetic Retinopathy Detection dataset, which is publicly available on Kaggle at: https://www.kaggle.com/c/aptos2019-blindness-detection/data  

How to Cite

Trustworthiness and Explainability of Deep Learning for Diabetic Retinopathy Screening: Calibration and Clinical Utility Analysis. (2026). Computational Discovery and Intelligent Systems (CDIS), 3(2), 164-177. https://doi.org/10.66279/z86vqm30
