Enhancing Stroke Risk Prediction with Explainable AI: Leveraging Resampling and Machine Learning for Improved Accuracy
Highlights:
- Advanced resampling techniques improved class balance in stroke datasets
- Gradient Boosting with SMOTE reached 92% accuracy, with SHAP used to interpret the model
ABSTRACT
Introduction: Stroke is a major global health concern that affects millions of people worldwide and contributes substantially to morbidity and mortality. Early detection and accurate risk prediction remain critical for effective prevention.
Objective: This study aimed to improve stroke risk prediction by applying machine learning algorithms to health survey data, both to identify key predictors and to enhance predictive performance.
Methods: A dataset derived from the National Health and Nutrition Examination Survey, comprising 4,603 participants, was used. The dataset was imbalanced: only 7.86% of participants had been diagnosed with stroke. To address this imbalance, advanced resampling techniques, including SMOTE, SMOTETomek, and ADASYN, were applied. A range of tree-based algorithms was implemented, including Gradient Boosting, AdaBoost, XGBoost, and a Voting Classifier combining Decision Tree, AdaBoost, and Gradient Boosting classifiers. Models were evaluated using accuracy and AUC. Explainable Artificial Intelligence (XAI) analyses were conducted with SHAP (SHapley Additive exPlanations) to interpret feature importance.
Results: The Gradient Boosting classifier combined with SMOTE achieved the highest performance, with an accuracy of 92% and an AUC of 0.70. SHAP analysis identified age, general health condition, marital status, and BMI as the most influential predictors of stroke risk.
Conclusion: This study underscores the need for continued advances in early stroke detection. The findings highlight the potential of machine learning and XAI in predictive healthcare and offer insights for stroke prevention strategies.
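The resampling step named in the abstract can be illustrated with a minimal sketch of the core idea behind SMOTE: each synthetic minority sample is placed at a random point on the line segment between a real minority sample and one of its nearest minority neighbours. This is a standard-library illustration only, not the authors' implementation; the function name `smote_oversample`, its parameters, and the two-feature toy data are assumptions made for the example.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Hypothetical helper for illustration; not the API of any library.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point (Euclidean distance)
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class (made-up values, not taken from the survey data): [age, BMI]
minority = [[67.0, 31.2], [72.0, 28.5], [80.0, 26.1], [59.0, 34.0]]
majority_count = 12  # assumed size of the majority class in this toy example

new_points = smote_oversample(minority, majority_count - len(minority), k=2, seed=42)
balanced_minority = minority + new_points
print(len(balanced_minority))
```

Because every synthetic point lies between two real minority points, the generated values stay within the range of the observed minority data. In practice, resampling of this kind is usually applied to the training split only, so that evaluation data keep the original class distribution.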
Authors
Copyright (c) 2025 Minhazul Alam Mahin, Md. Mominul Islam, Md. Zulfikar Alam, Arnob Dutta Pollob, Oxita Zaman

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.