Comparative Analysis of Oversampling and SMOTEENN Techniques in Machine Learning Algorithms for Breast Cancer Prediction

Tri Yulian, Erliyan Redy Susanto

Abstract


Breast cancer is the leading cause of cancer-related death among women. A major challenge in developing predictive models is the class imbalance in medical datasets. This imbalance hinders detection of the minority class (patients with cancer), and detecting that class is critical for early diagnosis. This study analyzes the performance of Support Vector Machine (SVM) and Random Forest algorithms in predicting breast cancer when the data are preprocessed with oversampling or SMOTEENN. The dataset is the SEER Breast Cancer Dataset, which was balanced using each of the two techniques. Model performance was evaluated using accuracy, precision, recall, and F1-score. SVM with oversampling achieved the highest accuracy, 98.97%, followed by SVM with SMOTEENN at 97.20%. Random Forest reached an accuracy of 96.63% with oversampling and 95.90% with SMOTEENN. SVM proved more effective in identifying both classes with minimal error, particularly when combined with oversampling. These findings indicate that selecting an appropriate model and data preprocessing technique, such as oversampling or SMOTEENN, can significantly enhance predictive accuracy. This research contributes to the development of more accurate and reliable breast cancer prediction systems, supporting early diagnosis and clinical decision-making in medical applications.
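As an illustration of the two ideas the abstract relies on, the following minimal Python sketch shows random oversampling of a minority class and the four reported metrics computed from prediction counts. It is not the authors' code: the function names `random_oversample` and `classification_metrics`, the toy data, and the choice of plain random duplication as the oversampling variant are all assumptions made for this example.

```python
import random
from collections import Counter


def random_oversample(features, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes until
    every class has as many samples as the largest class.
    (Hypothetical helper; the paper does not specify its implementation.)"""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(features), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(features, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1-score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy imbalanced data (made up): 9 negative samples, 1 positive sample.
X = [[float(i)] for i in range(10)]
y = [0] * 9 + [1]

# After oversampling, both classes have 9 samples.
X_bal, y_bal = random_oversample(X, y)
balanced_counts = Counter(y_bal)

# A model that always predicts the majority class looks accurate on the
# imbalanced data but never detects the positive (minority) class.
majority_metrics = classification_metrics(y, [0] * len(y))
print(balanced_counts, majority_metrics)
```

On the toy data, the always-negative predictor reaches high accuracy with zero recall, which is why the abstract reports precision, recall, and F1-score alongside accuracy. SMOTEENN (SMOTE followed by Edited Nearest Neighbours cleaning) is not reimplemented here; one commonly used implementation is the `SMOTEENN` class in the imbalanced-learn package (`imblearn.combine`). A usual precaution, not stated in the abstract, is to resample only the training split so that duplicated or synthetic samples do not leak into the test set.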

Keywords


breast cancer; machine learning; oversampling; random forest; support vector machine


DOI: https://doi.org/10.32520/stmsi.v14i3.5146

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.