Lung Cancer Classification using the Naïve Bayes Method with SMOTE

Ananda Ikhwana Khairur Akbar; Yani Parti Astuti

doi:10.32520/stmsi.v14i6.5607

Abstract

This study addresses three challenges: delays in the early detection of lung cancer caused by non-specific initial symptoms; the limitations of the Naïve Bayes algorithm in processing categorical data such as symptoms, gender, and smoking habits; and class imbalance in the dataset, which can affect model accuracy. To overcome these challenges, SMOTE (Synthetic Minority Over-sampling Technique) was applied to improve classification performance. The study implements the Naïve Bayes algorithm for lung cancer classification and compares its performance on imbalanced data with its performance on data balanced using SMOTE. The methodology consists of data preprocessing, encoding, balancing with SMOTE, and classification using Naïve Bayes. Evaluation was performed using three data split ratios: 80:20, 70:30, and 60:40. Applying SMOTE improved performance most at the 60:40 split ratio, where model accuracy rose from 88.29% to 93.19%. For the “Yes” (positive) class, precision remained at 0.96, recall at 0.91, and F1-score at 0.93. For the “No” (negative) class, precision improved from 0.40 to 0.90, recall from 0.60 to 0.96, and F1-score from 0.48 to 0.93. Conversely, accuracy decreased slightly at the 80:20 and 70:30 ratios after SMOTE was applied. These findings show that SMOTE substantially improves model performance at the 60:40 ratio, not only in accuracy but also in recall and F1-score, which are crucial for reducing false negatives in the positive (“Yes”) class. This matters especially in early detection, where correctly identifying actual cancer cases is more important than maintaining overall accuracy. Although SMOTE did not always improve accuracy at the other ratios, it still contributed to better cancer case detection. Its application should therefore be considered carefully, balancing overall accuracy against clinically meaningful metrics.
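The balancing and evaluation steps named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `smote` and `class_report`, the neighbour count `k`, and the toy rows are all hypothetical and chosen only for illustration. The sketch assumes features have already been encoded as numbers; it shows the core SMOTE idea (interpolating between a minority row and a nearby minority row) and how per-class precision, recall, and F1-score are derived from predictions.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (squared Euclidean distance)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by distance to the chosen base row.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

def class_report(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, treating `label` as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy, made-up data (not from the study): 3 minority rows versus 6 majority rows.
minority_rows = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
majority_count = 6
new_rows = smote(minority_rows, majority_count - len(minority_rows), k=2)
balanced_minority = minority_rows + new_rows  # now as many rows as the majority

# Per-class metrics for the "No" class on made-up predictions.
precision, recall, f1 = class_report(
    ["Yes", "Yes", "No", "No"], ["Yes", "No", "No", "No"], "No"
)
```

Because interpolation produces fractional values, applying it directly to encoded categorical features (as in this sketch) is a simplification; whether and how the study handled this is not stated in the abstract.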

Keywords

machine learning; Naïve Bayes; lung cancer; classification; model evaluation

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
