Lung Cancer Classification Using the Extreme Gradient Boosting (XGBoost) Algorithm and Mutual Information for Feature Selection

Regitha Zizilia; Yulison Herry Chrisnanto; Gunawan Abdillah

doi:10.32520/stmsi.v14i5.5345

Abstract

Lung cancer is one of the deadliest cancers worldwide and is often detected late because early symptoms are absent. This study evaluates how feature selection with Mutual Information affects the performance of lung cancer classification with the XGBoost algorithm. Mutual Information is used to select relevant features, capturing both linear and non-linear relationships with the target variable, while XGBoost is chosen for its ability to handle large datasets and reduce overfitting. The study used a dataset of 30,000 records with train-test split scenarios of 90:10, 80:20, 70:30, and 60:40. Before Mutual Information was applied, testing accuracy ranged from 93.42% to 93.83% and K-Fold Cross-Validation accuracy ranged from 94.59% to 94.76%. After feature selection, testing accuracy remained stable for the 70:30 and 60:40 splits, at 93.60% and 93.42% respectively, although K-Fold Cross-Validation accuracy decreased to 89.26% and 90.88%. For the 90:10 and 80:20 splits, both measures declined: testing accuracy dropped to 88.63% and 88.85%, and K-Fold Cross-Validation accuracy fell to 88.87% and 90.24%. Feature selection with Mutual Information therefore improves computational efficiency by reducing the number of features, and in some split scenarios it simplifies the feature set without significantly reducing testing accuracy; the outcome depends on the characteristics of the dataset.
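The claim that Mutual Information captures non-linear as well as linear relationships can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not say which estimator or library they used. The function names (`mutual_information`, `pearson`, `select_top_k`) and the toy feature names and values are assumptions made up for this example; they are not taken from the study's dataset. The sketch uses the plug-in estimate for discrete variables, I(X; Y) = sum over (x, y) of p(x, y) log2[p(x, y) / (p(x) p(y))].

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits for two discrete sequences."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


def pearson(xs, ys):
    """Pearson correlation, shown only for contrast with mutual information."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def select_top_k(features, target, k):
    """Rank features by mutual information with the target and keep the top k."""
    scores = {name: mutual_information(col, target) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Toy data, invented for illustration only.
signed_level = [-2, -1, 1, 2] * 5                        # hypothetical feature
target = [1 if abs(v) == 2 else 0 for v in signed_level]  # depends on |x| only
unrelated = [0, 0, 1, 1] * 5                             # independent of the target

features = {"signed_level": signed_level, "unrelated": unrelated}
selected, scores = select_top_k(features, target, k=1)

print("Pearson(signed_level, target):", pearson(signed_level, target))
print("MI scores (bits):", scores)
print("Selected:", selected)
```

In this toy case the target depends only on the magnitude of `signed_level`, so the linear correlation is zero while the mutual information is one full bit, and the feature is still ranked first. For continuous features on real data, a library estimator such as scikit-learn's `mutual_info_classif` would be the more usual choice; whether the study used it is not stated here.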

Keywords

classification; lung cancer; mutual information; XGBoost; K-Fold cross validation

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
