Data Balancing Approach Using Combine Sampling on Sentiment Analysis With K-Nearest Neighbor

Evlyn Pricilia Kondy; Siswanto Siswanto; Nirwan Ilyas

doi:10.32520/stmsi.v13i5.4013

Abstract

One topic discussed on Twitter is the policy on removing masks. Data collected from Twitter, however, may have imbalanced classes, and such an imbalance can degrade classification performance. Combine sampling is a data-balancing strategy that combines oversampling and undersampling techniques. The data in this study consist of Indonesian tweets using the hashtag "The Policy of Removing Masks." The classification method was K-Nearest Neighbor, with SMOTE as the oversampling technique and Tomek Links as the undersampling technique. The purpose of this research is to classify sentiment with the K-Nearest Neighbor algorithm and to use combine sampling to balance the training data across the two imbalanced classes. After the data were split, the training set contained 234 tweets with positive sentiment and 652 with negative sentiment, making the positive class the minority and the negative class the majority. After combine sampling, the training set contained 613 tweets in each class. Sentiment classification on the balanced data yielded an accuracy of 60.4%, a precision of 78.5%, and a recall of 65%. The accuracy of 60.4% is attributed to the model misinterpreting tweets about Indonesia's mask removal policy, which led to incorrect classifications.
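The combine sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`smote`, `tomek_links`, `combine_sampling`), the 2-D toy points, the neighbour count `k=3`, and the choice to remove both members of each Tomek link are all assumptions made for illustration. The reported drop from 652 to 613 tweets in both classes would be consistent with removing both members of each link, but the abstract does not say so.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each interpolated between a randomly
    chosen minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(minority[i], minority[j]))
        j = rng.choice(others[:k])
        gap = rng.random()  # position along the segment between the two points
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(minority[i], minority[j]))
        )
    return synthetic


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours
    and carry different labels."""
    n = len(points)
    nn = [
        min((j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]))
        for i in range(n)
    ]
    return [
        (i, j) for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]


def combine_sampling(points, labels, minority_label, k=3, seed=0):
    """Oversample the minority class with SMOTE until both classes match,
    then drop both members of every Tomek link (two-class data assumed)."""
    minority = [p for p, y in zip(points, labels) if y == minority_label]
    n_new = len(points) - 2 * len(minority)  # majority count minus minority count
    points = list(points) + smote(minority, n_new, k, seed)
    labels = list(labels) + [minority_label] * n_new
    drop = {i for pair in tomek_links(points, labels) for i in pair}
    keep = [i for i in range(len(points)) if i not in drop]
    return [points[i] for i in keep], [labels[i] for i in keep]


# Hypothetical toy data: 30 "neg" points and 10 "pos" points in 2-D.
rng = random.Random(42)
neg = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)]
pos = [(rng.gauss(1.5, 1), rng.gauss(1.5, 1)) for _ in range(10)]
X, y = combine_sampling(neg + pos, ["neg"] * 30 + ["pos"] * 10, "pos")
print(y.count("neg"), y.count("pos"))
```

In this sketch the two classes stay equal in size after cleaning, because every removed Tomek link takes one point from each class. In practice tweets would first be converted to numeric feature vectors, a step the sketch leaves out.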

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.