Optimization of IndoBERT-Lite Fine-Tuning for Spam Detection in Digital Customer Services

Farouq Mulya Al Simabua, Lathifah Alfat

Abstract


Automated text moderation systems on public service platforms are often targeted by manipulative spam messages from brokers offering illegal financial services. Previous text classification studies have frequently prioritized high accuracy metrics while overlooking data leakage caused by repetitive spam templates: when near-identical messages appear in both the training and test sets, reported scores reflect memorization rather than generalization and can mask severe overfitting. This study designs and optimizes a Natural Language Processing (NLP) model based on the IndoBERT-Lite architecture to distinguish organic user complaints from manipulative broker-generated comments. The proposed methodology centers on aggressive data deduplication, reducing 55,156 raw records to 4,626 unique samples with a roughly balanced class distribution (57.1% organic and 42.9% spam). Training was optimized using Gradient Accumulation and Early Stopping to support genuine generalization. The evaluation results show that the optimized model mitigated the initial overfitting problem, achieving an accuracy and an F1-score of 98% on unseen test data. These findings provide a reliable, data-leakage-free automated moderation solution for internal digital customer service systems.
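The two safeguards named in the abstract, deduplication before splitting and early stopping, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the normalization rules, the `<num>` placeholder, the function names, and the patience value are all assumptions chosen for illustration, and the abstract does not state how duplicates were actually detected.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Map templated variants of a message to one comparison key (assumed rules)."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"\d+", "<num>", text)    # phone numbers and amounts differ per copy
    text = re.sub(r"[^\w<>\s]", " ", text)  # drop punctuation and emoji
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(records):
    """Keep the first (text, label) pair seen for each normalized text.

    Running this BEFORE the train/test split keeps copies of one spam
    template from landing on both sides of the split.
    """
    seen = set()
    unique = []
    for text, label in records:
        key = normalize(text)
        if key and key not in seen:
            seen.add(key)
            unique.append((text, label))
    return unique


class EarlyStopping:
    """Signal a stop once validation loss fails to improve for `patience` epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience    # hypothetical default, not from the paper
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, `deduplicate` would run once on the raw records before any split, and `EarlyStopping.step` would be called after each validation pass. Gradient accumulation is not shown; it sums gradients over several small batches before each optimizer update, so the effective batch size is the per-step batch size multiplied by the number of accumulation steps.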

Keywords


broker spam; data leakage; early stopping; IndoBERT-Lite; text classification

DOI: https://doi.org/10.32520/stmsi.v15i5.6398

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.