Novel Genre Classification based on Synopsis using the Random Forest Algorithm

Prananing Mahanani, Charitas Fibriani (SCOPUS ID=57192643331)

Abstract


Classifying novel genres from synopses is a challenging text-processing task because each genre has distinct lexical characteristics. This study evaluates the performance of the Random Forest algorithm in classifying novel genres under an imbalanced class distribution. The research stages are text preprocessing (case folding, tokenization, stopword removal, and stemming), feature extraction using Term Frequency–Inverse Document Frequency (TF-IDF), and model training with Random Forest. In addition, the data were balanced manually by adding samples to minority classes through simple oversampling. The model was evaluated using accuracy and confusion matrix analysis. The results indicate that Random Forest identifies most genres with moderate accuracy, particularly those with larger data volumes. The initial model achieved an accuracy of 42.11%, which increased to 46.67% after data balancing. Misclassification occurred primarily in genres with few samples that share vocabulary with dominant genres. These findings suggest that Random Forest can still be applied to synopsis-based novel genre classification without relying entirely on balancing techniques. However, performance remains uneven across classes, so per-genre analysis is needed for a more comprehensive evaluation.
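
The balancing and per-genre evaluation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (oversample, per_class_recall), the genre labels, and the toy predictions are all hypothetical, and the sketch uses only the Python standard library, so the TF-IDF and Random Forest stages are left out. It shows how simple random oversampling equalizes class sizes and why a single accuracy figure can hide weak minority genres.

```python
import random
from collections import Counter, defaultdict


def oversample(texts, labels, seed=0):
    """Simple random oversampling: duplicate randomly chosen samples of each
    smaller class until every class matches the size of the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


def confusion_matrix(y_true, y_pred):
    """Count (true genre, predicted genre) pairs."""
    return Counter(zip(y_true, y_pred))


def accuracy(y_true, y_pred):
    """Share of samples whose predicted genre equals the true genre."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_recall(y_true, y_pred):
    """Fraction of each true genre that was predicted correctly."""
    matrix = confusion_matrix(y_true, y_pred)
    support = Counter(y_true)
    return {genre: matrix[(genre, genre)] / n for genre, n in support.items()}


# Hypothetical toy data, not taken from the study.
texts = [f"synopsis {i}" for i in range(10)]
y_true = ["romance"] * 6 + ["horror"] * 2 + ["fantasy"] * 2
y_pred = ["romance"] * 6 + ["romance", "horror"] + ["romance", "romance"]

# Balance the class sizes by duplicating minority-class samples.
balanced_texts, balanced_labels = oversample(texts, y_true)
print(dict(Counter(balanced_labels)))

# Compare the overall score with the per-genre view.
overall = accuracy(y_true, y_pred)
recalls = per_class_recall(y_true, y_pred)
print(f"accuracy = {overall:.2f}")
for genre, recall in recalls.items():
    print(f"{genre}: recall = {recall:.2f}")
```

In this toy case the overall accuracy is 0.70 while the recall for the smallest genres is 0.50 and 0.00, which is the kind of unevenness a per-genre analysis exposes. Oversampling is usually applied to the training split only, so that duplicated synopses do not leak into the test set.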

Keywords


classification; imbalanced data; novel genre; random forest


DOI: https://doi.org/10.32520/stmsi.v15i1.5815


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.