Leveraging TF-IDF and Random Forest to Uncover Genre Patterns in Google Books Metadata

Main Article Content

Nadya Awalia Putri
Bayu Priya Mukti

Abstract

This paper presents a machine learning-based approach for classifying books into genres using their descriptions. We employed a Random Forest classifier combined with Term Frequency-Inverse Document Frequency (TF-IDF) to convert text descriptions into numerical features, enabling the classification of books into six genres: Fiction, Literary Criticism, Education, Social Science, Biography & Autobiography, and Unknown Genre. The model was trained and evaluated on a dataset sourced from Google Books, which was preprocessed to remove missing data and clean the text descriptions by eliminating punctuation, numbers, and stopwords. We performed 5-fold cross-validation to assess the model's performance, which resulted in an average cross-validation accuracy of 64.22%. The final model achieved an accuracy of 62.71% on the test set, with the highest recall observed in the "Fiction" genre. The results indicated that the Random Forest classifier was particularly effective in classifying well-represented genres like "Fiction" and "Unknown Genre." However, genres with fewer samples, such as "Social Science" and "Biography & Autobiography," showed poor performance, highlighting the challenges posed by class imbalance and data sparsity. A confusion matrix and classification report revealed these discrepancies, with certain genres being misclassified more often than others. This research demonstrates the feasibility of using machine learning for automated book genre classification, offering significant potential for enhancing book recommendation systems and improving user experience. Despite its promising results, the study's limitations, including data sparsity and genre imbalance, suggest that further work is needed to refine the model. Future research could explore the use of deep learning techniques and the expansion of the dataset to address these issues and improve genre classification accuracy. The potential for automated genre classification in real-world applications, such as book categorization and personalized recommendations, presents an exciting direction for the book industry.

Article Details

How to Cite
[1]
N. A. Putri and B. P. Mukti, “Leveraging TF-IDF and Random Forest to Uncover Genre Patterns in Google Books Metadata”, Int. J. Appl. Inf. Manag., vol. 5, no. 4, pp. 168–178, Nov. 2025.
Section
Articles