Leveraging TF-IDF and Random Forest to Uncover Genre Patterns in Google Books Metadata

Nadya Awalia Putri; Bayu Priya Mukti

doi:10.47738/ijaim.v5i4.112

Published: Nov 21, 2025

Nadya Awalia Putri

Magister of Computer Science, Amikom Purwokerto University, Indonesia

Bayu Priya Mukti

Magister of Computer Science, Amikom Purwokerto University, Indonesia

Abstract

This paper presents a machine learning-based approach for classifying books into genres using their descriptions. We employed a Random Forest classifier combined with Term Frequency-Inverse Document Frequency (TF-IDF) to convert text descriptions into numerical features, enabling the classification of books into six genres: Fiction, Literary Criticism, Education, Social Science, Biography & Autobiography, and Unknown Genre. The model was trained and evaluated on a dataset sourced from Google Books, which was preprocessed to remove missing data and clean the text descriptions by eliminating punctuation, numbers, and stopwords. We performed 5-fold cross-validation to assess the model's performance, which resulted in an average cross-validation accuracy of 64.22%. The final model achieved an accuracy of 62.71% on the test set, with the highest recall observed in the "Fiction" genre. The results indicated that the Random Forest classifier was particularly effective in classifying well-represented genres like "Fiction" and "Unknown Genre." However, genres with fewer samples, such as "Social Science" and "Biography & Autobiography," showed poor performance, highlighting the challenges posed by class imbalance and data sparsity. A confusion matrix and classification report revealed these discrepancies, with certain genres being misclassified more often than others. This research demonstrates the feasibility of using machine learning for automated book genre classification, offering significant potential for enhancing book recommendation systems and improving user experience. Despite its promising results, the study's limitations, including data sparsity and genre imbalance, suggest that further work is needed to refine the model. Future research could explore the use of deep learning techniques and the expansion of the dataset to address these issues and improve genre classification accuracy. 
The potential for automated genre classification in real-world applications, such as book categorization and personalized recommendations, presents an exciting direction for the book industry.
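The feature-extraction step described in the abstract can be illustrated with a minimal, standard-library sketch. It is not the authors' code: the stopword list, the example descriptions, the function names `clean` and `tfidf`, and the choice of a smoothed IDF with L2 normalization are all assumptions made for illustration, since the abstract does not state which TF-IDF variant or stopword list was used. The resulting vectors are the kind of numerical input a Random Forest classifier would then be trained on.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; the paper does not list its stopwords).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "on", "for", "with"}


def clean(text):
    """Lowercase, replace punctuation and digits with spaces, then drop stopwords."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOPWORDS]


def tfidf(docs):
    """Return the sorted vocabulary and one L2-normalized TF-IDF vector per document."""
    tokenized = [clean(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (assumed variant): ln((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0  # guard against empty documents
        vectors.append([x / norm for x in raw])
    return vocab, vectors


# Hypothetical book descriptions, not taken from the Google Books dataset.
descriptions = [
    "A gripping novel of love and war in 1920 Paris.",
    "An introduction to classroom teaching methods for new teachers.",
    "A novel about a family torn apart by war.",
]
vocab, vectors = tfidf(descriptions)
```

Terms that appear in several descriptions (here "novel" and "war") receive a lower weight than terms unique to one description, which is why TF-IDF tends to emphasize genre-specific vocabulary over words shared across genres.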

How to Cite

[1]

N. A. Putri and B. P. Mukti, “Leveraging TF-IDF and Random Forest to Uncover Genre Patterns in Google Books Metadata”, Int. J. Appl. Inf. Manag., vol. 5, no. 4, pp. 168–178, Nov 2025.

Issue

Vol 5 No 4 (2025): Regular Issue: December 2025

Section

Articles

This article is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Authors who publish with the International Journal for Applied Information Management agree to the following terms: Authors retain copyright and grant the International Journal for Applied Information Management right of first publication, with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0) that allows others to share (copy and redistribute the material in any medium or format) and adapt (remix, transform, and build upon the material) the work for any purpose, even commercially, with an acknowledgement of the work's authorship and initial publication in the International Journal for Applied Information Management. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in the International Journal for Applied Information Management. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (see The Effect of Open Access).

ISSN: 2776-8007 (Online)
Published by: Bright Institute
Website: ijaim.net
Email: agung@ijaim.net (managing editor), support@ijaim.net (technical issues)
