

Multi-Label Text Classification (MLTC) is a crucial task in natural language processing (NLP): it assigns multiple labels to a single text sample, which matches the diverse, multifaceted nature of discussions typically found on Bulletin Board Systems (BBS). This study investigates multi-label text classification methods on a dataset of 388,693 entries, of which 234,237 were manually annotated for model training. The dataset covers diverse text from prominent social platforms, including GitHub, H5-based forums, WeChat, QQ group chats, and others. Four classification methods are compared: a BERT+BiLSTM model with Binary Cross-Entropy (BCE) loss; pre-trained BERT for feature extraction followed by a BiLSTM with BCE; a BERT+BiLSTM model with Focal Loss (FL); and pre-trained BERT for feature extraction followed by a BiLSTM with FL. The experiments show that models using pre-trained BERT for feature extraction outperform those without pre-training, and that Focal Loss is a superior alternative to Binary Cross-Entropy, handling class imbalance and noisy data more effectively and thereby improving overall accuracy and robustness. These findings underscore the importance of careful model architecture and loss function selection. Future work includes exploring ensemble methods, alternative pre-training strategies for BERT, and improved model interpretability. Keeping pace with NLP advances and integrating state-of-the-art techniques into future investigations promises further gains in model efficacy and practical utility.
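
Because the abstract's central technical claim concerns Focal Loss versus BCE on imbalanced multi-label data, a minimal PyTorch-style sketch of a binary focal loss for the multi-label setting is given below for reference. It follows the standard formulation FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t) from Lin et al. (2017); the function name and the default hyperparameters (gamma = 2.0, alpha = 0.25) are illustrative assumptions and are not taken from the study's implementation.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Illustrative element-wise focal loss for multi-label classification.

    logits:  raw model outputs, shape (batch, num_labels)
    targets: 0/1 label matrix of the same shape (float tensor)
    """
    # Unreduced BCE per label, so each term can be re-weighted individually.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # p_t: the model's predicted probability of the true class for each label.
    p = torch.sigmoid(logits)
    p_t = targets * p + (1 - targets) * (1 - p)
    # alpha_t balances positive versus negative labels.
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)
    # (1 - p_t)^gamma down-weights easy, already well-classified labels,
    # focusing the gradient on hard or rare labels.
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```

In a hypothetical training step one would compute logits = model(batch) and loss = binary_focal_loss(logits, labels.float()). With gamma = 0 and alpha = 0.5 the expression reduces, up to a constant factor, to plain BCE, which is why Focal Loss can be viewed as a re-weighted BCE that emphasizes hard and minority labels.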