With the rapid development of e-commerce and online review platforms, the number of product reviews has multiplied, making it important for both businesses and consumers to mine valuable information from them. Text classification methods are the main approach to this kind of problem. Text classification involves several steps, and each step offers a choice among different methods or components, so many combinations are possible. However, previous work has lacked a comparison of these combinations. In this paper, different combinations of text classification components are constructed and evaluated. In the feature selection and weighting step, mutual information, information gain, the chi-square test and TF-IDF are used as alternatives. In the text classification step, four frequently used machine learning methods are selected as components. The experiments are conducted on an annotated corpus of Chinese car reviews. Results show that the combination of the chi-square test and the Support Vector Machine algorithm obtains the best performance. The relationship between performance and the number of features is also studied, and an empirical corpus size for this kind of task is given.
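As an illustration of the chi-square feature selection step named above, the following is a minimal sketch in plain Python, not the implementation used in the experiments. It assumes the standard 2x2 contingency form of the chi-square statistic for a term and a class, computed from document frequencies. The function names `chi_square_scores` and `select_top_k` and the four-document toy corpus are hypothetical and chosen only for this example.

```python
from collections import Counter


def chi_square_scores(docs, labels, target):
    """Score each term by its chi-square statistic against class `target`.

    docs   : list of token lists (one list per document)
    labels : list of class labels, same length as docs
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == target)
    n_neg = n - n_pos

    df_pos = Counter()  # documents in the target class that contain the term
    df_neg = Counter()  # documents outside the target class that contain it
    for tokens, y in zip(docs, labels):
        for term in set(tokens):  # count each term once per document
            if y == target:
                df_pos[term] += 1
            else:
                df_neg[term] += 1

    scores = {}
    for term in set(df_pos) | set(df_neg):
        a = df_pos[term]   # in class, term present
        b = df_neg[term]   # not in class, term present
        c = n_pos - a      # in class, term absent
        d = n_neg - b      # not in class, term absent
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[term] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy, made-up corpus (English tokens for readability; not the paper's data).
docs = [
    ["good", "engine"],
    ["good", "price"],
    ["bad", "engine"],
    ["bad", "noise"],
]
labels = ["pos", "pos", "neg", "neg"]

scores = chi_square_scores(docs, labels, target="pos")
top_terms = select_top_k(scores, 2)
```

In this toy corpus, terms that appear in only one class ("good", "bad") receive the highest scores, while "engine", which appears equally in both classes, scores zero. The selected terms would then be weighted and passed to a classifier such as a Support Vector Machine.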