Multilayered Class Discrimination in Large-Scale Taxonomies

Gomez, Juan Carlos; Moens, Marie-Francine

doi:10.3233/978-1-61499-105-2-615

Abstract

In this work we implement and evaluate a methodology for classifying multi-labeled web documents into large-scale taxonomies using their text content. Multi-label hierarchical classification over large-scale taxonomies is a hard task because training data are scarce in many nodes of the hierarchy, content overlaps, and decision surfaces are complex. We propose a novel feature extraction model called Multilayered Class Discrimination (MCD), which reduces the dimensionality of the text-content features of the web documents along the different levels of the hierarchy, helping to discriminate each class from the other classes at the same level and reducing the effects of these problems. Results from categorizing web documents from the DMOZ directory show that our model improves categorization accuracy compared with using word features, and that the results are competitive with those presented in the Second LSHTC Challenge.
