

Imbalanced text classification, a practical and essential form of text classification, is the task of learning labels or categories for imbalanced text data. Existing imbalanced text classification approaches are mostly based on the imbalance ratio (i.e., the ratio of sizes between categories). Recent studies have verified that the imbalance ratio severely affects classifier performance when intrinsic characteristics of the data, such as class overlapping and small disjuncts, are present. However, since the distribution of real-world data is unknown, it is difficult to describe these intrinsic characteristics directly. In this paper, we transform the unknown distribution of the data into a graph model and present a graph-based imbalance index, named GIR, to predict the impact of imbalanced text data on classification performance. First, we introduce an environmental factor that makes the imbalance index sensitive to the intrinsic characteristics of the data. Second, we propose a graph-based method to calculate this environmental factor. Finally, we use the imbalance index to analyze the performance of imbalanced learning methods and the impact of imbalanced data on text classifiers. Experimental results on both synthetic and real-world data sets demonstrate the effectiveness of our approach.
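
For concreteness, the conventional imbalance ratio mentioned above can be sketched as follows. This is a minimal illustration only: the function name `imbalance_ratio` is hypothetical, and taking the ratio of the largest class size to the smallest is one common convention that we assume here. It is not the GIR index proposed in this paper.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Ratio of the largest class size to the smallest class size.

    Assumption: "ratio of sizes between categories" is read as
    majority count / minority count, a common convention.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to compute a ratio")
    return max(counts.values()) / min(counts.values())


# Toy example: 90 documents in one category, 10 in the other.
labels = ["ham"] * 90 + ["spam"] * 10
print(imbalance_ratio(labels))  # 9.0
```

Because this quantity depends only on class sizes, two data sets with identical counts receive the same value regardless of how much their classes overlap, which is the limitation that motivates an index sensitive to intrinsic data characteristics.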