MNDO: Multivariate Normal Distribution Based Over-Sampling for Binary Classification

Ambai, Kotaro; Fujita, Hamido

doi:10.3233/978-1-61499-900-3-425

Abstract

Datasets in which the numbers of majority-class and minority-class instances are unequal are called imbalanced datasets. Learning algorithms are difficult to apply to the classification of such datasets. Because many real-world datasets are imbalanced, the imbalance problem is studied by researchers in many fields. Sampling is one way to handle the imbalance problem: sampling techniques balance the numbers of majority-class and minority-class instances. However, many over-sampling techniques synthesize samples using the distance between existing samples, without using the correlation between attributes. In this paper, we propose Multivariate Normal Distribution based Over-sampling (MNDO), which takes the correlation in the dataset into account. MNDO first calculates the correlation coefficients between the attributes of the positive class. It then generates new samples using a multivariate normal distribution, which is computed from the two attributes with the strongest correlation. Attributes whose correlation is very weak are over-sampled using a univariate normal distribution. Because the proposed method uses statistics of the positive class, it can recover missing values that exist in the imbalanced dataset. In addition, outliers can be reproduced stochastically, so more realistic samples can be generated. We used 39 imbalanced datasets in the experiment. For comparison with existing methods, we used 6 sampling methods (SMOTE, Borderline SMOTE 1, Borderline SMOTE 2, ADASYN, SMOTE-ENN, SMOTE-Tomek), 3 learning methods (SVM, Decision Tree, k-NN), and 2 scaling methods (normalization, standardization). In the experiment, the proposed method showed excellent results for some datasets.
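
The sampling steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not give the pairing strategy or the cutoff for "very weak" correlation, so the function name `mndo_sketch`, the greedy pairing of attributes by descending absolute correlation, and the `weak_threshold` value are all assumptions made for illustration. The sketch also assumes complete numeric data with at least two positive-class samples.

```python
from itertools import combinations

import numpy as np


def mndo_sketch(X_pos, n_new, weak_threshold=0.2, seed=0):
    """Illustrative correlation-aware over-sampling of the positive class.

    Hypothetical sketch: the pairing rule and weak_threshold are assumptions,
    not details taken from the paper.
    """
    rng = np.random.default_rng(seed)
    X_pos = np.asarray(X_pos, dtype=float)
    n_attr = X_pos.shape[1]

    # Per-attribute statistics of the positive (minority) class.
    mean = X_pos.mean(axis=0)
    std = X_pos.std(axis=0, ddof=1)

    # Absolute correlation between attributes; constant columns give NaN,
    # which is treated here as "no correlation".
    corr = np.atleast_2d(np.corrcoef(X_pos, rowvar=False))
    abs_corr = np.nan_to_num(np.abs(corr))

    synthetic = np.empty((n_new, n_attr))
    done = np.zeros(n_attr, dtype=bool)

    # Assumption: greedily pair attributes, strongest correlation first.
    pairs = sorted(combinations(range(n_attr), 2), key=lambda p: -abs_corr[p])
    for i, j in pairs:
        if done[i] or done[j] or abs_corr[i, j] < weak_threshold:
            continue
        # Bivariate normal fitted to the two paired attributes.
        cov = np.cov(X_pos[:, [i, j]], rowvar=False)
        synthetic[:, [i, j]] = rng.multivariate_normal(mean[[i, j]], cov, size=n_new)
        done[[i, j]] = True

    # Remaining (weakly correlated or unpaired) attributes: univariate normal.
    for k in np.flatnonzero(~done):
        synthetic[:, k] = rng.normal(mean[k], std[k], size=n_new)

    return synthetic
```

Calling `mndo_sketch(X_pos, n_new)` with the positive-class rows of a dataset returns `n_new` synthetic rows that can be appended to the training set with the positive label. Drawing from fitted normal distributions, rather than interpolating between neighbours as SMOTE does, is what allows values far from the existing samples to appear with low probability.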
