RESOLVING MULTI-CLASS IMBALANCE USING COUNTERFACTUAL DATA AUGMENTATION

  • Hoai Bac Le
  • Thai Nguyen Xuan
  • Quang Hung Nguyen

Abstract

Data imbalance in multi-class classification, where some classes have significantly fewer
samples than others, is a prevalent challenge in machine learning. This paper evaluates the
effectiveness of Counterfactual Augmentation for Multi-Class (CFA-MC), a data augmentation
technique that utilizes "native counterfactuals" to generate synthetic samples for minority classes,
compared to two popular oversampling techniques, SMOTE and ADASYN. We conduct experiments
on multiple benchmark multi-class datasets and employ three different classifiers (Random Forest, k-Nearest Neighbors, and Multilayer Perceptron) to assess the performance of the three methods.
Experimental results demonstrate that CFA-MC consistently outperforms SMOTE and ADASYN in
terms of macro-averaged F1-score, suggesting its ability to generate more plausible synthetic
samples, improve decision boundary representation, and mitigate overfitting.
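To illustrate the core idea behind counterfactual augmentation, the sketch below is a simplified, illustrative reading of the "native counterfactual" strategy (pairing each minority instance with its nearest unlike neighbor from another class and synthesizing a new minority sample between them), not the exact CFA-MC algorithm evaluated in the paper; the function name, the `alpha` interpolation bound, and the Euclidean distance choice are assumptions made for this sketch.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def counterfactual_augment(minority, majority, n_new, alpha=0.9, seed=0):
    """Generate n_new synthetic minority samples.

    For each synthetic point: pick a minority instance, find its nearest
    majority-class neighbor (its "native counterfactual"), and place the
    new sample on the segment between the pair, kept near the minority
    side so it hugs the decision boundary without crossing it.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # nearest unlike neighbor: closest real instance of the other class
        nun = min(majority, key=lambda m: euclidean(x, m))
        # interpolate only a small step toward the counterfactual
        t = rng.uniform(0.0, 1.0 - alpha)
        synthetic.append([xi + t * (ni - xi) for xi, ni in zip(x, nun)])
    return synthetic
```

A quick usage example: with a two-feature minority cluster near the origin and a majority cluster far away, the generated samples stay on the minority side of the gap, which is the property that distinguishes this family of methods from interpolation purely among minority points (as in SMOTE).

```python
minority = [[0.0, 0.0], [0.2, 0.1]]
majority = [[5.0, 5.0], [6.0, 5.5]]
synthetic = counterfactual_augment(minority, majority, n_new=5, seed=1)
```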

References

[1]. He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263–1284.
[2]. Japkowicz, N., & Stephen, S. (2002). The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5), 429–449.
[3]. Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16(1), 321–357.
[4]. He, H., Bai, Y., Garcia, E. A., & Li, S. (2008). ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence) (pp. 1322–1328). IEEE.
[5]. Temraz, M., & Keane, M. T. (2022). Solving the class imbalance problem using a counterfactual method for data augmentation. Machine Learning with Applications, 9, 100375.
[6]. Mantiuk, F., & Kurth, L. (2022). Comment on “Solving the Class Imbalance Problem Using a Counterfactual Method for Data Augmentation”.
[7]. Dua, D., & Graff, C. (2017). UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences.
[8]. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
[9]. Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27.
[10]. Haykin, S. (2009). Neural networks and learning machines. Pearson Education.
[11]. Powers, D. M. (2011). Evaluation: From precision, recall and F-measure to ROC, informedness, markedness & correlation. Journal of Machine Learning Technologies, 2(1), 37–63.
[12]. Keane, M. T., & Smyth, B. (2020). Good counterfactuals and where to find them: A case-based technique for generating counterfactuals for explainable AI (XAI). In Proceedings of the 28th International Conference on Case-Based Reasoning (pp. 163–178). Springer.
[13]. Alcalá-Fdez, J., Fernandez, A., Luengo, J., Derrac, J., García, S., Sánchez, L., & Herrera, F. (2011). KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and Soft Computing, 17, 255–287.
[14]. Hosmer Jr, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression. John Wiley & Sons.
[15]. Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874.
[16]. Han, H., Wang, W. Y., & Mao, B. H. (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing (pp. 878–887). Springer.
[17]. Bunkhumpornpat, C., Sinapiromsaran, K., & Lursinsap, C. (2009). Safe-level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 475–482). Springer.
[18]. Nguyen, H. M., Cooper, E. W., & Kamei, K. (2011). Borderline over-sampling for imbalanced data classification. International Journal of Knowledge Engineering and Soft Data Paradigms, 3(1), 4–21.
[19]. Ramentol, E., Caballero, Y., Bello, R., & Herrera, F. (2012). SMOTE-RSB: A hybrid preprocessing approach based on oversampling and undersampling for high imbalanced datasets using SMOTE and rough sets theory. Knowledge and Information Systems, 33, 245–265.
[20]. Zhang, Z., Luo, Y., & Chen, S. (2022). Multi-class imbalanced learning with Gaussian distribution-based oversampling. Knowledge-Based Systems, 242, 108358.
[21]. Li, J., Liu, X., & Yin, J. (2023). Adaptive synthetic sampling for multi-class imbalanced data. Pattern Recognition, 136, 109205.
[22]. Wang, S., Wang, Y., & Liu, Z. (2023). Multi-class imbalanced learning with K-means SMOTE and ensemble learning. Information Sciences, 618, 117–134.
[23]. Feng, L., Cai, Y., & Liu, J. (2022). Multi-class imbalanced data classification using evolutionary undersampling. Swarm and Evolutionary Computation, 71, 101083.
[24]. Chen, Z., Li, W., & Wang, S. (2023). Cluster-based undersampling for multi-class imbalanced data. Applied Soft Computing, 135, 109232.
[25]. Liu, Y., Yang, J., & Zhou, Z. (2022). A hybrid resampling method for multi-class imbalanced data classification. IEEE Access, 10, 35854–35866.
Published
2025-01-17