Resolving Multi-Class Imbalance using Counterfactual Data Augmentation
Abstract
Data imbalance in multi-class classification, where some classes have significantly fewer samples than others, is a prevalent challenge in machine learning. This paper evaluates the effectiveness of Counterfactual Augmentation for Multi-Class (CFA-MC), a data augmentation technique that uses "native counterfactuals" to generate synthetic samples for minority classes, against two popular oversampling techniques, SMOTE and ADASYN. We conduct experiments on multiple benchmark multi-class datasets and employ three different classifiers (Random Forest, k-Nearest Neighbors, and Multilayer Perceptron) to assess the performance of the three methods. Experimental results demonstrate that CFA-MC consistently outperforms SMOTE and ADASYN in terms of macro-averaged F1-score, suggesting its ability to generate more plausible synthetic samples, improve decision boundary representation, and mitigate overfitting.
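To make the contrast with interpolation-based oversamplers concrete, the core idea behind native-counterfactual augmentation can be sketched as follows. This is a minimal illustrative simplification, not the paper's exact algorithm: it pairs each sampled minority instance with its nearest unlike neighbour (its "native counterfactual" from another class, following Keane and Smyth) and generates a new minority sample on the minority side of that pair. The function names, the interpolation rule, and the `alpha` parameter are all hypothetical choices for this sketch; the actual CFA-MC method adapts matched feature values between the paired cases.

```python
import numpy as np

def nearest_unlike_neighbour(x, X_other):
    """Return the closest sample drawn from a different class
    (the 'native counterfactual' of x in this sketch)."""
    distances = np.linalg.norm(X_other - x, axis=1)
    return X_other[np.argmin(distances)]

def counterfactual_oversample(X_min, X_maj, n_new, alpha=0.5, seed=0):
    """Generate n_new synthetic minority samples (illustrative only).

    Each synthetic point lies between a random minority instance and its
    nearest unlike neighbour, biased toward the minority side so that new
    samples stay near the class boundary without crossing it.
    """
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        x = X_min[rng.integers(len(X_min))]
        nun = nearest_unlike_neighbour(x, X_maj)
        lam = rng.uniform(0.0, alpha)  # 0 <= lam <= alpha keeps us near x
        synthetic.append(x + lam * (nun - x))
    return np.array(synthetic)
```

Note the contrast with SMOTE, which interpolates between two *minority* neighbours and can place samples deep inside the minority region; anchoring generation on unlike-class pairs instead concentrates new samples where the decision boundary is actually contested, which is one plausible reading of why the paper observes better boundary representation.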
References
H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284, Sep. 2009.
N. Japkowicz and S. Stephen, "The class imbalance problem: A systematic study," Intelligent Data Analysis, vol. 6, no. 5, pp. 429–449, 2002.
N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic Minority Over-sampling Technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, Jun. 2002.
H. He, Y. Bai, E. A. Garcia, and S. Li, "ADASYN: Adaptive synthetic sampling approach for imbalanced learning," Proceedings of the IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pp. 1322–1328, 2008.
M. Temraz and M. T. Keane, "Solving the class imbalance problem using a counterfactual method for data augmentation," Machine Learning Applications, vol. 9, p. 100375, Jun. 2022.
F. Mantiuk and L. Kurth, "Comment on Solving the Class Imbalance Problem Using a Counterfactual Method for Data Augmentation," Machine Learning Applications, vol. 11, p. 100412, Dec. 2022.
D. Dua and C. Graff, "UCI Machine Learning Repository," University of California, Irvine, School of Information and Computer Sciences, 2017. [Online]. Available: http://archive.ics.uci.edu/ml
L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, Oct. 2001.
T. Cover and P. Hart, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, Jan. 1967.
S. Haykin, Neural Networks and Learning Machines, 3rd ed. Upper Saddle River, NJ, USA: Pearson Education, 2009.
D. M. Powers, "Evaluation: From precision, recall and F-measure to ROC, informedness, markedness & correlation," Journal of Machine Learning Technologies, vol. 2, no. 1, pp. 37–63, 2011.
M. T. Keane and B. Smyth, "Good counterfactuals and where to find them: A case-based technique for generating counterfactuals for explainable AI (XAI)," Proceedings of the 28th International Conference on Case-Based Reasoning, pp. 163–178, 2020.
A. Fernández, S. García, F. Herrera, and N. V. Chawla, "SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary," Journal of Artificial Intelligence Research, vol. 61, pp. 863–905, 2018.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
