Resolving Multi-Class Imbalance using Counterfactual Data Augmentation

  • Thai-Nguyen Xuan Faculty of Information Technology, University of Science, Ho Chi Minh, Vietnam / Vietnam National University, Ho Chi Minh, Vietnam
  • Quang Hung Nguyen Posts and Telecommunications Institute of Technology, Ha Noi, Vietnam
  • Bac-Le Faculty of Information Technology, University of Science, Ho Chi Minh, Vietnam / Vietnam National University, Ho Chi Minh, Vietnam
Keywords: Multi-class imbalance, data augmentation, counterfactuals, SMOTE, ADASYN, classification

Abstract

Data imbalance in multi-class classification, where some classes have significantly fewer samples than others, is a prevalent challenge in machine learning. This paper evaluates the effectiveness of Counterfactual Augmentation for Multi-Class (CFA-MC), a data augmentation technique that uses "native counterfactuals" to generate synthetic samples for minority classes, against two popular oversampling techniques, SMOTE and ADASYN. We conduct experiments on multiple benchmark multi-class datasets and employ three classifiers (Random Forest, k-Nearest Neighbors, and Multilayer Perceptron) to assess the performance of the three methods. Experimental results demonstrate that CFA-MC consistently outperforms SMOTE and ADASYN in terms of macro-averaged F1-score, suggesting that it generates more plausible synthetic samples, improves decision boundary representation, and mitigates overfitting.
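To illustrate why the macro-averaged F1-score is a natural metric for imbalanced multi-class problems, the following is a minimal sketch in plain Python. The helper name `macro_f1` and the toy labels are assumptions made for illustration only; they are not taken from the paper and do not reproduce its experiments.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: 8 samples of class "a", 1 each of "b" and "c".
y_true = ["a"] * 8 + ["b", "c"]
# A classifier that always predicts the majority class.
y_majority = ["a"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_majority)) / len(y_true)
score = macro_f1(y_true, y_majority)

print(f"accuracy = {accuracy:.3f}")  # 0.800
print(f"macro F1 = {score:.3f}")     # 0.296
```

In this toy case the always-majority classifier reaches 80% accuracy but a macro-averaged F1 of only about 0.30, because each minority class contributes an F1 of zero with the same weight as the majority class.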

Author Biographies

Thai-Nguyen Xuan, Faculty of Information Technology, University of Science, Ho Chi Minh, Vietnam / Vietnam National University, Ho Chi Minh, Vietnam

Thai-Nguyen Xuan is a Computer Science teacher at Trinh Hoai Duc High School in Thuan An, Binh Duong province, Vietnam. He received his Bachelor of Science in Computer Science Education from Ho Chi Minh City University of Education in 2008 and is currently pursuing a Master’s degree in Computer Science at the University of Science, Vietnam National University, Ho Chi Minh City, Vietnam. His research interests include Deep Learning, Computer Vision, and Bioinformatics.

Quang Hung Nguyen, Posts and Telecommunications Institute of Technology, Ha Noi, Vietnam

Quang Hung Nguyen received Bachelor’s and Master’s degrees from Hanoi University in 1994 and 2000, respectively, and a Ph.D. degree from the Posts and Telecommunications Institute of Technology in 2007. He is currently working at the Posts and Telecommunications Institute of Technology, Hanoi, Vietnam. His research interests include AI applications in e-commerce, multimedia, and policy.

Bac-Le, Faculty of Information Technology, University of Science, Ho Chi Minh, Vietnam / Vietnam National University, Ho Chi Minh, Vietnam

Bac Le is currently a Professor and Head of the Department of Computer Science, Faculty of Information Technology, University of Science, Vietnam National University, Ho Chi Minh City, Vietnam. His research interests include AI, XAI, Machine Learning, Data Science, and Soft Computing.

Published
2025-01-17