Performance Analysis of Deep Learning Models for Software Fault Prediction Using the BugHunter Dataset
Abstract
Software fault prediction (SFP) involves the identification of
potentially fault-prone modules before the testing phase in the software
development lifecycle. By predicting faults early in the development process, SFP enables software developers to focus their efforts on
components that may contain faults, thereby enhancing the overall quality and reliability of the software. Machine learning and deep learning
techniques have been widely applied to train SFP models. However, these
approaches face several challenges, including irrelevant or redundant features, imbalanced datasets, overfitting, and complex model structures.
The NASA dataset from the PROMISE repository is the most commonly
used dataset for fault prediction. Recently, the BugHunter dataset, with
its substantially larger number of instances, has been explored for training SFP
models. In this study, we present a comparative study of three deep
learning models, namely Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Long Short-Term Memory (LSTM), and
four machine learning models, namely K-Nearest Neighbors (KNN), Multilayer
Perceptron (MLP), Adaptive Boosting (AdaBoost), and Extreme Gradient
Boosting (XGB), to investigate the performance of SFP models on the
BugHunter dataset. We employ the Lasso method for feature selection
and apply the Synthetic Minority Oversampling Technique (SMOTE) to
address the issue of imbalanced data, aiming to enhance the accuracy
of the results. The experimental findings reveal that CNN and RNN
outperformed the other models, achieving the best overall
performance.
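
To make the preprocessing steps concrete, the sketch below shows one way the Lasso-based feature selection and SMOTE oversampling could be combined using scikit-learn and imbalanced-learn. It is a minimal illustration under assumed names: the CSV file, the "bug" label column, and the regularization strength are hypothetical placeholders, not the paper's actual configuration.

# Illustrative preprocessing sketch (assumed setup, not the authors' exact code):
# Lasso-based feature selection followed by SMOTE oversampling on BugHunter metrics.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel
from imblearn.over_sampling import SMOTE

df = pd.read_csv("bughunter_metrics.csv")      # hypothetical file name
X = df.drop(columns=["bug"])                    # "bug" label column is an assumption
y = (df["bug"] > 0).astype(int)                 # binarise: faulty vs. non-faulty

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Standardise metrics so the Lasso penalty treats all features comparably.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Lasso feature selection: keep features whose coefficients are not shrunk to zero.
selector = SelectFromModel(Lasso(alpha=0.01), threshold=1e-5).fit(X_train_s, y_train)
X_train_sel = selector.transform(X_train_s)
X_test_sel = selector.transform(X_test_s)

# SMOTE is applied to the training split only, leaving the test set untouched.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train_sel, y_train)

The balanced, reduced feature matrix X_train_bal can then be fed to any of the compared models (CNN, RNN, LSTM, KNN, MLP, AdaBoost, XGB); applying SMOTE after the train/test split avoids leaking synthetic samples into the evaluation data.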
