Published In
IJIRCT, Volume 7, Issue 5 (2021)
Page Numbers
1-7
Paper Details
A Comparative Analysis of Sampling Techniques for Imbalanced Datasets in Machine Learning
Authors
Sandeep Yadav
Abstract
In machine learning, the challenge of class imbalance—where one class is significantly underrepresented compared to others—often leads to models with poor predictive performance, especially for minority classes. This study provides a detailed comparative analysis of sampling techniques designed to address this imbalance, focusing on their effectiveness across different types of imbalanced datasets. The techniques examined include basic undersampling and oversampling, along with more sophisticated synthetic methods like SMOTE (Synthetic Minority Over-sampling Technique), ADASYN (Adaptive Synthetic Sampling), and borderline variants of SMOTE. Using several real-world and synthetic datasets, this research evaluates the performance of these techniques based on key metrics tailored for imbalanced data, such as F1-score, G-mean, precision, recall, and area under the precision-recall curve.
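The following is a minimal sketch, not the paper's actual pipeline, of how the sampling techniques and metrics named above can be compared using the scikit-learn and imbalanced-learn libraries. The synthetic dataset, the random-forest classifier, and all parameter values are illustrative assumptions and are not taken from the study.

# Illustrative comparison of the sampling techniques discussed in the abstract.
# Dataset, classifier, and parameters are assumptions, not the study's setup.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import ADASYN, SMOTE, BorderlineSMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.metrics import geometric_mean_score

# Synthetic imbalanced dataset with roughly a 5% minority class.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

samplers = {
    "random undersampling": RandomUnderSampler(random_state=42),
    "random oversampling": RandomOverSampler(random_state=42),
    "SMOTE": SMOTE(random_state=42),
    "Borderline-SMOTE": BorderlineSMOTE(random_state=42),
    "ADASYN": ADASYN(random_state=42),
}

for name, sampler in samplers.items():
    # Resample only the training split; the test split stays untouched.
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    clf = RandomForestClassifier(random_state=42).fit(X_res, y_res)
    y_pred = clf.predict(X_test)
    y_score = clf.predict_proba(X_test)[:, 1]
    print(f"{name:22s} "
          f"F1={f1_score(y_test, y_pred):.3f}  "
          f"G-mean={geometric_mean_score(y_test, y_pred):.3f}  "
          f"precision={precision_score(y_test, y_pred):.3f}  "
          f"recall={recall_score(y_test, y_pred):.3f}  "
          f"AUPRC={average_precision_score(y_test, y_score):.3f}")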
Our findings reveal that while undersampling can improve computational efficiency, it may lead to significant data loss and reduced model robustness. Conversely, oversampling, though effective in balancing the dataset, can introduce redundancy and increase model complexity. Among synthetic methods, SMOTE and its variants demonstrate improved performance by generating more diverse samples in the feature space, although they may also introduce noise when not carefully applied. ADASYN was particularly effective in scenarios with higher levels of imbalance, adapting sample generation based on instance difficulty. Ultimately, this study underscores the importance of matching the sampling method to the dataset's characteristics and the model's requirements, offering practitioners practical guidance for building balanced and fair machine learning models in imbalanced contexts.
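For readers unfamiliar with the adaptivity mentioned above, the generation rule from the original ADASYN proposal (He et al., 2008) can be summarized as follows; the notation is that paper's, not this study's:

G = (m_l - m_s)\,\beta, \qquad r_i = \frac{\Delta_i}{K}, \qquad \hat{r}_i = \frac{r_i}{\sum_{j=1}^{m_s} r_j}, \qquad g_i = \hat{r}_i \, G,

where m_l and m_s are the majority- and minority-class sizes, \beta \in [0, 1] sets the desired balance level, and \Delta_i counts the majority-class points among the K nearest neighbours of minority instance x_i. ADASYN then generates g_i synthetic samples for x_i as s = x_i + \lambda (x_z - x_i), with \lambda drawn uniformly from [0, 1] and x_z a randomly chosen minority-class neighbour, so that harder-to-learn minority instances (those surrounded by majority-class points) receive proportionally more synthetic samples.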
Keywords
Imbalanced Datasets, Sampling Techniques, Machine Learning, Undersampling, Oversampling, SMOTE (Synthetic Minority Over-sampling Technique), ADASYN (Adaptive Synthetic Sampling), Classification, Precision-Recall Curve, Model Robustness, Class Imbalance
Citation
A Comparative Analysis of Sampling Techniques for Imbalanced Datasets in Machine Learning. Sandeep Yadav. 2021. IJIRCT, Volume 7, Issue 5. Pages 1-7. https://www.ijirct.org/viewPaper.php?paperId=2411071