An overview and application of different oversampling techniques
Hi there! :)
This article is about applying 5 different oversampling techniques to the ICR (Identifying Age-Related Conditions) Kaggle competition. I described the competition briefly in my previous article, including how easy it is to overfit in it. The hope is that oversampling the minority classes will balance the positive and negative classes and make the model more stable.
We (my teammate is Dimitrii R.) tried different oversampling strategies and optimized the sampling strategy coefficients with Optuna (a sketch of this search follows the examples below). For example, the target has 4 classes (A is 0 and the rest are 1’s):
A 509 samples
B 61 samples
G 29 samples
D 18 samples
The result after oversampling could be
A 509
B 509
D 509
G 509
or
A 509
B 169
D 169
G 169
or
A 509
B 169
D 120
G 80
etc.
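In imbalanced-learn these target counts are passed as a dictionary to the sampler's sampling_strategy argument, so the coefficients can be searched directly by Optuna. Here is a minimal, hypothetical sketch; X, y and score_model are placeholders for the training data and a cross-validation scoring routine, and the class counts follow the example above.

```python
import optuna
from imblearn.over_sampling import RandomOverSampler

def objective(trial):
    base_counts = {"B": 61, "G": 29, "D": 18}  # original minority class sizes
    # one multiplier per minority class; the target count is capped at the
    # majority class size (509) and never drops below the original count
    strategy = {
        cls: min(509, int(n * trial.suggest_float(f"mult_{cls}", 1.0, 30.0)))
        for cls, n in base_counts.items()
    }
    sampler = RandomOverSampler(sampling_strategy=strategy, random_state=42)
    X_res, y_res = sampler.fit_resample(X, y)   # X, y: training data (placeholders)
    return score_model(X_res, y_res)            # hypothetical CV score to maximize

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
```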
1. Random Oversampling with Gaussian Noise Up-sampling
In this simple approach we use RandomOverSampler from imblearn.over_sampling, upsample according to the sampling strategy, and add some noise to the upsampled rows. The noise parameters are also tuned by Optuna.
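A rough sketch of the idea, assuming X and y are NumPy arrays with the training features and class labels; the noise_scale value is just an example (in our case it was an Optuna parameter).

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler

def oversample_with_noise(X, y, sampling_strategy, noise_scale=0.05, seed=42):
    """Duplicate minority rows with RandomOverSampler, then jitter the duplicates."""
    rng = np.random.default_rng(seed)
    sampler = RandomOverSampler(sampling_strategy=sampling_strategy, random_state=seed)
    X_res, y_res = sampler.fit_resample(X, y)

    # RandomOverSampler keeps the original rows first and appends the duplicates,
    # so only the appended rows get Gaussian noise (scaled per feature).
    n_original = len(X)
    feature_std = X.std(axis=0)
    noise = rng.normal(0.0, noise_scale * feature_std,
                       size=X_res[n_original:].shape)
    X_res = X_res.astype(float)
    X_res[n_original:] += noise
    return X_res, y_res
```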
2. Synthetic Minority Oversampling Technique (SMOTE)
I tried SMOTE, KMeansSMOTE and ADASYN (and didn’t try SMOTENC, SMOTEN, BorderlineSMOTE, or SVMSMOTE). It showed slightly lower performance on validation than the next approach. The parameters to tune with Optuna are the strategy coefficients and the number of neighbors. SMOTE is sensitive to NaNs, so you should impute them first.
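A minimal illustration, assuming X_train and y_train are the training split; the target counts and k_neighbors here are example values, not the ones found by Optuna.

```python
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE

# SMOTE interpolates between real rows, so NaNs must be imputed first
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X_train)

smote = SMOTE(
    sampling_strategy={"B": 169, "G": 169, "D": 169},  # per-class target counts
    k_neighbors=5,
    random_state=42,
)
X_res, y_res = smote.fit_resample(X_imputed, y_train)
```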
3. Oversampling Using Gaussian Mixture Models
Drawing samples from a Gaussian Mixture Model (GMM), or another generative model, is another creative oversampling technique that can potentially outperform SMOTE variants. Instead of linearly interpolating between existing points, this method fits a GMM to the minority class and generates new instances by sampling from it, which also accounts for outliers.
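A small sketch of this idea using scikit-learn's GaussianMixture; the number of components and the target counts are illustrative choices, not the ones we actually used.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_oversample(X_minority, n_new, n_components=3, seed=42):
    """Fit a GMM on the minority-class rows and draw n_new synthetic rows from it."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="full",
                          random_state=seed)
    gmm.fit(X_minority)
    X_new, _ = gmm.sample(n_new)   # returns (samples, component labels)
    return X_new

# Example: bring class "D" from 18 up to 169 rows
# X_d = X_imputed[y_train == "D"]
# X_d_synthetic = gmm_oversample(X_d, n_new=169 - 18)
```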
4. Synthetic Minority Over-Sampling Technique for Regression with Gaussian Noise (SMOGN)
SMOGN conducts the Synthetic Minority Over-Sampling Technique for Regression (SMOTER) with traditional interpolation, as well as with the introduction of Gaussian noise (SMOTER-GN). It selects between the two over-sampling techniques based on the KNN distances underlying a given observation: if the distance is close enough, SMOTER is applied; if it is too far away, SMOTER-GN is applied.
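A minimal usage sketch with the smogn package, assuming train_df is a pandas DataFrame with a numeric response column named "target" (SMOGN is a regression technique, so the target must be numeric):

```python
import smogn  # https://github.com/nickkunz/smogn

# returns a resampled DataFrame with synthetic rows for the rare target values
train_resampled = smogn.smoter(
    data=train_df,   # pandas DataFrame containing features and the response
    y="target",      # name of the numeric response column (assumed here)
)
```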
5. Tabular Data Augmentation Using Deep Learning
The most technically interesting approach I found and applied. deep_tabular_augmentation works on the simple idea that we want to keep the data in a dedicated class (which we call the Learner) together with the model. The data has to come as dataloader objects, which are stored in the DataBunch class: it holds the dataloaders for the training and test data. The runner class then defines the flow.
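To illustrate the underlying idea without relying on the library's exact interface, here is a rough, library-agnostic sketch in plain PyTorch: train a small autoencoder on the minority-class rows, then decode jittered latent codes into synthetic rows. This is an illustration of the concept only, not the deep_tabular_augmentation API.

```python
import torch
import torch.nn as nn

class TabularAE(nn.Module):
    """Small autoencoder for numeric tabular data (conceptual sketch)."""
    def __init__(self, n_features, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(),
                                     nn.Linear(32, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                                     nn.Linear(32, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_ae(model, X, epochs=200, lr=1e-3):
    """Plain reconstruction training; X is a float tensor of minority rows."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), X)
        loss.backward()
        opt.step()
    return model

def augment(model, X, n_new, noise=0.1):
    """Encode real rows, jitter the latent codes, decode synthetic rows."""
    model.eval()
    with torch.no_grad():
        idx = torch.randint(0, len(X), (n_new,))
        z = model.encoder(X[idx])
        z = z + noise * torch.randn_like(z)
        return model.decoder(z)
```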