Different oversampling techniques overview and application

Hi there! :)

This article covers five different oversampling techniques applied to the ICR (Identifying Age-Related Conditions) Kaggle competition. I described the competition briefly in my previous article, including how easy it is to overfit in it. The hope was that oversampling the minority classes would balance the positive and negative classes and make the model more stable.

We (my teammate was Dimitrii R.) tried different oversampling strategies and optimized the sampling-strategy coefficients with Optuna (a sketch of this tuning follows the examples below). For example, the target has 4 classes (A is 0, and the rest are 1's):

A    509 samples

B     61 samples

G     29 samples

D     18 samples

The result after oversampling could be

A    509

B    509

D    509

G    509

or

A    509

B    169

D    169

G    169

or

A    509

B    169

D    120

G    80

etc.
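Below is a minimal sketch of how such per-class targets can be tuned with Optuna. It is illustrative rather than our exact competition code: train_model is a hypothetical function returning a validation score, and X_train/y_train are assumed to already exist.

import optuna
from imblearn.over_sampling import RandomOverSampler

# Class counts from the example above.
base_counts = {"A": 509, "B": 61, "G": 29, "D": 18}

def objective(trial):
    # For each minority class, pick a target count between its original
    # size and the majority-class size.
    strategy = {
        cls: trial.suggest_int(f"n_{cls}", cnt, base_counts["A"])
        for cls, cnt in base_counts.items() if cls != "A"
    }
    sampler = RandomOverSampler(sampling_strategy=strategy, random_state=42)
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    return train_model(X_res, y_res)  # hypothetical: returns a CV score

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)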

1. Random Oversampling with Gaussian Noise Up-sampling

In this simple approach we use RandomOverSampler from imblearn.over_sampling, upsample according to the chosen strategy, and add some Gaussian noise to the upsampled rows. The noise parameters are also tuned by Optuna.
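A hedged sketch of this approach, assuming numeric features in X and relying on RandomOverSampler returning the original rows first, followed by the duplicated ones. The noise scale shown here is illustrative; in our case it was tuned by Optuna.

import numpy as np
from imblearn.over_sampling import RandomOverSampler

def random_oversample_with_noise(X, y, strategy, noise_std=0.01, seed=42):
    rng = np.random.default_rng(seed)
    sampler = RandomOverSampler(sampling_strategy=strategy, random_state=seed)
    X_res, y_res = sampler.fit_resample(X, y)
    X_res = np.asarray(X_res, dtype=float)
    n_original = len(X)
    # Rows beyond the original data are the duplicates: perturb only those.
    X_res[n_original:] += rng.normal(0.0, noise_std, X_res[n_original:].shape)
    return X_res, y_res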

2. Synthetic Minority Oversampling Technique (SMOTE)

I tried SMOTE, KMeansSMOTE and ADASYN (and didn't try SMOTENC, SMOTEN, BorderlineSMOTE or SVMSMOTE). They showed slightly lower validation performance than the next approach. The parameters to tune in Optuna are the strategy coefficients and the number of neighbors. SMOTE cannot handle NaNs, so you need to impute missing values first.
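A minimal sketch of this step, assuming X_train/y_train are already defined; the per-class targets and k_neighbors shown here are placeholders for the values Optuna would suggest.

from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE

imputer = SimpleImputer(strategy="median")
X_imp = imputer.fit_transform(X_train)  # SMOTE fails on NaNs

smote = SMOTE(
    sampling_strategy={"B": 169, "G": 169, "D": 169},  # example targets
    k_neighbors=5,
    random_state=42,
)
X_res, y_res = smote.fit_resample(X_imp, y_train)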

3. Oversampling Using Gaussian Mixture Models

Drawing samples from Gaussian Mixture Models (GMM), or other generative models, is another creative oversampling technique that can potentially outperform the SMOTE variants. To avoid SMOTE's linear interpolation and to take outliers into account, this method generates new instances from a Gaussian Mixture Model instead.
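Since this is a general idea rather than a specific library call, here is a minimal hand-rolled sketch: fit a Gaussian Mixture on each minority class and sample the missing rows from it. The number of mixture components is an assumption and would itself be a tuning parameter.

import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_oversample(X, y, targets, n_components=3, seed=42):
    """targets: dict mapping class label -> desired total count."""
    X_parts, y_parts = [X], [y]
    for cls, n_target in targets.items():
        X_cls = X[y == cls]
        n_new = n_target - len(X_cls)
        if n_new <= 0:
            continue
        # Fit a mixture to this minority class and draw the extra rows from it.
        gmm = GaussianMixture(n_components=n_components, random_state=seed)
        gmm.fit(X_cls)
        X_new, _ = gmm.sample(n_new)
        X_parts.append(X_new)
        y_parts.append(np.full(n_new, cls))
    return np.vstack(X_parts), np.concatenate(y_parts)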

4. Synthetic Minority Over-Sampling Technique for Regression with Gaussian Noise (SMOGN)

SMOGN conducts the Synthetic Minority Over-Sampling Technique for Regression (SMOTER) with traditional interpolation, as well as with the introduction of Gaussian noise (SMOTER-GN). It selects between the two over-sampling techniques based on the KNN distances underlying a given observation: if the neighbors are close enough, SMOTER is applied; if they are too far away, SMOTER-GN is applied.
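A hedged usage sketch with the smogn package. SMOGN expects a regression-style (numeric) target, so the class label is assumed to be encoded numerically before resampling; the column name and k value here are illustrative.

import smogn

df_res = smogn.smoter(
    data=train_df,        # pandas DataFrame with features + target column
    y="target",           # name of the (numeric) target column
    k=5,                  # KNN neighbours used to pick SMOTER vs SMOTER-GN
    samp_method="balance",
)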

5. Tabular Data Augmentation Using Deep Learning

This is the most technically interesting approach I found and applied. deep_tabular_augmentation works on the simple idea of keeping the data in a dedicated class (called the Learner) together with the model. The data has to come as a dataloader object, which is stored in the DataBunch class; it holds the dataloaders for the training and test data. The runner class then defines the flow.
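The library's exact API may differ from what is shown here, so below is only a plain-PyTorch sketch of the underlying idea: train an autoencoder on the minority-class rows and decode slightly perturbed latent codes to produce new synthetic rows. Layer sizes, noise scale and epoch count are illustrative assumptions.

import torch
import torch.nn as nn

class TabularAE(nn.Module):
    def __init__(self, n_features, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(),
                                     nn.Linear(32, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                                     nn.Linear(32, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def augment_minority(X_minority, n_new, epochs=200, noise=0.05):
    X = torch.tensor(X_minority, dtype=torch.float32)
    model = TabularAE(X.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):            # plain reconstruction training
        opt.zero_grad()
        loss = loss_fn(model(X), X)
        loss.backward()
        opt.step()
    with torch.no_grad():              # sample around the encoded rows
        z = model.encoder(X)
        idx = torch.randint(len(z), (n_new,))
        z_new = z[idx] + noise * torch.randn(n_new, z.shape[1])
        return model.decoder(z_new).numpy()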
