Deep Learning: Dealing with noisy labels

How would you use a dataset with incorrect labels if it's not possible to manually clean the dataset or gather more data?

Unlike in academia or Kaggle competitions, in real-world use cases we have to deal with noisy labels. Errors in labels can arise from a coding mistake or a change in the method of data collection. In some circumstances, such as disease diagnosis, errors occur because the true labels are hard to determine even for an expert when insufficient information is available.

This problem is often overlooked because benchmarks of high-performing deep learning models are typically based on datasets with clean labels. In many real-world applications, it is impractical to manually clean the dataset.

In this article, we will look at the options we have apart from gathering more data or manually cleaning it:

  1. Adding a noise layer on top of the base model. This noise layer learns the transition between clean labels and noisy labels. Essentially, we want the noise layer (or noise model) to overfit on the noise during training, and we remove it at test time; we do not want the base model to learn the noise. Initialization and regularization of its weights play a very important role in determining the predictive performance of the model. There are several variants of this approach. One major drawback is that when the number of classes is large, estimating the transition matrix between clean and noisy labels becomes difficult. (A minimal sketch of such a noise layer is given after this list.)
  2. Choosing the loss function: This is crucial, as studies [1, 2] have shown that the 0-1 loss is more robust to noise than the hinge, log, squared or exponential losses. Finding loss functions that are robust to label noise is an active area of research. (A toy comparison of these losses follows the list.)
  3. Creating a small dataset with clean labels. Train your model on that small dataset, then train it on the whole dataset, compare the feature vectors from the clean dataset with those from the whole dataset, and use the similarity between feature vectors to decide whether a label is correct. CleanNet [3] uses this principle. There are also a number of semi-supervised learning techniques that involve creating a small dataset with clean labels; in some of them, instances with bad labels are re-weighted by their prediction probabilities. (A similarity-based sketch appears after this list.)
  4. Ensembles with soft labels: Soft labels generally help in learning softer decision boundaries. They can reflect the similarities between semantically related classes and thus make learning more robust to overfitting and noise. Voting ensembles trained on different subsets of data with soft labels reduce the effect of label noise and help improve generalization. Boosting ensembles, however, should be avoided, as they are prone to overfitting on noise. (See the soft-label sketch after this list.)
  5. Rank pruning: In this method, we rank the data samples by their predicted probability and remove the lowest-ranked samples. The idea is that once the model has achieved decent predictive performance, the samples it still cannot learn are likely to be noise, so it should leave those samples out. This is the opposite of the idea behind boosting. It has been shown to be empirically effective at improving metrics when training on noisy data; however, it may remove useful information from the data distribution. (A pruning sketch follows the list.)
  6. Small-loss trick: Similar to the previous idea, the small-loss trick treats instances with small losses as correctly labelled instances. There has been a promising line of research based on this idea; Google's MentorNet [5] is also based on it. Sample selection bias is one of the major concerns when using it. (A per-batch selection sketch appears after this list.)
  7. Shuffling: Deep learning models are more sensitive to label noise when it is concentrated than when it is distributed across the samples. Hence, shuffling the data whenever possible helps in building a model that is robust to label noise.
  8. Decoupling: In this approach [6], two classifiers are trained with different random initializations, preferably on two different subsets of the data. Only after they reach an advanced stage of optimization are the two classifiers trained simultaneously, and they update their weights only when their predictions disagree. (A disagreement-update sketch follows the list.)
  9. Cross training: In this approach [8], two classifiers are trained simultaneously, where each model plays the role of both a teacher that generates a curriculum and a student that updates its model parameters. (A mutual-teaching sketch appears after this list.)
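
As a rough illustration of idea 1, here is a minimal PyTorch sketch (my own, not a specific published architecture) of a noise adaptation layer stacked on a base classifier. It parameterizes a row-stochastic class-transition matrix that maps the base model's clean-label probabilities to noisy-label probabilities; the near-identity initialization and the `NoisyLabelModel` wrapper are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoiseAdaptationLayer(nn.Module):
    """Learns a class-transition matrix T, where T[i, j] ~ P(noisy label = j | clean label = i)."""
    def __init__(self, num_classes, init_identity_weight=5.0):
        super().__init__()
        # Initialize close to the identity so the noisy labels are mostly trusted at the start.
        self.transition_logits = nn.Parameter(init_identity_weight * torch.eye(num_classes))

    def forward(self, clean_probs):
        T = F.softmax(self.transition_logits, dim=1)   # row-stochastic transition matrix
        return clean_probs @ T                         # P(noisy label) = P(clean label) @ T

class NoisyLabelModel(nn.Module):
    def __init__(self, base_model, num_classes):
        super().__init__()
        self.base_model = base_model                   # predicts clean-label logits
        self.noise_layer = NoiseAdaptationLayer(num_classes)

    def forward(self, x, apply_noise=True):
        clean_probs = F.softmax(self.base_model(x), dim=1)
        # During training we fit the noisy labels; at test time we drop the noise layer.
        return self.noise_layer(clean_probs) if apply_noise else clean_probs
```

During training, the output with the noise layer is fitted to the noisy labels (e.g. a negative log-likelihood loss on its log); at evaluation time, `apply_noise=False` discards the noise layer and uses the base model alone.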
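
To make the robustness claim in idea 2 concrete, the toy NumPy snippet below compares how much a single confidently misclassified (for instance, mislabeled) example contributes under the 0-1 loss versus common convex surrogates; the margin values are arbitrary.

```python
import numpy as np

# Signed margin m = y * f(x): a large negative value means a confident prediction
# that contradicts the (possibly wrong) label.
margins = np.array([2.0, 0.5, -0.5, -3.0, -8.0])

zero_one    = (margins <= 0).astype(float)     # bounded: at most 1 per example
hinge       = np.maximum(0.0, 1.0 - margins)   # grows linearly with -m
logistic    = np.log1p(np.exp(-margins))       # grows roughly linearly with -m
squared     = (1.0 - margins) ** 2             # grows quadratically
exponential = np.exp(-margins)                 # grows exponentially

for name, loss in [("0-1", zero_one), ("hinge", hinge), ("log", logistic),
                   ("squared", squared), ("exp", exponential)]:
    print(f"{name:8s}", np.round(loss, 2))
# A mislabeled-but-confident example (margin = -8) costs 1 under the 0-1 loss,
# but dominates training under the unbounded convex surrogates.
```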
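
For idea 3, here is a simplified sketch in the spirit of CleanNet [3], not the actual CleanNet architecture: class prototypes are computed from the small verified-clean subset, and a label is flagged as suspect when the sample's feature is not similar enough to the prototype of its claimed class. The feature extractor is assumed to already exist, and the 0.5 threshold is a placeholder.

```python
import numpy as np

def class_prototypes(clean_features, clean_labels, num_classes):
    """Mean feature vector per class, computed from the small verified-clean subset."""
    return np.stack([clean_features[clean_labels == c].mean(axis=0)
                     for c in range(num_classes)])

def cosine_sim(a, b):
    return (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8)

def label_is_probably_correct(features, noisy_labels, prototypes, threshold=0.5):
    """Compare each sample's feature to the prototype of its *claimed* class."""
    sims = cosine_sim(features, prototypes[noisy_labels])
    return sims >= threshold   # boolean mask; the raw similarity can also serve as a sample weight
```

The resulting mask (or the raw similarity) can be used either to drop suspect samples or to down-weight them in the loss, which corresponds to the re-weighting variant mentioned above.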
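
For idea 4, here is a minimal sketch of building soft targets by blending the (possibly noisy) one-hot label with the averaged prediction of a voting ensemble; the blending weight `alpha` and the ensemble members are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def soft_targets(models, x, noisy_y, num_classes, alpha=0.7):
    """Blend the noisy one-hot label with the averaged ensemble prediction."""
    one_hot = F.one_hot(noisy_y, num_classes).float()
    with torch.no_grad():
        avg_probs = torch.stack([F.softmax(m(x), dim=1) for m in models]).mean(dim=0)
    return alpha * one_hot + (1.0 - alpha) * avg_probs

def soft_label_loss(student_logits, targets):
    # Cross-entropy against the soft targets instead of hard labels.
    return -(targets * F.log_softmax(student_logits, dim=1)).sum(dim=1).mean()
```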
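
For idea 5, a rough sketch of rank pruning: samples are ranked by the trained model's confidence in their given label and the lowest-ranked fraction is dropped before retraining. The `prune_fraction` of 0.1 is an arbitrary illustrative value, not one taken from the rank pruning literature.

```python
import numpy as np

def rank_prune(pred_probs, given_labels, prune_fraction=0.1):
    """pred_probs: (n_samples, n_classes) probabilities from an already-trained model."""
    # Confidence of the model in each sample's *given* label.
    label_conf = pred_probs[np.arange(len(given_labels)), given_labels]
    # Drop the least-confident fraction (presumed noise) and keep the rest.
    n_drop = int(prune_fraction * len(given_labels))
    keep_idx = np.argsort(label_conf)[n_drop:]
    return np.sort(keep_idx)   # indices of samples to retrain on
```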
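
For idea 6, a per-batch sketch of the small-loss trick: only the `remember_rate` fraction of samples with the smallest loss contributes to the gradient. In practice `remember_rate` usually follows a schedule tied to the estimated noise rate, which is omitted here.

```python
import torch
import torch.nn.functional as F

def small_loss_update(model, optimizer, x, noisy_y, remember_rate=0.8):
    """One training step that backpropagates only through the small-loss samples."""
    logits = model(x)
    per_sample_loss = F.cross_entropy(logits, noisy_y, reduction="none")
    n_keep = max(1, int(remember_rate * len(noisy_y)))
    # Samples the model fits easily are assumed to be correctly labelled.
    keep_idx = torch.argsort(per_sample_loss)[:n_keep]
    loss = per_sample_loss[keep_idx].mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```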
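
For idea 8, a sketch of the decoupling update rule of [6] ("update when they disagree"): each of the two networks takes a gradient step only on the samples where their predictions differ. The warm-up phase on separate subsets is omitted.

```python
import torch
import torch.nn.functional as F

def decoupled_step(model_a, model_b, opt_a, opt_b, x, noisy_y):
    """Update both models only on the samples where their predictions disagree."""
    logits_a, logits_b = model_a(x), model_b(x)
    disagree = logits_a.argmax(dim=1) != logits_b.argmax(dim=1)
    if disagree.any():
        loss_a = F.cross_entropy(logits_a[disagree], noisy_y[disagree])
        loss_b = F.cross_entropy(logits_b[disagree], noisy_y[disagree])
        opt_a.zero_grad()
        loss_a.backward()
        opt_a.step()
        opt_b.zero_grad()
        loss_b.backward()
        opt_b.step()
```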
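
For idea 9, a simplified cross-training / co-teaching-style step in the spirit of [4, 8]: each network selects its small-loss samples and hands them to its peer as that step's curriculum. The original methods add schedules and other details that are left out here.

```python
import torch
import torch.nn.functional as F

def cross_training_step(model_a, model_b, opt_a, opt_b, x, noisy_y, remember_rate=0.8):
    """Each network is updated on the small-loss samples selected by its peer."""
    n_keep = max(1, int(remember_rate * len(noisy_y)))

    with torch.no_grad():
        loss_a = F.cross_entropy(model_a(x), noisy_y, reduction="none")
        loss_b = F.cross_entropy(model_b(x), noisy_y, reduction="none")
    idx_from_a = torch.argsort(loss_a)[:n_keep]   # A acts as teacher for B
    idx_from_b = torch.argsort(loss_b)[:n_keep]   # B acts as teacher for A

    # Each student updates on the samples its peer selected.
    opt_a.zero_grad()
    F.cross_entropy(model_a(x[idx_from_b]), noisy_y[idx_from_b]).backward()
    opt_a.step()

    opt_b.zero_grad()
    F.cross_entropy(model_b(x[idx_from_a]), noisy_y[idx_from_a]).backward()
    opt_b.step()
```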

This is currently a hot area of research, and there are many complex frameworks that achieve state-of-the-art results using combinations of the tricks mentioned above, such as Co-teaching [4], Co-teaching+ and MentorMix.

References:

[1] N. Manwani and P. Sastry, “Noise tolerance under risk minimization,” IEEE transactions on cybernetics, vol. 43, no. 3, pp. 1146–1151, 2013.

[2] G. Patrini, F. Nielsen, R. Nock, and M. Carioni, “Loss factorization, weakly supervised learning and label noise robustness,” in International conference on machine learning, 2016, pp. 708–717.

[3] Kuang-Huei Lee, Xiaodong He, Lei Zhang, and Linjun Yang. CleanNet: Transfer learning for scalable image classifier training with label noise. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5447–5456, 2018.

[4] Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., and Sugiyama, M. (2018). Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Advances in Neural Information Processing Systems, pages 8527–8537.

[5] Jiang, L., Zhou, Z., Leung, T., Li, L.-J., and Fei-Fei, L. (2018). MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning.

[6] Malach, E. and Shalev-Shwartz, S. (2017). Decoupling “when to update” from “how to update”. In Advances in Neural Information Processing Systems, pages 960–970.

[7] Wang, W. and Zhou, Z.-H. (2017). Theoretical foundation of co-training and disagreement-based algorithms. arXiv preprint arXiv:1708.04403.

[8] Z. Zhang, J. Yang, Z. Zhang and Y. Li, "Cross-Training Deep Neural Networks for Learning from Label Noise," 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 2019, pp. 4100-4104, doi: 10.1109/ICIP.2019.8803597.
