
DisturbLabel Algorithm for CNN Regularization
"Learn about the innovative DisturbLabel algorithm that focuses on regularizing CNN models on the loss layer. Explore how CNN regularization techniques like weight decay, data augmentation, and dropout play a crucial role in preventing overfitting and enhancing model performance in large-scale visual recognition tasks. Discover the significance of DisturbLabel in introducing regularization at the loss layer, a crucial aspect in the training process of deep convolutional neural networks."
Presentation Transcript
CVPR 2016
DisturbLabel: Regularizing CNN on the Loss Layer
Speaker: Lingxi Xie
Authors: Lingxi Xie, Jingdong Wang, Zhen Wei, Meng Wang, Qi Tian
Department of Statistics, University of California, Los Angeles
http://www.ucla.edu
Outline
- Introduction
- A Review of CNN Regularization
- The DisturbLabel Algorithm
- Discussions
- Experiments
- Conclusions
Introduction
- Deep Convolutional Neural Network (CNN): a hierarchical model for learning visual patterns
- Dominates the conventional Bag-of-Visual-Words models in large-scale visual recognition tasks
- Prerequisites: large-scale datasets (e.g., ImageNet) and powerful computational resources (e.g., modern GPUs)
- Regularization: an important technique for preventing over-fitting, widely adopted in the CNN training process
Outline (recap): Introduction, A Review of CNN Regularization, The DisturbLabel Algorithm, Discussions, Experiments, Conclusions
A Review of CNN Regularization
- A way of preventing over-fitting
- Typical CNN regularization methods:
  - Weight decay: constraining the parameters with $\ell_2$ regularization
  - Data augmentation: generating more training data by randomly transforming the input images
  - Dropout: randomly discarding a part of the neuron responses during training
- These methods introduce stochastic operations into training (a hedged code sketch of the three regularizers follows below)
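As a rough illustration of where these three regularizers enter a typical training setup, here is a minimal sketch assuming PyTorch and torchvision; the `SmallCNN` class and all hyperparameter values are placeholders, not taken from the presentation.

```python
# Sketch only: where weight decay, data augmentation, and dropout plug into
# a PyTorch training setup. SmallCNN and all hyperparameters are placeholders.
import torch
import torch.nn as nn
from torchvision import transforms

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.5),   # dropout: randomly discard hidden responses
            nn.Linear(64 * 8 * 8, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Data augmentation: random transformations of the input images.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

model = SmallCNN()
# Weight decay: an l2 penalty on the weights, folded into the SGD update.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
```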
A Summary of CNN Regularization

Regularization Method | Regularization Units
Weight decay          | Neuron connections (weights)
Data augmentation     | Input layer (neurons)
Dropout               | Hidden layer (neurons)
DropConnect           | Neuron connections (weights)
Stochastic Pooling    | Pooling layer (neurons)
DisturbLabel          | Loss layer (neurons)

DisturbLabel is the first work to regularize CNN on the loss layer!
Outline (recap): Introduction, A Review of CNN Regularization, The DisturbLabel Algorithm, Discussions, Experiments, Conclusions
The CNN Training Process
- Input: an image dataset $\mathcal{D} = \{(\mathbf{x}_n, \mathbf{y}_n)\}_{n=1}^{N}$, where $N$ is the dataset size, $\mathbf{x}_n$ is a data point, and $\mathbf{y}_n = [0, \ldots, 0, 1, 0, \ldots, 0]^{\top}$ is its one-hot label vector
- Input: an initialized model $\mathbb{M}: \mathbf{f}(\mathbf{x}; \boldsymbol{\theta}_0)$, where $\boldsymbol{\theta}$ denotes the model parameters (the weights of the CNN)
- Output: the trained model $\mathbb{M}: \mathbf{f}(\mathbf{x}; \boldsymbol{\theta})$
The CNN Training Process (cont.)
- In each iteration $t$, a mini-batch $\mathcal{B}_t$ is sampled from $\mathcal{D}$, and the current parameters $\boldsymbol{\theta}_{t-1}$ are updated via Stochastic Gradient Descent (SGD):
  $$\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \gamma_{t-1} \cdot \frac{1}{|\mathcal{B}_t|} \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{B}_t} \nabla_{\boldsymbol{\theta}_{t-1}} \mathcal{L}(\mathbf{x}, \mathbf{y})$$
- $\gamma_{t-1}$: the current learning rate
- $\mathcal{L}(\mathbf{x}, \mathbf{y})$: the loss function of $(\mathbf{x}, \mathbf{y})$; its gradient is computed via back-propagation (a minimal code sketch of one iteration follows below)
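For concreteness, a minimal sketch of one such SGD iteration in PyTorch (not from the presentation; the model, optimizer, and cross-entropy loss are assumed placeholders):

```python
# Sketch of one SGD iteration on a mini-batch (placeholder names).
import torch.nn.functional as F

def train_one_iteration(model, optimizer, images, labels):
    """One update: theta <- theta - gamma * mean gradient over the mini-batch."""
    optimizer.zero_grad()
    logits = model(images)                  # f(x; theta_{t-1}) for the mini-batch
    loss = F.cross_entropy(logits, labels)  # mean loss over the mini-batch
    loss.backward()                         # gradients via back-propagation
    optimizer.step()                        # apply the SGD update
    return loss.item()
```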
The DisturbLabel Algorithm
- Works on each mini-batch independently, adding an extra sampling process for each data point
- Each data point is disturbed with probability $\alpha$; $\alpha$ is named the noise rate of the algorithm
- For a disturbed datum $(\mathbf{x}_n, \mathbf{y}_n)$, a new class label $c$ is drawn uniformly from $\{1, 2, \ldots, C\}$, regardless of the true label
- The datum is changed to $(\mathbf{x}_n, \tilde{\mathbf{y}}_n)$, in which $\tilde{\mathbf{y}}_n$ depends on $c$, and is sent into the network training process (a minimal code sketch follows below)
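A minimal sketch of the disturbance step as described on this slide (my own reading, not the authors' released code): with probability $\alpha$, each label in a mini-batch is replaced by one drawn uniformly from all $C$ classes, so a "disturbed" label may coincide with the true one.

```python
# Sketch of DisturbLabel's per-mini-batch label disturbance (not the authors' code).
import torch

def disturb_labels(labels: torch.Tensor, num_classes: int, alpha: float) -> torch.Tensor:
    """With probability alpha, replace each label by one drawn uniformly from
    {0, ..., num_classes - 1}; otherwise keep the original label."""
    disturb_mask = torch.rand(labels.shape, device=labels.device) < alpha
    random_labels = torch.randint(0, num_classes, labels.shape,
                                  device=labels.device, dtype=labels.dtype)
    return torch.where(disturb_mask, random_labels, labels)
```

In the training sketch above, this would simply be applied to `labels` before computing the cross-entropy loss, e.g. `labels = disturb_labels(labels, 10, 0.1)`.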
A Toy Example of DisturbLabel
- Each mini-batch is disturbed independently
- The disturbed label may remain unchanged
- [Figure: toy example showing the labels of training samples across Batch 1, Batch 2, ..., Batch T; most labels stay correct, a few are replaced by random classes]
The Effect of the Noise Rate $\alpha$
- A proper noise rate helps improve accuracy, but introducing too much noise harms recognition
- This behaves just like the drop ratio in Dropout
DisturbLabel as a Regularizer
- DisturbLabel acts as a regularizer, improving recognition accuracy by preventing over-fitting
- Training error increases while testing error decreases
Outline (recap): Introduction, A Review of CNN Regularization, The DisturbLabel Algorithm, Discussions, Experiments, Conclusions
DisturbLabel as Model Ensemble
- Given the original dataset $\mathcal{D} = \{(\mathbf{x}_n, \mathbf{y}_n)\}_{n=1}^{N}$ and the way of disturbing labels, we generate a family of noisy datasets $\{(\mathcal{D}_m, p_m)\}_{m=1}^{M}$, where $\mathcal{D}_m$ is the $m$-th noisy dataset and $p_m$ is the probability of its presence
- The total number $M$ of possible datasets is exponentially large, so it is impossible to train an individual model for each of these sets, or to combine them at the testing stage
DisturbLabel as Model Ensemble (cont.)
- An equivalent solution is to use mini-batches: the family of all possible mini-batches is $\mathcal{B} = \{(\mathcal{B}_k, q_k)\}_{k=1}^{K}$, where $q_k$ is the probability of the presence of the $k$-th mini-batch $\mathcal{B}_k$
- A given mini-batch can be sampled from different $\mathcal{D}_m$'s
- DisturbLabel samples each mini-batch following the probability distribution over the family $\mathcal{B}$, and thus serves as an alternative way of training the same model with different data (a toy sketch follows below)
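A toy sketch (my own illustration, not from the presentation) of why the two views coincide: disturbing labels inside each sampled mini-batch induces the same distribution over disturbed mini-batches as first drawing a noisy dataset $\mathcal{D}_m$ and then drawing a mini-batch from it, because every label is disturbed independently with probability $\alpha$ either way.

```python
# Sketch: two equivalent ways of drawing a disturbed mini-batch (illustration only).
import torch

def draw_minibatch_then_disturb(labels, batch_size, num_classes, alpha):
    idx = torch.randint(0, labels.numel(), (batch_size,))  # sample a mini-batch from D
    batch_labels = labels[idx]
    mask = torch.rand(batch_size) < alpha                   # disturb inside the batch
    rand = torch.randint(0, num_classes, (batch_size,))
    return idx, torch.where(mask, rand, batch_labels)

def draw_noisy_dataset_then_minibatch(labels, batch_size, num_classes, alpha):
    mask = torch.rand(labels.numel()) < alpha               # sample a noisy dataset D_m
    rand = torch.randint(0, num_classes, labels.shape)
    noisy = torch.where(mask, rand, labels)
    idx = torch.randint(0, labels.numel(), (batch_size,))   # then sample a mini-batch
    return idx, noisy[idx]
```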
Cooperation with Dropout
- Both Dropout and DisturbLabel perform model ensemble
- Dropout: different structures trained on the same data
- DisturbLabel: the same structure trained on different data
DisturbLabel as Data Augmentation
- Given a disturbed data point $(\mathbf{x}_n, \tilde{\mathbf{y}}_n)$, its loss value is $\mathcal{L}(\mathbf{x}_n, \tilde{\mathbf{y}}_n)$
- We can generate a data point $(\tilde{\mathbf{x}}_n, \mathbf{y}_n)$ with the original class label preserved and $\mathcal{L}(\tilde{\mathbf{x}}_n, \mathbf{y}_n) \approx \mathcal{L}(\mathbf{x}_n, \tilde{\mathbf{y}}_n)$, so that the effect of $(\mathbf{x}_n, \tilde{\mathbf{y}}_n)$ is approximately equivalent to that of $(\tilde{\mathbf{x}}_n, \mathbf{y}_n)$
- $(\tilde{\mathbf{x}}_n, \mathbf{y}_n)$ can be considered an augmented datum; $\tilde{\mathbf{x}}_n$ can be computed by iterative back-propagation (a hedged sketch follows below)
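The slide does not spell out the procedure, but one way to realize "iterative back-propagation" on the input is to take a few gradient steps on $\tilde{\mathbf{x}}_n$ so that its loss under the true label approaches the loss of the disturbed pair. The following is a hedged sketch under that assumption; the step size and iteration count are arbitrary placeholders.

```python
# Sketch: estimate an "equivalent augmented image" x_tilde such that
# L(x_tilde, y) is close to L(x, y_tilde). My reconstruction of the idea;
# the step size and number of iterations are arbitrary placeholders.
import torch
import torch.nn.functional as F

def equivalent_augmented_image(model, image, true_label, target_loss,
                               steps=50, step_size=0.1):
    x_tilde = image.clone().detach().requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x_tilde.unsqueeze(0)),
                               true_label.view(1))
        gap = (loss - target_loss) ** 2   # drive L(x_tilde, y) toward L(x, y_tilde)
        gap.backward()
        with torch.no_grad():
            x_tilde -= step_size * x_tilde.grad
            x_tilde.grad.zero_()
    return x_tilde.detach()
```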
Visualizing Augmented Data
[Figure: visualizations of the equivalent augmented images for digit classes 0-9 at different training epochs, annotated with the error rate at each epoch: 1.77% (Ep. 1), 1.08% (Ep. 2), 0.97% (Ep. 5), 0.90% (Ep. 10), 0.86% (Ep. 20), and 28.97% (Ep. 10), 25.61% (Ep. 20), 24.82% (Ep. 30), 24.68% (Ep. 40), 23.33% (Ep. 60), 22.74% (Ep. 80), 22.50% (Ep. 100)]
Application: Few Training Data
- The MNIST and CIFAR10 datasets; in all classes, only 1% or 10% of the training data are preserved
- Error rates (%):

Setting                     | MNIST | CIFAR10
1% data, w/o DisturbLabel   | 10.92 | 43.29
1% data, w/ DisturbLabel    |  6.38 | 37.83
10% data, w/o DisturbLabel  |  2.83 | 27.21
10% data, w/ DisturbLabel   |  1.89 | 24.37
100% data, w/o DisturbLabel |  0.86 | 22.50
100% data, w/ DisturbLabel  |  0.66 | 20.26
Application: Imbalanced Training Data
- The MNIST and CIFAR10 datasets; except for the first class, only 1% or 10% of the training data are preserved
- Error rates (%):

Setting                     | MNIST overall | MNIST first class | CIFAR10 overall | CIFAR10 first class
1% data, w/o DisturbLabel   | 9.31 | 0.28 | 42.01 | 11.48
1% data, w/ DisturbLabel    | 6.29 | 2.35 | 36.92 | 24.30
10% data, w/o DisturbLabel  | 2.78 | 0.47 | 26.50 | 13.09
10% data, w/ DisturbLabel   | 1.76 | 1.46 | 24.03 | 18.19
100% data, w/o DisturbLabel | 0.86 | 0.89 | 22.50 | 22.41
100% data, w/ DisturbLabel  | 0.66 | 0.71 | 20.26 | 20.29
Difference from Using Soft Labels
- Using soft labels for training: for each training sample $(\mathbf{x}_n, \mathbf{y}_n)$, change $\mathbf{y}_n$ to the soft form $\bar{\mathbf{y}}_n = [\frac{\alpha}{C}, \ldots, 1 - \frac{(C-1)\alpha}{C}, \ldots, \frac{\alpha}{C}]^{\top}$, where $\alpha$ is the noise rate in DisturbLabel and the large entry sits at the true class
- Relationship to using a disturbed label $\tilde{\mathbf{y}}_n$: the soft label $\bar{\mathbf{y}}_n$ is deterministic, while $\tilde{\mathbf{y}}_n$ is stochastic, with $\mathbb{E}[\tilde{\mathbf{y}}_n] = \bar{\mathbf{y}}_n$ (a small numerical check follows below)
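To make the relationship concrete, here is a small numerical check (my own illustration, not from the presentation) that the soft label equals the expectation of the disturbed one-hot label when the disturbance rule from the previous slides is simulated:

```python
# Sketch: the soft label is the expectation of DisturbLabel's one-hot label.
import torch

def soft_label(true_class: int, num_classes: int, alpha: float) -> torch.Tensor:
    y = torch.full((num_classes,), alpha / num_classes)
    y[true_class] = 1.0 - alpha + alpha / num_classes   # = 1 - (C-1)*alpha/C
    return y

def empirical_mean_disturbed(true_class: int, num_classes: int, alpha: float,
                             trials: int = 200_000) -> torch.Tensor:
    labels = torch.full((trials,), true_class, dtype=torch.long)
    mask = torch.rand(trials) < alpha
    random_labels = torch.randint(0, num_classes, (trials,))
    disturbed = torch.where(mask, random_labels, labels)
    return torch.nn.functional.one_hot(disturbed, num_classes).float().mean(dim=0)

# With C = 10 and alpha = 0.1: true class ~0.91, every other class ~0.01.
print(soft_label(3, 10, 0.1))
print(empirical_mean_disturbed(3, 10, 0.1))
```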
Results of Using Soft Labels
- Using soft labels does not produce better results
- Reason: the soft label is deterministic and therefore cannot provide the stochastic regularization that DisturbLabel does
Outline (recap): Introduction, A Review of CNN Regularization, The DisturbLabel Algorithm, Discussions, Experiments, Conclusions
MNIST Experiments
- The MNIST dataset: handwritten digit recognition (10 classes); 60,000 training and 10,000 testing images
- The network structures: a 2-layer LeNet (input size: 28×28) and a 5-layer BigNet (input size: 24×24)
- DisturbLabel with $\alpha = 10\%$
SVHN Experiments
- The SVHN dataset: street view digit recognition (10 classes); 73,257 training, 26,032 testing, and 531,131 extra images (598,388 training images after pre-processing)
- The network structures: a 3-layer LeNet (input size: 32×32) and a 5-layer BigNet (input size: 24×24)
- DisturbLabel with $\alpha = 10\%$
MNIST and SVHN Results
Error rates (%), without and with data augmentation (DA).

Published results:
- Wan, ICML 13: MNIST 0.52 (w/o DA), 0.21 (w/ DA); SVHN 1.94
- Zeiler, ICLR 13: MNIST 0.47; SVHN 2.80
- Goodfellow, ICML 14: MNIST 0.45; SVHN 2.47
- Lin, ICLR 14: MNIST 0.47; SVHN 2.35
- Lee, AISTATS 15: MNIST 0.39; SVHN 1.92
- Liang, CVPR 15: MNIST 0.31; SVHN 1.77

This paper's results:

Method                 | MNIST w/o DA | MNIST w/ DA | SVHN w/o DA | SVHN w/ DA
LeNet, no regul.       | 0.86 | 0.48 | 3.93 | 3.48
LeNet, + Dropout       | 0.68 | 0.43 | 3.65 | 3.25
LeNet, + DisturbLabel  | 0.66 | 0.45 | 3.69 | 3.27
LeNet, + both regul.   | 0.63 | 0.41 | 3.61 | 3.21
BigNet, no regul.      | 0.69 | 0.39 | 2.87 | 2.35
BigNet, + Dropout      | 0.36 | 0.29 | 2.23 | 2.08
BigNet, + DisturbLabel | 0.38 | 0.32 | 2.28 | 2.21
BigNet, + both regul.  | 0.33 | 0.28 | 2.19 | 2.02
CIFAR Experiments
- The CIFAR10/CIFAR100 datasets: low-resolution natural images (10 or 100 classes); 50,000 training and 10,000 testing images, uniformly distributed over the 10/100 classes
- The network structures: a 3-layer LeNet (input size: 32×32) and a 5-layer BigNet (input size: 24×24)
- DisturbLabel with $\alpha = 10\%$
CIFAR Results
Error rates (%), without and with data augmentation (DA).

Published results:
- Wan, ICML 13: CIFAR10 9.32
- Zeiler, ICLR 13: CIFAR10 15.13; CIFAR100 42.51
- Goodfellow, ICML 14: CIFAR10 11.68 (w/o DA), 9.38 (w/ DA); CIFAR100 38.57
- Lin, ICLR 14: CIFAR10 10.41 (w/o DA), 8.81 (w/ DA); CIFAR100 35.68
- Lee, AISTATS 15: CIFAR10 9.69 (w/o DA), 7.97 (w/ DA); CIFAR100 34.57
- Liang, CVPR 15: CIFAR10 8.69 (w/o DA), 7.09 (w/ DA); CIFAR100 31.75

This paper's results:

Method                 | CIFAR10 w/o DA | CIFAR10 w/ DA | CIFAR100 w/o DA | CIFAR100 w/ DA
LeNet, no regul.       | 22.50 | 15.76 | 56.72 | 43.31
LeNet, + Dropout       | 19.42 | 14.24 | 49.08 | 41.28
LeNet, + DisturbLabel  | 20.26 | 14.48 | 51.83 | 41.84
LeNet, + both regul.   | 19.18 | 13.98 | 48.72 | 40.98
BigNet, no regul.      | 11.23 |  9.29 | 39.54 | 33.59
BigNet, + Dropout      |  9.69 |  7.08 | 33.30 | 27.05
BigNet, + DisturbLabel |  9.82 |  7.93 | 34.81 | 28.39
BigNet, + both regul.  |  9.45 |  6.98 | 32.99 | 26.63
ILSVRC2012 Experiments
- The ILSVRC2012 dataset: high-resolution natural images (1,000 classes); 1.3M training and 50K validation images, almost uniformly distributed over all classes
- The network structure: AlexNet, with 5 convolution layers, 3 pooling layers, and 3 fully-connected layers
ILSVRC2012 Results
Error rates (%) on ILSVRC2012:

Method                 | top-1 | top-5
AlexNet, + Dropout     | 43.1  | 19.9
AlexNet, + both regul. | 42.8  | 19.7
Outline (recap): Introduction, A Review of CNN Regularization, The DisturbLabel Algorithm, Discussions, Experiments, Conclusions
Conclusions
- Regularization is an important technique to prevent over-fitting in network training
- DisturbLabel regularizes CNN on the loss layer
- DisturbLabel is very simple to implement
- DisturbLabel works well in a wide range of tasks
- DisturbLabel can be interpreted as an implicit way of model ensemble and/or data augmentation
Thank you! Questions please?