Data Parameters: A New Family of Parameters for Learning a Differentiable Curriculum

Shreyas Saxena, Oncel Tuzel, Dennis DeCoste (Apple)

Abstract

Recent works have shown that learning from easier instances first can help deep neural networks (DNNs) generalize better. However, knowing which data to present during different stages of training is a challenging problem. In this work, we address this problem by introducing data parameters. More specifically, we equip each sample and class in a dataset with a learnable parameter (data parameters), which governs their importance in the learning process. During training, at each iteration, as we update the model parameters, we also update the data parameters. These updates are done by gradient descent and do not require hand-crafted rules or design. When applied to the image classification task on the CIFAR10, CIFAR100, WebVision and ImageNet datasets, and the object detection task on the KITTI dataset, learning a dynamic curriculum via data parameters leads to consistent gains, without any increase in model complexity or training time. When applied to a noisy dataset, the proposed method learns to learn from clean images and improves over the state-of-the-art methods by 14%. To the best of our knowledge, our work is the first curriculum learning method to show gains on large-scale image classification and detection tasks.
1 Introduction

Curriculum learning [1, 7, 12, 17, 35] has garnered a lot of attention in the field of machine learning. It draws inspiration from the learning principles underlying the cognitive process of humans and animals, which starts by learning easier concepts and then gradually transitions to learning more complex concepts. Existing work has shown that, with the help of this paradigm, DNNs can achieve better generalization [1, 2, 15].

The key to applying curriculum learning to different problems is to come up with a ranking function that assigns learning priorities to the training samples. A sample with a higher priority is supposed to be learned earlier than a sample with a lower priority. For the majority of early work in curriculum learning, the curriculum is provided by a pre-determined heuristic. For instance, for the task of classifying shapes [1], shapes which had less variation were assigned a higher priority. In [29], the authors approached grammar induction, where short sentences were assigned higher priority. The main issues which limit the application of this approach are: (1) for many complex problems, it is not trivial to define what the easy examples or subtasks are; (2) in cases where humans can design a curriculum, it is assumed that the difficulty of learning a sample for humans correlates with the difficulty of learning the sample for a learning algorithm; and (3) even if one could define the curriculum, the pre-determined curriculum might not be appropriate at all learning stages of the dynamically learned model.

Learning a curriculum in an automatic manner is a hard task, since the ease or difficulty of an example is relative to the current state of the model. In order to overcome these issues, in this work we introduce a new family of parameters for DNNs termed data parameters. More specifically, each class and data point has its own data parameter, governing its importance in the learning process. During learning, at every iteration, as we update the standard model parameters, we also update the data parameters using stochastic gradient descent.
Learning data parameters for classes and instances leads to a dynamic and differentiable curriculum, without any need of human intervention.

The main contributions of our work are:

1. We introduce a new class of parameters termed data parameters for every class and data point in the dataset. We show that data parameters can be learned using gradient descent, and doing so amounts to learning a dynamic and differentiable curriculum. In our formulation, data parameters are involved only during training, and hence do not affect model complexity at inference.

2. We show that for image classification and object detection tasks, learning a curriculum for CNNs improves over the baseline by prioritizing classes and their instances. To the best of our knowledge, our paper is the first curriculum learning method to show gains on large-scale image classification tasks (ImageNet [5]) and on an object detection task (KITTI [9]).

3. We show that in the presence of noisy labels, the learnt curriculum prioritizes learning from clean labels. In doing so, our method outperforms the state of the art by a significant margin.

4. We show that when presented with random labels, in comparison to a baseline DNN which memorizes the data, the learned curriculum resists memorizing corrupt data.
2 Learning a Dynamic Curriculum via Data Parameters

As suggested earlier, the main intuition behind our idea is simple: each class and data point in the training set has a parameter associated with it, which weighs the contribution of that class or data point in the gradient update of the model parameters. In contrast to existing works which set these parameters with a heuristic, in our work these parameters are learnable and are learnt along with the model parameters. Unlike model parameters, which are involved during both training and inference, data parameters are only involved during training. Therefore, using data parameters during training does not affect the model complexity and run-time at inference. In the next section we formalize this intuition for the class-level curriculum, followed by the instance-level curriculum.

2.1 Learning curriculum over classes

We first describe learning a dynamic curriculum over classes, where the contribution of each sample to the model learning is determined by its class. This curriculum favors learning from easier classes at the earlier stages of training. The curriculum over classes is dynamic and is controlled by the class-level data parameters, which are also updated via the training process. In what follows, we will refer to class-level data parameters as class parameters.
Let $\{x_i, y_i\}_{i=1}^{N}$ denote the data, where $x_i \in \mathbb{R}^d$ denotes a single data point and $y_i \in \{1, \dots, k\}$ denotes its target label. Let $\sigma^{class} \in \mathbb{R}^k$ denote the class parameters for the classes in the dataset. We denote the neural network function mapping the input data $x_i$ to logits $z_i \in \mathbb{R}^k$ as $z_i = f_{\theta}(x_i)$, where $\theta$ are the model parameters. During training, we pass the input sample $x_i$ through the DNN and compute its corresponding logits $z_i$, but instead of computing the softmax directly on the logits, we scale the logits of the instance with the parameter corresponding to the target class, $\sigma^{class}_{y_i}$. Note that scaling the logits with the parameter of the target class can be interpreted as a temperature scaling of the logits. The cross-entropy loss for a data point $x_i$ can then be written as

$$L_i = -\log(p_i^{y_i}), \qquad p_i^{y_i} = \frac{\exp(z_i^{y_i} / \sigma^{class}_{y_i})}{\sum_j \exp(z_i^{j} / \sigma^{class}_{y_i})} \qquad (1)$$

where $p_i^{y_i}$, $z_i^{y_i}$ and $\sigma^{class}_{y_i}$ denote the probability, logit and parameter of the target class $y_i$ for data point $x_i$, respectively. If we set all class parameters to one, i.e. $\sigma^{class}_j = 1,\ j = 1 \dots k$, we recover the standard cross-entropy loss. During training we solve

$$\min_{\theta,\ \sigma^{class}} \frac{1}{N} \sum_{i=1}^{N} L_i \qquad (2)$$

where, in addition to the model parameters $\theta$, we also optimize the class-level parameters $\sigma^{class}$. The gradient of the loss with respect to the logits is given by

$$\frac{\partial L_i}{\partial z_i^{j}} = \frac{p_i^{j} - \mathbb{1}(j = y_i)}{\sigma^{class}_{y_i}} \qquad (3)$$

where $\mathbb{1}(j = y_i)$ takes value 1 when $j = y_i$ and value 0 otherwise. The gradient of the loss with respect to the parameter of the target class is given by

$$\frac{\partial L_i}{\partial \sigma^{class}_{y_i}} = \frac{(1 - p_i^{y_i})}{(\sigma^{class}_{y_i})^2} \Big( z_i^{y_i} - \sum_{j \neq y_i} q_i^{j} z_i^{j} \Big) \qquad (4)$$

where $q_i^{j} = \frac{p_i^{j}}{1 - p_i^{y_i}}$ is the probability distribution over the non-target classes (indexed by $j$, with $j \neq y_i$).
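To make the formulation concrete, the following is a minimal PyTorch-style sketch of the scaled cross-entropy of equation (1): the only change relative to standard cross-entropy is that each sample's logits are divided by the data parameter of its target class. The class name ClassCurriculumLoss and the direct storage of sigma are our illustrative choices, not the authors' released code; Section 3.1 describes the log-parameterization actually used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassCurriculumLoss(nn.Module):
    """Cross-entropy with class-level data parameters (equation (1)).

    Illustrative sketch: sigma is stored directly; the paper optimizes
    log(sigma) instead (see Section 3.1).
    """
    def __init__(self, num_classes):
        super().__init__()
        # One temperature-like data parameter per class, initialized to 1,
        # which recovers the standard softmax.
        self.sigma_class = nn.Parameter(torch.ones(num_classes))

    def forward(self, logits, target):
        # Scale each sample's logits by the parameter of its *target* class.
        sigma = self.sigma_class[target]               # shape: (batch,)
        scaled_logits = logits / sigma.unsqueeze(1)    # shape: (batch, k)
        return F.cross_entropy(scaled_logits, target)
```

Calling backward() on this loss populates gradients for the model parameters (equation (3), via the chain rule) and for sigma_class (equation (4)) in one pass, so both sets of parameters can be updated at every iteration.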
Effect of class parameters on learning: The class parameters are updated with the negative of the gradient given in equation (4): the parameter of the target class, $\sigma^{class}_{y_i}$, will increase if the logit of the target class is less than the expected value of the logits over the non-target classes (i.e. $z_i^{y_i} < \sum_{j \neq y_i} q_i^{j} z_i^{j}$), and vice versa. Therefore, during the course of learning, if data points of a certain class are being misclassified, the gradient update on the class parameters gradually increases the parameter associated with this class. Increasing the class parameter flattens the curvature of the loss function for instances of that class, thereby decaying the gradients w.r.t. the logits (see equation (3)). Decreasing the class parameter has the inverse effect, and accelerates learning.
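This sign behaviour is easy to verify numerically with autograd; the snippet below (ours, not from the paper) uses a target logit that lies below the weighted average of the non-target logits, so the gradient of the loss with respect to sigma is negative and a gradient-descent step increases the class parameter.

```python
import torch
import torch.nn.functional as F

z = torch.tensor([[1.0, 3.0, 0.5]])              # logits; target class 0 is poorly ranked
sigma = torch.tensor(1.0, requires_grad=True)    # class parameter of the target class
loss = F.cross_entropy(z / sigma, torch.tensor([0]))
loss.backward()
print(sigma.grad)   # roughly -1.61: negative, so a descent step increases sigma
```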
2.2 Learning curriculum over instances

In the previous section, we detailed how we can learn a dynamic curriculum over the classes of a dataset. A natural extension of this framework is to have a dynamic curriculum over the instances in the dataset. In this case, in equation (1), rather than having a class parameter $\sigma^{class}_j$ for each class, $j \in \{1, \dots, k\}$, we can have an instance parameter $\sigma^{inst}_i$ for each sample present in the dataset, $i \in \{1, \dots, N\}$. This parameterization helps us learn a curriculum over the instances of a class, which is useful when instances within a class have different levels of difficulty. For instance, consider the task of classifying images of an object. In some instances, the object could be fully visible (easy), while in others it could be occluded by other objects (hard). Another task is learning with noisy/corrupt labels. In this setting, the labels of some instances would be consistent with the input (easy), while the labels of other instances would not be consistent (hard).
In our experiments, we show that the curriculum learned over instances learns to ignore the noisy samples.

We can also learn a joint curriculum over classes and instances to have the benefits of both. In this case, during training, the parameter $\sigma^{*}_i$ for a data point $x_i$ is set as the sum of its target's class parameter and its own instance parameter, i.e. $\sigma^{*}_i = \sigma^{class}_{y_i} + \sigma^{inst}_i$. In this setting, the gradient of the loss with respect to the logits (as in equation (3)) can be expressed as $\frac{\partial L_i}{\partial z_i^{j}} = \frac{p_i^{j} - \mathbb{1}(j = y_i)}{\sigma^{*}_i}$. Since the effective parameter of an instance is formed by the addition of the class- and instance-level parameters, the gradient for these parameters for a data point $x_i$ is the same and is given by

$$\frac{\partial L_i}{\partial \sigma^{class}_{y_i}} = \frac{\partial L_i}{\partial \sigma^{inst}_{i}} = \frac{(1 - p_i^{y_i})}{(\sigma^{*}_i)^2} \Big( z_i^{y_i} - \sum_{j \neq y_i} q_i^{j} z_i^{j} \Big) \qquad (5)$$

Note, however, that during training instance parameters collect their gradient from individual samples (when sampled in a mini-batch), while class parameters average the gradient from all samples of the class present in a mini-batch.
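The joint curriculum only changes how the per-sample temperature is formed; a hedged sketch extending the class-level loss above is given below (names and initialization constants are ours). Both data parameters enter through the sum sigma*, so autograd gives them the identical per-sample gradient of equation (5); reproducing the per-class averaging described above would require a custom gradient reduction that is not shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointCurriculumLoss(nn.Module):
    """Cross-entropy with sigma*_i = sigma_class[y_i] + sigma_inst[i] (Section 2.2).

    Illustrative sketch only; the paper optimizes log-parameterized values
    with the initialization described in Section 3.1.
    """
    def __init__(self, num_classes, num_samples):
        super().__init__()
        self.sigma_class = nn.Parameter(torch.ones(num_classes))
        self.sigma_inst = nn.Parameter(torch.full((num_samples,), 1e-3))

    def forward(self, logits, target, sample_idx):
        # Effective per-sample data parameter: class term plus instance term.
        sigma_star = self.sigma_class[target] + self.sigma_inst[sample_idx]
        scaled_logits = logits / sigma_star.unsqueeze(1)
        return F.cross_entropy(scaled_logits, target)
```

The data loader must therefore return each sample's index in the training set so that its instance parameter can be looked up.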
Inference with data parameters: As explained earlier, during training we modify the logits of a sample with data parameters (class or instance parameters). During inference, we do not have data parameters for the test set, and hence do not scale the logits with a data parameter. Not scaling the logits has no effect on the argmax of the softmax, but the classification probability is uncalibrated. If one is interested in calibrated output, calibration can be done on a held-out validation set [11]. Note that this modus operandi of not scaling the logits at inference maintains our claim: the use of data parameters does not affect the model's capacity and run-time at inference.
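Dropping the scaling at test time is safe for top-1 prediction because dividing the logits by a positive scalar is a monotone transformation, as the following quick check (ours) illustrates:

```python
import torch

logits = torch.randn(4, 10)
sigma = 1.7  # any positive data-parameter value
assert torch.equal(logits.argmax(dim=1), (logits / sigma).argmax(dim=1))
```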
3 Experimental evaluation

In this section we first describe the implementation details of our method. Next, we show results of our method when applied to the tasks of image classification and detection. After that, we evaluate our dynamic curriculum framework in the presence of noisy labels. Finally, we show that our framework, when all labels are random, acts as a strong regularizer and resists memorization. Note that, since our method modifies the logits at the very end of the forward pass, the gains reported below come without any additional computational overhead during training.

3.1 Implementation details

Optimizing data parameters with gradient descent requires constrained optimization, with the constraint that $\sigma$ remain positive. Instead, we choose to optimize the log parameterization $\log(\sigma)$, which can be mapped back using the exponential mapping. The exponential mapping resolves the log parameterization to the positive domain, and allows us to perform unconstrained optimization. In our loss function, in addition to the standard L2 regularizer on the model parameters, $\|\theta\|^2$, we also have L2 regularization on the data parameters, $\|\log(\sigma^{class})\|^2$ and $\|\log(\sigma^{inst})\|^2$, with their contribution controlled by a weight-decay parameter. This regularizer favors the original softmax formulation with $\sigma = 1$, and prevents the data parameters from taking very high values.
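In code, this amounts to storing log(sigma) as the learnable tensor and exponentiating wherever sigma is needed; ordinary weight decay on the stored tensor then penalizes ||log(sigma)||^2, pulling sigma toward 1 as described above. A small sketch under these assumptions (names and the class count are ours):

```python
import torch
import torch.nn as nn

num_classes = 1000  # illustrative size

# Store log(sigma) so optimization is unconstrained; log(1) = 0 gives sigma = 1.
log_sigma_class = nn.Parameter(torch.zeros(num_classes))

def class_sigma(target):
    # The exponential mapping keeps sigma strictly positive.
    return log_sigma_class.exp()[target]

# Applying standard L2 weight decay to log_sigma_class penalizes
# ||log(sigma_class)||^2, which favors the plain softmax (sigma = 1) and keeps
# the data parameters from growing very large.
```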
Unless stated otherwise, the following implementation details hold for all our experiments. For all numbers reported in this paper, we report the mean and standard deviation over 3 runs. We learn the class and instance parameters using stochastic gradient descent (SGD). Class and instance parameters are initialized with $\sigma = 1$ and optimized using gradient descent with momentum 0.9. When learning a joint curriculum over classes and instances, class parameters are initialized as 1 and instance parameters are initialized as 0.001. This ensures that the sum of the two parameters is approximately 1, thereby recovering the original softmax formulation at the start of training. For both sets of parameters we use separate optimizers with their respective learning rates and weight decays. The learning rate and weight decay for class parameters are set to 0.1 and 5e-4 (same as for the model parameters of the DNN). The learning rate and weight decay for instance parameters vary depending upon the task, and are set using the validation set. When a class or instance is not present in a mini-batch, we do not update the momentum buffer associated with the data parameter of that class or instance, respectively.
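Putting these details together, one training iteration can be sketched as below. The toy model, sizes and helper names are ours; the class-level learning rate and weight decay follow the values just given, while the instance-level values are task dependent. Note the caveat in the comments: a plain SGD step also touches the momentum buffers (and applies weight decay) of classes and instances absent from the mini-batch, so a faithful implementation needs a custom or sparse optimizer step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative setup (ours): a toy model plus log-parameterized data parameters.
num_classes, num_samples = 100, 50_000
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, num_classes))
log_sigma_class = nn.Parameter(torch.zeros(num_classes))           # sigma = 1
log_sigma_inst = nn.Parameter(torch.full((num_samples,), -6.9))    # sigma ~ 0.001

# Separate optimizers so the data parameters keep their own learning rate and
# weight decay, independent of the model parameters.
model_opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
class_opt = torch.optim.SGD([log_sigma_class], lr=0.1, momentum=0.9, weight_decay=5e-4)
inst_opt = torch.optim.SGD([log_sigma_inst], lr=0.1, momentum=0.9, weight_decay=5e-4)

def train_step(images, labels, idx):
    """One iteration: model and data parameters are updated together."""
    logits = model(images)
    sigma = log_sigma_class.exp()[labels] + log_sigma_inst.exp()[idx]
    loss = F.cross_entropy(logits / sigma.unsqueeze(1), labels)

    for opt in (model_opt, class_opt, inst_opt):
        opt.zero_grad()
    loss.backward()
    model_opt.step()
    # Caveat: plain SGD updates every entry's momentum buffer and applies
    # weight decay even to classes/instances not in this mini-batch, whereas
    # the paper skips those momentum-buffer updates.
    class_opt.step()
    inst_opt.step()
    return loss.item()
```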
3.2 Learning a curriculum for image classification

In this section we demonstrate the efficacy of our method when applied to the task of image classification. We evaluate our dynamic curriculum learning framework on the CIFAR100 [18] and ImageNet 2012 classification [5] datasets.

The CIFAR100 dataset contains 100 classes, with 50,000 images in the training set and 10,000 images in the test set. We evaluate our framework on CIFAR100 with WideResNet (depth: 28, widening factor: 10, dropout: 0) [38]. We first reproduce the results for WideResNet by setting the mini-batch size, optimizer and learning-rate schedule identical to the original paper [38] and report the numbers in Table 1. (The authors of [38] report the median of 5 runs; we reimplement their method and report the mean and standard deviation over three runs.)

The ImageNet dataset contains 1000 classes, with 1.28 million training samples. We report top-1 accuracy on the validation set, which consists of 50,000 images. We evaluate our framework with ResNet18 [14], using the implementation from PyTorch's website (/pytorch/examples/tree/master/imagenet). As per the standard settings, we train the models for a total of 100 epochs, with a learning-rate decay of 0.1 every 30 epochs, i.e. at epochs 30, 60, and 90. The weight decay for class- and instance-level data parameters is set to 1e-4 (same as for the model parameters). The learning rate for class- and instance-level data parameters is set to 0.1 and 0.8, respectively.

[Figure 1. Left: Class-level dynamic curriculum on CIFAR100; the curriculum learnt over classes is dynamic in nature and adapts itself to different classes. Right: Instance-level dynamic curriculum on ImageNet (data-parameter value vs. epochs); two instances of the same class, as per their difficulty, are learnt at different points during training.]
We report results for ImageNet and CIFAR100 in Table 1. As seen from the table, on the CIFAR100 dataset, learning a curriculum over classes and instances leads to a statistically significant gain of 0.7% over the baseline for WideResNet. On the ImageNet dataset, using a dynamic curriculum translates to a gain of 0.7% over the baseline. In the table, we show that learning a dynamic curriculum over classes alone performs better than the baseline, but suffers a degradation of 0.2% in accuracy compared with the class- and instance-level curriculum. This highlights the importance of using a curriculum over instances, and validates our hypothesis: instances within a class have varying levels of difficulty, and learning the order within a class is important. In Figure 1 (right), we plot the data parameter for two instances of the same class as it evolves during training. The two instances are learnt at different points during training, as per their difficulty. For a description of the experiments on the WebVision dataset, see Section 3.4.

Comparison with the state of the art: To the best of our knowledge, we are the first work to report gains on the ImageNet dataset due to curriculum learning. There are existing works which report results of curriculum learning on the CIFAR100 dataset, but a direct comparison is not possible, since these works report results in different settings. Nevertheless, we report key results from the existing state of the art below: [2] proposes a curriculum learning framework where the sampling of data (the curriculum) for SGD is based on a lightweight estimate of sample uncertainty; with ResNet27 they obtain an improvement of 0.4% in accuracy. Inspired by the recent work Learning to Teach [7], [36] proposes an extension where the teacher dynamically alters the loss function for the student