In mathematical optimization and decision theory, a loss function or cost function is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost" associated with the event. An optimization problem seeks to minimize a loss function. An objective function is either a loss function or its negative (in specific domains, variously called a reward function, a profit function, a utility function, a fitness function, etc.), in which case it is to be maximized.
In statistics, typically a loss function is used for parameter estimation, and the event in question is some function of the difference between estimated and true values for an instance of data. The concept, as old as Laplace, was reintroduced in statistics by Abraham Wald in the middle of the 20th century.^{[1]} In the context of economics, for example, this is usually economic cost or regret. In classification, it is the penalty for an incorrect classification of an example. In actuarial science, it is used in an insurance context to model benefits paid over premiums, particularly since the works of Harald Cramér in the 1920s.^{[2]} In optimal control, the loss is the penalty for failing to achieve a desired value. In financial risk management, the function is mapped to a monetary loss.
In classical statistics (both frequentist and Bayesian), a loss function is typically treated as something of a background mathematical convention. Critics such as W. Edwards Deming and Nassim Nicholas Taleb have argued that loss functions require much greater attention than they have traditionally been given and that loss functions used in real world decision making need to reflect actual empirical experience. They argue that realworld loss functions are often very different from the smooth, symmetric ones used by classical convention, and are often highly asymmetric, nonlinear, and discontinuous.
YouTube Encyclopedic

1/5Views:205 6589786 59810 883128 681

✪ Lecture 3  Loss Functions and Optimization

✪ A Cost Function for SimilarityBased Hierarchical Clustering

✪ 4.Logistic Regression – Log Loss Function

✪ (ML 11.3) Frequentist risk, Bayesian expected loss, and Bayes risk

✪ Lecture 10  Recurrent Neural Networks
Transcription
 Okay so welcome to CS 231N Lecture three. Today we're going to talk about loss functions and optimization but as usual, before we get to the main content of the lecture, there's a couple administrative things to talk about. So the first thing is that assignment one has been released. You can find the link up on the website. And since we were a little bit late in getting this assignment out to you guys, we've decided to change the due date to Thursday, April 20th at 11:59 p.m., this will give you a full two weeks from the assignment release date to go and actually finish and work on it, so we'll update the syllabus for this new due date in a little bit later today. And as a reminder, when you complete the assignment, you should go turn in the final zip file on Canvas so we can grade it and get your grades back as quickly as possible. So the next thing is always check out Piazza for interesting administrative stuff. So this week I wanted to highlight that we have several example project ideas as a pinned post on Piazza. So we went out and solicited example of project ideas from various people in the Stanford community or affiliated to Stanford, and they came up with some interesting suggestions for projects that they might want students in the class to work on. So check out this pinned post on Piazza and if you want to work on any of these projects, then feel free to contact the project mentors directly about these things. Aditionally we posted office hours on the course website, this is a Google calendar, so this is something that people have been asking about and now it's up there. The final administrative note is about Google Cloud, as a reminder, because we're supported by Google Cloud in this class, we're able to give each of you an additional $100 credit for Google Cloud to work on your assignments and projects, and the exact details of how to redeem that credit will go out later today, most likely on Piazza. So if there's, I guess if there's no questions about administrative stuff then we'll move on to course content. Okay cool. So recall from last time in lecture two, we were really talking about the challenges of recognition and trying to hone in on this idea of a datadriven approach. We talked about this idea of image classification, talked about why it's hard, there's this semantic gap between the giant grid of numbers that the computer sees and the actual image that you see. We talked about various challenges regarding this around illumination, deformation, et cetera, and why this is actually a really, really hard problem even though it's super easy for people to do with their human eyes and human visual system. Then also recall last time we talked about the knearest neighbor classifier as kind of a simple introduction to this whole datadriven mindset. We talked about the CIFAR10 data set where you can see an example of these images on the upper left here, where CIFAR10 gives you these 10 different categories, airplane, automobile, whatnot, and we talked about how the knearest neighbor classifier can be used to learn decision boundaries to separate these data points into classes based on the training data. This also led us to a discussion of the idea of cross validation and setting hyper parameters by dividing your data into train, validation and test sets. Then also recall last time we talked about linear classification as the first sort of building block as we move toward neural networks. Recall that the linear classifier is an example of a parametric classifier where all of our knowledge about the training data gets summarized into this parameter matrix W that is set during the process of training. And this linear classifier recall is super simple, where we're going to take the image and stretch it out into a long vector. So here the image is x and then we take that image which might be 32 by 32 by 3 pixels, stretch it out into a long column vector of 32 times 32 times 3 entries, where the 32 and 32 are the height and width, and the 3 give you the three color channels, red, green, blue. Then there exists some parameter matrix, W which will take this long column vector representing the image pixels, and convert this and give you 10 numbers giving scores for each of the 10 classes in the case of CIFAR10. Where we kind of had this interpretation where larger values of those scores, so a larger value for the cat class means the classifier thinks that the cat is more likely for that image, and lower values for maybe the dog or car class indicate lower probabilities of those classes being present in the image. Also, so I think this point was a little bit unclear last time that linear classification has this interpretation as learning templates per class, where if you look at the diagram on the lower left, you think that, so for every pixel in the image, and for every one of our 10 classes, there exists some entry in this matrix W, telling us how much does that pixel influence that class. So that means that each of these rows in the matrix W ends up corresponding to a template for the class. And if we take those rows and unravel, so each of those rows again corresponds to a weighting between the values of, between the pixel values of the image and that class, so if we take that row and unravel it back into an image, then we can visualize the learned template for each of these classes. We also had this interpretation of linear classification as learning linear decision boundaries between pixels in some high dimensional space where the dimensions of the space correspond to the values of the pixel intensity values of the image. So this is kind of where we left off last time. And so where we kind of stopped, where we ended up last time is we got this idea of a linear classifier, and we didn't talk about how to actually choose the W. How to actually use the training data to determine which value of W should be best. So kind of where we stopped off at is that for some setting of W, we can use this W to come up with 10 with our class scores for any image. So and some of these class scores might be better or worse. So here in this simple example, we've shown maybe just a training data set of three images along with the 10 class scores predicted for some value of W for those images. And you can see that some of these scores are better or worse than others. So for example in the image on the left, if you look up, it's actually a cat because you're a human and you can tell these things, but if we look at the assigned probabilities, cat, well not probabilities but scores, then the classifier maybe for this setting of W gave the cat class a score of 2.9 for this image, whereas the frog class gave 3.78. So maybe the classifier is not doing not so good on this image, that's bad, we wanted the true class to be actually the highest class score, whereas for some of these other examples, like the car for example, you see that the automobile class has a score of six which is much higher than any of the others, so that's good. And the frog, the predicted scores are maybe negative four, which is much lower than all the other ones, so that's actually bad. So this is kind of a hand wavy approach, just kind of looking at the scores and eyeballing which ones are good and which ones are bad. But to actually write algorithms about these things and to actually to determine automatically which W will be best, we need some way to quantify the badness of any particular W. And that's this function that takes in a W, looks at the scores and then tells us how bad quantitatively is that W, is something that we'll call a loss function. And in this lecture we'll see a couple examples of different loss functions that you can use for this image classification problem. So then once we've got this idea of a loss function, this allows us to quantify for any given value of W, how good or bad is it? But then we actually need to find and come up with an efficient procedure for searching through the space of all possible Ws and actually come up with what is the correct value of W that is the least bad, and this process will be an optimization procedure and we'll talk more about that in this lecture. So I'm going to shrink this example a little bit because 10 classes is a little bit unwieldy. So we'll kind of work with this tiny toy data set of three examples and three classes going forward in this lecture. So again, in this example, the cat is maybe not so correctly classified, the car is correctly classified, and the frog, this setting of W got this frog image totally wrong, because the frog score is much lower than others. So to formalize this a little bit, usually when we talk about a loss function, we imagine that we have some training data set of xs and ys, usually N examples of these where the xs are the inputs to the algorithm in the image classification case, the xs would be the actually pixel values of your images, and the ys will be the things you want your algorithm to predict, we usually call these the labels or the targets. So in the case of image classification, remember we're trying to categorize each image for CIFAR10 to one of 10 categories, so the label y here will be an integer between one and 10 or maybe between zero and nine depending on what programming language you're using, but it'll be an integer telling you what is the correct category for each one of those images x. And now our loss function will denote L_i to denote the, so then we have this prediction function x which takes in our example x and our weight matrix W and makes some prediction for y, in the case of image classification these will be our 10 numbers. Then we'll define some loss function L_i which will take in the predicted scores coming out of the function f together with the true target or label Y and give us some quantitative value for how bad those predictions are for that training example. And now the final loss L will be the average of these losses summed over the entire data set over each of the N examples in our data set. So this is actually a very general formulation, and actually extends even beyond image classification. Kind of as we move forward and see other tasks, other examples of tasks and deep learning, the kind of generic setup is that for any task you have some xs and ys and you want to write down some loss function that quantifies exactly how happy you are with your particular parameter settings W and then you'll eventually search over the space of W to find the W that minimizes the loss on your training data. So as a first example of a concrete loss function that is a nice thing to work with in image classification, we'll talk about the multiclass SVM loss. You may have seen the binary SVM, our support vector machine in CS 229 and the multiclass SVM is a generalization of that to handle multiple classes. In the binary SVM case as you may have seen in 229, you only had two classes, each example x was going to be classified as either positive or negative example, but now we have 10 categories, so we need to generalize this notion to handle multiple classes. So this loss function has kind of a funny functional form, so we'll walk through it in a bit more, in quite a bit of detail over the next couple of slides. But what this is saying is that the loss L_i for any individual example, the way we'll compute it is we're going to perform a sum over all of the categories, Y, except for the true category, Y_i, so we're going to sum over all the incorrect categories, and then we're going to compare the score of the correct category, and the score of the incorrect category, and now if the score for the correct category is greater than the score of the incorrect category, greater than the incorrect score by some safety margin that we set to one, if that's the case that means that the true score is much, or the score for the true category is if it's much larger than any of the false categories, then we'll get a loss of zero. And we'll sum this up over all of the incorrect categories for our image and this will give us our final loss for this one example in the data set. And again we'll take the average of this loss over the whole training data set. So this kind of like if then statement, like if the true class score is much larger than the others, this kind of if then formulation we often compactify into this single max of zero S_j minus S_Yi plus one thing, but I always find that notation a little bit confusing, and it always helps me to write it out in this sort of case based notation to figure out exactly what the two cases are and what's going on. And by the way, this style of loss function where we take max of zero and some other quantity is often referred to as some type of a hinge loss, and this name comes from the shape of the graph when you go and plot it, so here the x axis corresponds to the S_Yi, that is the score of the true class for some training example, and now the y axis is the loss, and you can see that as the score for the true category for this example increases, then the loss will go down linearly until we get to above this safety margin, after which the loss will be zero because we've already correctly classified this example. So let's, oh, question?  [Student] Sorry, in terms of notation what is S underscore Yi? Is that your right score?  Yeah, so the question is in terms of notation, what is S and what is SYI in particular, so the Ss are the predicted scores for the classes that are coming out of the classifier. So if one is the cat class and two is the dog class then S1 and S2 would be the cat and dog scores respectively. And remember we said that Yi was the category of the ground truth label for the example which is some integer. So then S sub Y sub i, sorry for the double subscript, that corresponds to the score of the true class for the ith example in the training set. Question?  [Student] So what exactly is this computing?  Yeah the question is what exactly is this computing here? It's a little bit funny, I think it will become more clear when we walk through an explicit example, but in some sense what this loss is saying is that we are happy if the true score is much higher than all the other scores. It needs to be higher than all the other scores by some safety margin, and if the true score is not high enough, greater than any of the other scores, then we will incur some loss and that would be bad. So this might make a little bit more sense if we walk through an explicit example for this tiny three example data set. So here remember I've sort of removed the case space notation and just switching back to the zero one notation, and now if we look at, if we think about computing this multiclass SVM loss for just this first training example on the left, then remember we're going to loop over all of the incorrect classes, so for this example, cat is the correct class, so we're going to loop over the car and frog classes, and now for car, we're going to compare the, we're going to look at the car score, 5.1, minus the cat score, 3.2 plus one, when we're comparing cat and car we expect to incur some loss here because the car score is greater than the cat score which is bad. So for this one class, for this one example, we'll incur a loss of 2.9, and then when we go and compare the cat score and the frog score we see that cat is 3.2, frog is minus 1.7, so cat is more than one greater than frog, which means that between these two classes we incur zero loss. So then the multiclass SVM loss for this training example will be the sum of the losses across each of these pairs of classes, which will be 2.9 plus zero which is 2.9. Which is sort of saying that 2.9 is a quantitative measure of how much our classifier screwed up on this one training example. And then if we repeat this procedure for this next car image, then again the true class is car, so we're going to iterate over all the other categories when we compare the car and the cat score, we see that car is more than one greater than cat so we get no loss here. When we compare car and frog, we again see that the car score is more than one greater than frog, so we get again no loss here, and our total loss for this training example is zero. And now I think you hopefully get the picture by now, but, if you go look at frog, now frog, we again compare frog and cat, incur quite a lot of loss because the frog score is very low, compare frog and car, incur a lot of loss because the score is very low, and then our loss for this example is 12.9. And then our final loss for the entire data set is the average of these losses across the different examples, so when you sum those out it comes to about 5.3. So then it's sort of, this is our quantitative measure that our classifier is 5.3 bad on this data set. Is there a question?  [Student] How do you choose the plus one?  Yeah, the question is how do you choose the plus one? That's actually a really great question, it seems like kind of an arbitrary choice here, it's the only constant that appears in the loss function and that seems to offend your aesthetic sensibilities a bit maybe. But it turns out that this is somewhat of an arbitrary choice, because we don't actually care about the absolute values of the scores in this loss function, we only care about the relative differences between the scores. We only care that the correct score is much greater than the incorrect scores. So in fact if you imagine scaling up your whole W up or down, then it kind of rescales all the scores correspondingly and if you kind of work through the details and there's a detailed derivation of this in the course notes online, you find this choice of one actually doesn't matter. That this free parameter of one kind of washes out and is canceled with this scale, like the overall setting of the scale in W. And again, check the course notes for a bit more detail on that. So then I think it's kind of useful to think about a couple different questions to try to understand intuitively what this loss is doing. So the first question is what's going to happen to the loss if we change the scores of the car image just a little bit? Any ideas? Everyone's too scared to ask a question? Answer? [student speaking faintly]  Yeah, so the answer is that if we jiggle the scores for this car image a little bit, the loss will not change. So the SVM loss, remember, the only thing it cares about is getting the correct score to be greater than one more than the incorrect scores, but in this case, the car score is already quite a bit larger than the others, so if the scores for this class changed for this example changed just a little bit, this margin of one will still be retained and the loss will not change, we'll still get zero loss. The next question, what's the min and max possible loss for SVM? [student speaking faintly] Oh I hear some murmurs. So the minimum loss is zero, because if you can imagine that across all the classes, if our correct score was much larger then we'll incur zero loss across all the classes and it will be zero, and if you think back to this hinge loss plot that we had, then you can see that if the correct score goes very, very negative, then we could incur potentially infinite loss. So the min is zero and the max is infinity. Another question, sort of when you initialize these things and start training from scratch, usually you kind of initialize W with some small random values, as a result your scores tend to be sort of small uniform random values at the beginning of training. And then the question is that if all of your Ss, if all of the scores are approximately zero and approximately equal, then what kind of loss do you expect when you're using multiclass SVM?  [Student] Number of classes minus one.  Yeah, so the answer is number of classes minus one, because remember that if we're looping over all of the incorrect classes, so we're looping over C minus one classes, within each of those classes the two Ss will be about the same, so we'll get a loss of one because of the margin and we'll get C minus one. So this is actually kind of useful because when you, this is a useful debugging strategy when you're using these things, that when you start off training, you should think about what you expect your loss to be, and if the loss you actually see at the start of training at that first iteration is not equal to C minus one in this case, that means you probably have a bug and you should go check your code, so this is actually kind of a useful thing to be checking in practice. Another question, what happens if, so I said we're summing an SVM over the incorrect classes, what happens if the sum is also over the correct class if we just go over everything?  [Student] The loss increases by one.  Yeah, so the answer is that the loss increases by one. And I think the reason that we do this in practice is because normally loss of zero is kind of, has this nice interpretation that you're not losing at all, so that's nice, so I think your answers wouldn't really change, you would end up finding the same classifier if you actually looped over all the categories, but if just by conventions we omit the correct class so that our minimum loss is zero. So another question, what if we used mean instead of sum here?  [Student] Doesn't change.  Yeah, the answer is that it doesn't change. So the number of classes is going to be fixed ahead of time when we select our data set, so that's just rescaling the whole loss function by a constant, so it doesn't really matter, it'll sort of wash out with all the other scale things because we don't actually care about the true values of the scores, or the true value of the loss for that matter. So now here's another example, what if we change this loss formulation and we actually added a square term on top of this max? Would this end up being the same problem or would this be a different classification algorithm?  [Student] Different.  Yes, this would be different. So here the idea is that we're kind of changing the tradeoffs between good and badness in kind of a nonlinear way, so this would end up actually computing a different loss function. This idea of a squared hinge loss actually does get used sometimes in practice, so that's kind of another trick to have in your bag when you're making up your own loss functions for your own problems. So now you'll end up, oh, was there a question?  [Student] Why would you use a squared loss instead of a nonsquared loss?  Yeah, so the question is why would you ever consider using a squared loss instead of a nonsquared loss? And the whole point of a loss function is to kind of quantify how bad are different mistakes. And if the classifier is making different sorts of mistakes, how do we weigh off the different tradeoffs between different types of mistakes the classifier might make? So if you're using a squared loss, that sort of says that things that are very, very bad are now going to be squared bad so that's like really, really bad, like we don't want anything that's totally catastrophically misclassified, whereas if you're using this hinge loss, we don't actually care between being a little bit wrong and being a lot wrong, being a lot wrong kind of like, if an example is a lot wrong, and we increase it and make it a little bit less wrong, that's kind of the same goodness as an example which was only a little bit wrong and then increasing it to be a little bit more right. So that's a little bit hand wavy, but this idea of using a linear versus a square is a way to quantify how much we care about different categories of errors. And this is definitely something that you should think about when you're actually applying these things in practice, because the loss function is the way that you tell your algorithm what types of errors you care about and what types of errors it should trade off against. So that's actually super important in practice depending on your application. So here's just a little snippet of sort of vectorized code in numpy, and you'll end up implementing something like this for the first assignment, but this kind of gives you the sense that this sum is actually like pretty easy to implement in numpy, it only takes a couple lines of vectorized code. And you can see in practice, like one nice trick is that we can actually go in here and zero out the margins corresponding to the correct class, and that makes it easy to then just, that's sort of one nice vectorized trick to skip, iterate over all but one class. You just kind of zero out the one you want to skip and then compute the sum anyway, so that's a nice trick you might consider using on the assignment. So now, another question about this loss function. Suppose that you were lucky enough to find a W that has loss of zero, you're not losing at all, you're totally winning, this loss function is crushing it, but then there's a question, is this W unique or were there other Ws that could also have achieved zero loss?  [Student] There are other Ws.  Answer, yeah, so there are definitely other Ws. And in particular, because we talked a little bit about this thing of scaling the whole problem up or down depending on W, so you could actually take W multiplied by two and this doubled W (Is it quad U now? I don't know.) [laughing] This would also achieve zero loss. So as a concrete example of this, you can go back to your favorite example and maybe work through the numbers a little bit later, but if you're taking W and we double W, then the margins between the correct and incorrect scores will also double. So that means that if all these margins were already greater than one, and we doubled them, they're still going to be greater than one, so you'll still have zero loss. And this is kind of interesting, because if our loss function is the way that we tell our classifier which W we want and which W we care about, this is a little bit weird, now there's this inconsistency and how is the classifier to choose between these different versions of W that all achieve zero loss? And that's because what we've done here is written down only a loss in terms of the data, and we've only told our classifier that it should try to find the W that fits the training data. But really in practice, we don't actually care that much about fitting the training data, the whole point of machine learning is that we use the training data to find some classifier and then we'll apply that thing on test data. So we don't really care about the training data performance, we really care about the performance of this classifier on test data. So as a result, if the only thing we're telling our classifier to do is fit the training data, then we can lead ourselves into some of these weird situations sometimes, where the classifier might have unintuitive behavior. So a concrete, canonical example of this sort of thing, by the way, this is not linear classification anymore, this is a little bit of a more general machine learning concept, is that suppose we have this data set of blue points, and we're going to fit some curve to the training data, the blue points, then if the only thing we've told our classifier to do is to try and fit the training data, it might go in and have very wiggly curves to try to perfectly classify all of the training data points. But this is bad, because we don't actually care about this performance, we care about the performance on the test data. So now if we have some new data come in that sort of follows the same trend, then this very wiggly blue line is going to be totally wrong. And in fact, what we probably would have preferred the classifier to do was maybe predict this straight green line, rather than this very complex wiggly line to perfectly fit all the training data. And this is a core fundamental problem in machine learning, and the way we usually solve it, is this concept of regularization. So here we're going to add an additional term to the loss function. In addition to the data loss, which will tell our classifier that it should fit the training data, we'll also typically add another term to the loss function called a regularization term, which encourages the model to somehow pick a simpler W, where the concept of simple kind of depends on the task and the model. There's this whole idea of Occam's Razor, which is this fundamental idea in scientific discovery more broadly, which is that if you have many different competing hypotheses, that could explain your observations, you should generally prefer the simpler one, because that's the explanation that is more likely to generalize to new observations in the future. And the way we operationalize this intuition in machine learning is typically through some explicit regularization penalty that's often written down as R. So then your standard loss function usually has these two terms, a data loss and a regularization loss, and there's some hyperparameter here, lambda, that trades off between the two. And we talked about hyperparameters and crossvalidation in the last lecture, so this regularization hyperparameter lambda will be one of the more important ones that you'll need to tune when training these models in practice. Question?  [Student] What does that lambda R W term have to do with [speaking faintly].  Yeah, so the question is, what's the connection between this lambda R W term and actually forcing this wiggly line to become a straight green line? I didn't want to go through the derivation on this because I thought it would lead us too far astray, but you can imagine, maybe you're doing a regression problem, in terms of different polynomial basis functions, and if you're adding this regression penalty, maybe the model has access to polynomials of very high degree, but through this regression term you could encourage the model to prefer polynomials of lower degree, if they fit the data properly, or if they fit the data relatively well. So you could imagine there's two ways to do this, either you can constrain your model class to just not contain the more powerful, more complex models, or you can add this soft penalty where the model still has access to more complex models, maybe high degree polynomials in this case, but you add this soft constraint saying that if you want to use these more complex models, you need to overcome this penalty for using their complexity. So that's the connection here, that is not quite linear classification, this is the picture that many people have in mind when they think about regularization at least. So there's actually a lot of different types of regularization that get used in practice. The most common one is probably L2 regularization, or weight decay. But there's a lot of other ones that you might see. This L2 regularization is just the euclidean norm of this weight vector W, or sometimes the squared norm. Or sometimes half the squared norm because it makes your derivatives work out a little bit nicer. But the idea of L2 regularization is you're just penalizing the euclidean norm of this weight vector. You might also sometimes see L1 regularization, where we're penalizing the L1 norm of the weight vector, and the L1 regularization has some nice properties like encouraging sparsity in this matrix W. Some other things you might see would be this elastic net regularization, which is some combination of L1 and L2. You sometimes see max norm regularization, penalizing the max norm rather than the L1 or L2 norm. But these sorts of regularizations are things that you see not just in deep learning, but across many areas of machine learning and even optimization more broadly. In some later lectures, we'll also see some types of regularization that are more specific to deep learning. For example dropout, we'll see in a couple lectures, or batch normalization, stochastic depth, these things get kind of crazy in recent years. But the whole idea of regularization is just any thing that you do to your model, that sort of penalizes somehow the complexity of the model, rather than explicitly trying to fit the training data. Question? [student speaking faintly] Yeah, so the question is, how does the L2 regularization measure the complexity of the model? Thankfully we have an example of that right here, maybe we can walk through. So here we maybe have some training example, x, and there's two different Ws that we're considering. So x is just this vector of four ones, and we're considering these two different possibilities for W. One is a one in the first, one is a single one and three zeros, and the other has this 0.25 spread across the four different entries. And now, when we're doing linear classification, we're really taking dot products between our x and our W. So in terms of linear classification, these two Ws are the same, because they give the same result when dot producted with x. But now the question is, if you look at these two examples, which one would L2 regression prefer? Yeah, so L2 regression would prefer W2, because it has a smaller norm. So the answer is that the L2 regression measures complexity of the classifier in this relatively coarse way, where the idea is that, remember the Ws in linear classification had this interpretation of how much does this value of the vector x correspond to this output class? So L2 regularization is saying that it prefers to spread that influence across all the different values in x. Maybe this might be more robust, in case you come up with xs that vary, then our decisions are spread out and depend on the entire x vector, rather than depending only on certain elements of the x vector. And by the way, L1 regularization has this opposite interpretation. So actually if we were using L1 regularization, then we would actually prefer W1 over W2, because L1 regularization has this different notion of complexity, saying that maybe the model is less complex, maybe we measure model complexity by the number of zeros in the weight vector, so the question of how do we measure complexity and how does L2 measure complexity? They're kind of problem dependent. And you have to think about for your particular setup, for your particular model and data, how do you think that complexity should be measured on this task? Question?  [Student] So why would L1 prefer W1? Don't they sum to the same one?  Oh yes, you're right. So in this case, L1 is actually the same between these two. But you could construct a similar example to this where W1 would be preferred by L1 regularization. I guess the general intuition behind L1 is that it generally prefers sparse solutions, that it drives all your entries of W to zero for most of the entries, except for a couple where it's allowed to deviate from zero. The way of measuring complexity for L1 is maybe the number of nonzero entries, and then for L2, it thinks that things that spread the W across all the values are less complex. So it depends on your data, depends on your problem. Oh and by the way, if you're a hardcore Bayesian, then using L2 regularization has this nice interpretation of MAP inference under a Gaussian prior on the parameter vector. I think there was a homework problem about that in 229, but we won't talk about that for the rest of the quarter. That's sort of my long, deep dive into the multiclass SVM loss. Question?  [Student] Yeah, so I'm still confused about what the kind of stuff I need to do when the linear versus polynomial thing, because the use of this loss function isn't going to change the fact that you're just doing, you're looking at a linear classifier, right?  Yeah, so the question is that, adding a regularization is not going to change the hypothesis class. This is not going to change us away from a linear classifier. The idea is that maybe this example of this polynomial regression is definitely not linear regression. That could be seen as linear regression on top of a polynomial expansion of the input, and in which case, this regression sort of says that you're not allowed to use as many polynomial coefficients as maybe you should have. Right, so you can imagine this is like, when you're doing polynomial regression, you can write out a polynomial as f of x equals A zero plus A one x plus A two x squared plus A three x whatever, in that case your parameters, your Ws, would be these As, in which case, penalizing the W could force it towards lower degree polynomials. Except in the case of polynomial regression, you don't actually want to parameterize in terms of As, there's some other paramterization that you want to use, but that's the general idea, that you're sort of penalizing the parameters of the model to force it towards the simpler hypotheses within your hypothesis class. And maybe we can take this offline if that's still a bit confusing. So then we've sort of seen this multiclass SVM loss, and just by the way as a side note, this is one extension or generalization of the SVM loss to multiple classes, there's actually a couple different formulations that you can see around in literature, but I mean, my intuition is that they all tend to work similarly in practice, at least in the context of deep learning. So we'll stick with this one particular formulation of the multiclass SVM loss in this class. But of course there's many different loss functions you might imagine. And another really popular choice, in addition to the multiclass SVM loss, another really popular choice in deep learning is this multinomial logistic regression, or a softmax loss. And this one is probably actually a bit more common in the context of deep learning, but I decided to present this second for some reason. So remember in the context of the multiclass SVM loss, we didn't actually have an interpretation for those scores. Remember, when we're doing some classification, our model F, spits our these 10 numbers, which are our scores for the classes, and for the multiclass SVM, we didn't actually give much interpretation to those scores. We just said that we want the true score, the score of the correct class to be greater than the incorrect classes, and beyond that we don't really say what those scores mean. But now, for the multinomial logistic regression loss function, we actually will endow those scores with some additional meaning. And in particular we're going to use those scores to compute a probability distribution over our classes. So we use this socalled softmax function where we take all of our scores, we exponentiate them so that now they become positive, then we renormalize them by the sum of those exponents so now after we send our scores through this softmax function, now we end up with this probability distribution, where now we have probabilities over our classes, where each probability is between zero and one, and the sum of probabilities across all classes sum to one. And now the interpretation is that we want, there's this computed probability distribution that's implied by our scores, and we want to compare this with the target or true probability distribution. So if we know that the thing is a cat, then the target probability distribution would put all of the probability mass on cat, so we would have probability of cat equals one, and zero probability for all the other classes. So now what we want to do is encourage our computed probability distribution that's coming out of this softmax function to match this target probability distribution that has all the mass on the correct class. And the way that we do this, I mean, you can do this equation in many ways, you can do this as a KL divergence between the target and the computed probability distribution, you can do this as a maximum likelihood estimate, but at the end of the day, what we really want is that the probability of the true class is high and as close to one. So then our loss will now be the negative log of the probability of the true class. This is confusing 'cause we're putting this through multiple different things, but remember we wanted the probability to be close to one, so now log is a monotonic function, it goes like this, and it turns out mathematically, it's easier to maximize log than it is to maximize the raw probability, so we stick with log. And now log is monotonic, so if we maximize log P of correct class, that means we want that to be high, but loss functions measure badness not goodness so we need to put in the minus one to make it go the right way. So now our loss function for SVM is going to be the minus log of the probability of the true class. Yeah, so that's the summary here, is that we take our scores, we run through the softmax, and now our loss is this minus log of the probability of the true class. Okay, so then if you look at what this looks like on a concrete example, then we go back to our favorite beautiful cat with our three examples and we've got these three scores that are coming out of our linear classifier, and these scores are exactly the way that they were in the context of the SVM loss. But now, rather than taking these scores and putting them directly into our loss function, we're going to take them all and exponentiate them so that they're all positive, and then we'll normalize them to make sure that they all sum to one. And now our loss will be the minus log of the true class score. So that's the softmax loss, also called multinomial logistic regression. So now we asked several questions to try to gain intuition about the multiclass SVM loss, and it's useful to think about some of the same questions to contrast with the softmax loss. So then the question is, what's the min and max value of the softmax loss? Okay, maybe not so sure, there's too many logs and sums and stuff going on in here. So the answer is that the min loss is zero and the max loss is infinity. And the way that you can see this, the probability distribution that we want is one on the correct class, zero on the incorrect classes, the way that we do that is, so if that were the case, then this thing inside the log would end up being one, because it's the log probability of the true class, then log of one is zero, minus log of one is still zero. So that means that if we got the thing totally right, then our loss would be zero. But by the way, in order to get the thing totally right, what would our scores have to look like? Murmuring, murmuring. So the scores would actually have to go quite extreme, like towards infinity. So because we actually have this exponentiation, this normalization, the only way we can actually get a probability distribution of one and zero, is actually putting an infinite score for the correct class, and minus infinity score for all the incorrect classes. And computers don't do so well with infinities, so in practice, you'll never get to zero loss on this thing with finite precision. But you still have this interpretation that zero is the theoretical minimum loss here. And the maximum loss is unbounded. So suppose that if we had zero probability mass on the correct class, then you would have minus log of zero, log of zero is minus infinity, so minus log of zero would be plus infinity, so that's really bad. But again, you'll never really get here because the only way you can actually get this probability to be zero, is if e to the correct class score is zero, and that can only happen if that correct class score is minus infinity. So again, you'll never actually get to these minimum, maximum values with finite precision. So then, remember we had this debugging, sanity check question in the context of the multiclass SVM, and we can ask the same for the softmax. If all the Ss are small and about zero, then what is the loss here? Yeah, answer?  [Student] Minus log one over C.  So minus log of one over C? I think that's, yeah, so then it'd be minus log of one over C, because log can flip the thing so then it's just log of C. Yeah, so it's just log of C. And again, this is a nice debugging thing, if you're training a model with this softmax loss, you should check at the first iteration. If it's not log C, then something's gone wrong. So then we can compare and contrast these two loss functions a bit. In terms of linear classification, this setup looks the same. We've got this W matrix that gets multiplied against our input to produce this specter of scores, and now the difference between the two loss functions is how we choose to interpret those scores to quantitatively measure the badness afterwards. So for SVM, we were going to go in and look at the margins between the scores of the correct class and the scores of the incorrect class, whereas for this softmax or crossentropy loss, we're going to go and compute a probability distribution and then look at the minus log probability of the correct class. So sometimes if you look at, in terms of, nevermind, I'll skip that point. [laughing] So another question that's interesting when contrasting these two loss functions is thinking, suppose that I've got this example point, and if you change its scores, so assume that we've got three scores for this, ignore the part on the bottom. But remember if we go back to this example where in the multiclass SVM loss, when we had the car, and the car score was much better than all the incorrect classes, then jiggling the scores for that car image didn't change the multiclass SVM loss at all, because the only thing that the SVM loss cared about was getting that correct score to be greater than a margin above the incorrect scores. But now the softmax loss is actually quite different in this respect. The softmax loss actually always wants to drive that probability mass all the way to one. So even if you're giving very high score to the correct class, and very low score to all the incorrect classes, softmax will want you to pile more and more probability mass on the correct class, and continue to push the score of that correct class up towards infinity, and the score of the incorrect classes down towards minus infinity. So that's the interesting difference between these two loss functions in practice. That SVM, it'll get this data point over the bar to be correctly classified and then just give up, it doesn't care about that data point any more. Whereas softmax will just always try to continually improve every single data point to get better and better and better and better. So that's an interesting difference between these two functions. In practice, I think it tends not to make a huge difference which one you choose, they tend to perform pretty similarly across, at least a lot of deep learning applications. But it is very useful to keep some of these differences in mind. Yeah, so to recap where we've come to from here, is that we've got some data set of xs and ys, we use our linear classifier to get some score function, to compute our scores S, from our inputs, x, and then we'll use a loss function, maybe softmax or SVM or some other loss function to compute how quantitatively bad were our predictions compared to this ground true targets, y. And then we'll often augment this loss function with a regularization term, that tries to trade off between fitting the training data and preferring simpler models. So this is a pretty generic overview of a lot of what we call supervised learning, and what we'll see in deep learning as we move forward, is that generally you'll want to specify some function, f, that could be very complex in structure, specify some loss function that determines how well your algorithm is doing, given any value of the parameters, some regularization term for how to penalize model complexity and then you combine these things together and try to find the W that minimizes this final loss function. But then the question is, how do we actually go about doing that? How do we actually find this W that minimizes the loss? And that leads us to the topic of optimization. So when we're doing optimization, I usually think of things in terms of walking around some large valley. So the idea is that you're walking around this large valley with different mountains and valleys and streams and stuff, and every point on this landscape corresponds to some setting of the parameters W. And you're this little guy who's walking around this valley, and you're trying to find, and the height of each of these points, sorry, is equal to the loss incurred by that setting of W. And now your job as this little man walking around this landscape, you need to somehow find the bottom of this valley. And this is kind of a hard problem in general. You might think, maybe I'm really smart and I can think really hard about the analytic properties of my loss function, my regularization all that, maybe I can just write down the minimizer, and that would sort of correspond to magically teleporting all the way to the bottom of this valley. But in practice, once your prediction function, f, and your loss function and your regularizer, once these things get big and complex and using neural networks, there's really not much hope in trying to write down an explicit analytic solution that takes you directly to the minima. So in practice we tend to use various types of iterative methods where we start with some solution and then gradually improve it over time. So the very first, stupidest thing that you might imagine is random search, that will just take a bunch of Ws, sampled randomly, and throw them into our loss function and see how well they do. So spoiler alert, this is a really bad algorithm, you probably shouldn't use this, but at least it's one thing you might imagine trying. And we can actually do this, we can actually try to train a linear classifier via random search, for CIFAR10 and for this there's 10 classes, so random chance is 10%, and if we did some number of random trials, we eventually found just through sheer dumb luck, some setting of W that got maybe 15% accuracy. So it's better than random, but state of the art is maybe 95% so we've got a little bit of gap to close here. So again, probably don't use this in practice, but you might imagine that this is something you could potentially do. So in practice, maybe a better strategy is actually using some of the local geometry of this landscape. So if you're this little guy who's walking around this landscape, maybe you can't see directly the path down to the bottom of the valley, but what you can do is feel with your foot and figure out what is the local geometry, if I'm standing right here, which way will take me a little bit downhill? So you can feel with your feet and feel where is the slope of the ground taking me down a little bit in this direction? And you can take a step in that direction, and then you'll go down a little bit, feel again with your feet to figure out which way is down, and then repeat over and over again and hope that you'll end up at the bottom of the valley eventually. So this also seems like a relatively simple algorithm, but actually this one tends to work really well in practice if you get all the details right. So this is generally the strategy that we'll follow when training these large neural networks and linear classifiers and other things. So then, that was a little hand wavy, so what is slope? If you remember back to your calculus class, then at least in one dimension, the slope is the derivative of this function. So if we've got some onedimensional function, f, that takes in a scalar x, and then outputs the height of some curve, then we can compute the slope or derivative at any point by imagining, if we take a small step, h, in any direction, take a small step, h, and compare the difference in the function value over that step and then drag the step size to zero, that will give us the slope of that function at that point. And this generalizes quite naturally to multivariable functions as well. So in practice, our x is maybe not a scalar but a whole vector, 'cause remember, x might be a whole vector, so we need to generalize this notion to multivariable things. And the generalization that we use of the derivative in the multivariable setting is the gradient, so the gradient is a vector of partial derivatives. So the gradient will have the same shape as x, and each element of the gradient will tell us what is the slope of the function f, if we move in that coordinate direction. And the gradient turns out to have these very nice properties, so the gradient is now a vector of partial derivatives, but it points in the direction of greatest increase of the function and correspondingly, if you look at the negative gradient direction, that gives you the direction of greatest decrease of the function. And more generally, if you want to know, what is the slope of my landscape in any direction? Then that's equal to the dot product of the gradient with the unit vector describing that direction. So this gradient is super important, because it gives you this linear, firstorder approximation to your function at your current point. So in practice, a lot of deep learning is about computing gradients of your functions and then using those gradients to iteratively update your parameter vector. So one naive way that you might imagine actually evaluating this gradient on a computer, is using the method of finite differences, going back to the limit definition of gradient. So here on the left, we imagine that our current W is this parameter vector that maybe gives us some current loss of maybe 1.25 and our goal is to compute the gradient, dW, which will be a vector of the same shape as W, and each slot in that gradient will tell us how much will the loss change is we move a tiny, infinitesimal amount in that coordinate direction. So one thing you might imagine is just computing these finite differences, that we have our W, we might try to increment the first element of W by a small value, h, and then recompute the loss using our loss function and our classifier and all that. And maybe in this setting, if we move a little bit in the first dimension, then our loss will decrease a little bit from 1.2534 to 1.25322. And then we can use this limit definition to come up with this finite differences approximation to the gradient in this first dimension. And now you can imagine repeating this procedure in the second dimension, where now we take the first dimension, set it back to the original value, and now increment the second direction by a small step. And again, we compute the loss and use this finite differences approximation to compute an approximation to the gradient in the second slot. And now repeat this again for the third, and on and on and on. So this is actually a terrible idea because it's super slow. So you might imagine that computing this function, f, might actually be super slow if it's a large, convolutional neural network. And this parameter vector, W, probably will not have 10 entries like it does here, it might have tens of millions or even hundreds of millions for some of these large, complex deep learning models. So in practice, you'll never want to compute your gradients for your finite differences, 'cause you'd have to wait for hundreds of millions potentially of function evaluations to get a single gradient, and that would be super slow and super bad. But thankfully we don't have to do that. Hopefully you took a calculus course at some point in your lives, so you know that thanks to these guys, we can just write down the expression for our loss and then use the magical hammer of calculus to just write down an expression for what this gradient should be. And this'll be way more efficient than trying to compute it analytically via finite differences. One, it'll be exact, and two, it'll be much faster since we just need to compute this single expression. So what this would look like is now, if we go back to this picture of our current W, rather than iterating over all the dimensions of W, we'll figure out ahead of time what is the analytic expression for the gradient, and then just write it down and go directly from the W and compute the dW or the gradient in one step. And that will be much better in practice. So in summary, this numerical gradient is something that's simple and makes sense, but you won't really use it in practice. In practice, you'll always take an analytic gradient and use that when actually performing these gradient computations. However, one interesting note is that these numeric gradients are actually a very useful debugging tool. Say you've written some code, and you wrote some code that computes the loss and the gradient of the loss, then how do you debug this thing? How do you make sure that this analytic expression that you derived and wrote down in code is actually correct? So a really common debugging strategy for these things is to use the numeric gradient as a way, as sort of a unit test to make sure that your analytic gradient was correct. Again, because this is super slow and inexact, then when doing this numeric gradient checking, as it's called, you'll tend to scale down the parameter of the problem so that it actually runs in a reasonable amount of time. But this ends up being a super useful debugging strategy when you're writing your own gradient computations. So this is actually very commonly used in practice, and you'll do this on your assignments as well. So then once we know how to compute the gradient, then it leads us to this super simple algorithm that's like three lines, but turns out to be at the heart of how we train even these very biggest, most complex deep learning algorithms, and that's gradient descent. So gradient descent is first we initialize our W as some random thing, then while true, we'll compute our loss and our gradient and then we'll update our weights in the opposite of the gradient direction, 'cause remember that the gradient was pointing in the direction of greatest increase of the function, so minus gradient points in the direction of greatest decrease, so we'll take a small step in the direction of minus gradient, and just repeat this forever and eventually your network will converge and you'll be very happy, hopefully. But this step size is actually a hyperparameter, and this tells us that every time we compute the gradient, how far do we step in that direction. And this step size, also sometimes called a learning rate, is probably one of the single most important hyperparameters that you need to set when you're actually training these things in practice. Actually for me when I'm training these things, trying to figure out this step size or this learning rate, is the first hyperparameter that I always check. Things like model size or regularization strength I leave until a little bit later, and getting the learning rate or the step size correct is the first thing that I try to set at the beginning. So pictorially what this looks like here's a simple example in two dimensions. So here we've got maybe this bowl that's showing our loss function where this red region in the center is this region of low loss we want to get to and these blue and green regions towards the edge are higher loss that we want to avoid. So now we're going to start of our W at some random point in the space, and then we'll compute the negative gradient direction, which will hopefully point us in the direction of the minima eventually. And if we repeat this over and over again, we'll hopefully eventually get to the exact minima. And what this looks like in practice is, oh man, we've got this mouse problem again. So what this looks like in practice is that if we repeat this thing over and over again, then we will start off at some point and eventually, taking tiny gradient steps each time, you'll see that the parameter will arc in toward the center, this region of minima, and that's really what you want, because you want to get to low loss. And by the way, as a bit of a teaser, we saw in the previous slide, this example of very simple gradient descent, where at every step, we're just stepping in the direction of the gradient. But in practice, over the next couple of lectures, we'll see that there are slightly fancier step, what they call these update rules, where you can take slightly fancier things to incorporate gradients across multiple time steps and stuff like that, that tend to work a little bit better in practice and are used much more commonly than this vanilla gradient descent when training these things in practice. And then, as a bit of a preview, we can look at some of these slightly fancier methods on optimizing the same problem. So again, the black will be this same gradient computation, and these, I forgot which color they are, but these two other curves are using slightly fancier update rules to decide exactly how to use the gradient information to make our next step. So one of these is gradient descent with momentum, the other is this Adam optimizer, and we'll see more details about those later in the course. But the idea is that we have this very basic algorithm called gradient descent, where we use the gradient at every time step to determine where to step next, and there exist different update rules which tell us how exactly do we use that gradient information. But it's all the same basic algorithm of trying to go downhill at every time step. But there's actually one more little wrinkle that we should talk about. So remember that we defined our loss function, we defined a loss that computes how bad is our classifier doing at any single training example, and then we said that our full loss over the data set was going to be the average loss across the entire training set. But in practice, this N could be very very large. If we're using the image net data set for example, that we talked about in the first lecture, then N could be like 1.3 million, so actually computing this loss could be actually very expensive and require computing perhaps millions of evaluations of this function. So that could be really slow. And actually, because the gradient is a linear operator, when you actually try to compute the gradient of this expression, you see that the gradient of our loss is now the sum of the gradient of the losses for each of the individual terms. So now if we want to compute the gradient again, it sort of requires us to iterate over the entire training data set all N of these examples. So if our N was like a million, this would be super super slow, and we would have to wait a very very long time before we make any individual update to W. So in practice, we tend to use what is called stochastic gradient descent, where rather than computing the loss and gradient over the entire training set, instead at every iteration, we sample some small set of training examples, called a minibatch. Usually this is a power of two by convention, like 32, 64, 128 are common numbers, and then we'll use this small minibatch to compute an estimate of the full sum, and an estimate of the true gradient. And now this is stochastic because you can view this as maybe a Monte Carlo estimate of some expectation of the true value. So now this makes our algorithm slightly fancier, but it's still only four lines. So now it's well true, sample some random minibatch of data, evaluate your loss and gradient on the minibatch, and now make an update on your parameters based on this estimate of the loss, and this estimate of the gradient. And again, we'll see slightly fancier update rules of exactly how to integrate multiple gradients over time, but this is the basic training algorithm that we use for pretty much all deep neural networks in practice. So we have another interactive web demo actually playing around with linear classifiers, and training these things via stochastic gradient descent, but given how miserable the web demo was last time, I'm not actually going to open the link. Instead, I'll just play this video. [laughing] But I encourage you to go check this out and play with it online, because it actually helps to build some intuition about linear classifiers and training them via gradient descent. So here you can see on the left, we've got this problem where we're categorizing three different classes, and we've got these green, blue and red points that are our training samples from these three classes. And now we've drawn the decision boundaries for these classes, which are the colored background regions, as well as these directions, giving you the direction of increase for the class scores for each of these three classes. And now if you see, if you actually go and play with this thing online, you can see that we can go in and adjust the Ws and changing the values of the Ws will cause these decision boundaries to rotate. If you change the biases, then the decision boundaries will not rotate, but will instead move side to side or up and down. Then we can actually make steps that are trying to update this loss, or you can change the step size with this slider. You can hit this button to actually run the thing. So now with a big step size, we're running gradient descent right now, and these decision boundaries are flipping around and trying to fit the data. So it's doing okay now, but we can actually change the loss function in real time between these different SVM formulations and the different softmax. And you can see that as you flip between these different formulations of loss functions, it's generally doing the same thing. Our decision regions are mostly in the same place, but exactly how they end up relative to each other and exactly what the tradeoffs are between categorizing these different things changes a little bit. So I really encourage you to go online and play with this thing to try to get some intuition for what it actually looks like to try to train these linear classifiers via gradient descent. Now as an aside, I'd like to talk about another idea, which is that of image features. So so far we've talked about linear classifiers, which is just maybe taking our raw image pixels and then feeding the raw pixels themselves into our linear classifier. But as we talked about in the last lecture, this is maybe not such a great thing to do, because of things like multimodality and whatnot. So in practice, actually feeding raw pixel values into linear classifiers tends to not work so well. So it was actually common before the dominance of deep neural networks, was instead to have this twostage approach, where first, you would take your image and then compute various feature representations of that image, that are maybe computing different kinds of quantities relating to the appearance of the image, and then concatenate these different feature vectors to give you some feature representation of the image, and now this feature representation of the image would be fed into a linear classifier, rather than feeding the raw pixels themselves into the classifier. And the motivation here is that, so imagine we have a training data set on the left of these red points, and red points in the middle and blue points around that. And for this kind of data set, there's no way that we can draw a linear decision boundary to separate the red points from the blue points. And we saw more examples of this in the last lecture. But if we use a clever feature transform, in this case transforming to polar coordinates, then now after we do the feature transform, then this complex data set actually might become linearly separable, and actually could be classified correctly by a linear classifier. And the whole trick here now is to figure out what is the right feature transform that is computing the right quantities for the problem that you care about. So for images, maybe converting your pixels to polar coordinates, doesn't make sense, but you actually can try to write down feature representations of images that might make sense, and actually might help you out and might do better than putting in raw pixels into the classifier. So one example of this kind of feature representation that's super simple, is this idea of a color histogram. So you'll take maybe each pixel, you'll take this hue color spectrum and divide it into buckets and then for every pixel, you'll map it into one of those color buckets and then count up how many pixels fall into each of these different buckets. So this tells you globally what colors are in the image. Maybe if this example of a frog, this feature vector would tell us there's a lot of green stuff, and maybe not a lot of purple or red stuff. And this is kind of a simple feature vector that you might see in practice. Another common feature vector that we saw before the rise of neural networks, or before the dominance of neural networks was this histogram of oriented gradients. So remember from the first lecture, that Hubel and Wiesel found these oriented edges are really important in the human visual system, and this histogram of oriented gradients feature representation tries to capture the same intuition and measure the local orientation of edges on the image. So what this thing is going to do, is take our image and then divide it into these little eight by eight pixel regions. And then within each of those eight by eight pixel regions, we'll compute what is the dominant edge direction of each pixel, quantize those edge directions into several buckets and then within each of those regions, compute a histogram over these different edge orientations. And now your fullfeature vector will be these different bucketed histograms of edge orientations across all the different eight by eight regions in the image. So this is in some sense dual to the color histogram classifier that we saw before. So color histogram is saying, globally, what colors exist in the image, and this is saying, overall, what types of edge information exist in the image. And even localized to different parts of the image, what types of edges exist in different regions. So maybe for this frog on the left, you can see he's sitting on a leaf, and these leaves have these dominant diagonal edges, and if you visualize the histogram of oriented gradient features, then you can see that in this region, we've got a lot of diagonal edges, that this histogram of oriented gradient feature representation's capturing. So this was a super common feature representation and was used a lot for object recognition actually not too long ago. Another feature representation that you might see out there is this idea of bag of words. So this is taking inspiration from natural language processing. So if you've got a paragraph, then a way that you might represent a paragraph by a feature vector is counting up the occurrences of different words in that paragraph. So we want to take that intuition and apply it to images in some way. But the problem is that there's no really simple, straightforward analogy of words to images, so we need to define our own vocabulary of visual words. So we take this twostage approach, where first we'll get a bunch of images, sample a whole bunch of tiny random crops from those images and then cluster them using something like K means to come up with these different cluster centers that are maybe representing different types of visual words in the images. So if you look at this example on the right here, this is a real example of clustering actually different image patches from images, and you can see that after this clustering step, our visual words capture these different colors, like red and blue and yellow, as well as these different types of oriented edges in different directions, which is interesting that now we're starting to see these oriented edges come out from the data in a datadriven way. And now, once we've got these set of visual words, also called a codebook, then we can encode our image by trying to say, for each of these visual words, how much does this visual word occur in the image? And now this gives us, again, some slightly different information about what is the visual appearance of this image. And actually this is a type of feature representation that FeiFei worked on when she was a grad student, so this is something that you saw in practice not too long ago. So then as a bit of teaser, tying this all back together, the way that this image classification pipeline might have looked like, maybe about five to 10 years ago, would be that you would take your image, and then compute these different feature representations of your image, things like bag of words, or histogram of orientated gradients, concatenate a whole bunch of features together, and then feed these feature extractors down into some linear classifier. I'm simplifying a little bit, the pipelines were a little bit more complex than that, but this is the general intuition. And then the idea here was that after you extracted these features, this feature extractor would be a fixed block that would not be updated during training. And during training, you would only update the linear classifier if it's working on top of features. And actually, I would argue that once we move to convolutional neural networks, and these deep neural networks, it actually doesn't look that different. The only difference is that rather than writing down the features ahead of time, we're going to learn the features directly from the data. So we'll take our raw pixels and feed them into this to convolutional network, which will end up computing through many different layers some type of feature representation driven by the data, and then we'll actually train this entire weights for this entire network, rather than just the weights of linear classifier on top. So, next time we'll really start diving into this idea in more detail, and we'll introduce some neural networks, and start talking about backpropagation as well.
Contents
Examples
Regret
Leonard J. Savage argued that using nonBayesian methods such as minimax, the loss function should be based on the idea of regret, i.e., the loss associated with a decision should be the difference between the consequences of the best decision that could have been made had the underlying circumstances been known and the decision that was in fact taken before they were known.
Quadratic loss function
The use of a quadratic loss function is common, for example when using least squares techniques. It is often more mathematically tractable than other loss functions because of the properties of variances, as well as being symmetric: an error above the target causes the same loss as the same magnitude of error below the target. If the target is t, then a quadratic loss function is
for some constant C; the value of the constant makes no difference to a decision, and can be ignored by setting it equal to 1.
Many common statistics, including ttests, regression models, design of experiments, and much else, use least squares methods applied using linear regression theory, which is based on the quadratric loss function.
The quadratic loss function is also used in linearquadratic optimal control problems. In these problems, even in the absence of uncertainty, it may not be possible to achieve the desired values of all target variables. Often loss is expressed as a quadratic form in the deviations of the variables of interest from their desired values; this approach is tractable because it results in linear firstorder conditions. In the context of stochastic control, the expected value of the quadratic form is used.
01 loss function
In statistics and decision theory, a frequently used loss function is the 01 loss function
where is the indicator function.
Expected loss
In some contexts, the value of the loss function itself is a random quantity because it depends on the outcome of a random variable X.
Statistics
Both frequentist and Bayesian statistical theory involve making a decision based on the expected value of the loss function; however, this quantity is defined differently under the two paradigms.
Frequentist expected loss
We first define the expected loss in the frequentist context. It is obtained by taking the expected value with respect to the probability distribution, P_{θ}, of the observed data, X. This is also referred to as the risk function^{[3]}^{[4]}^{[5]}^{[6]} of the decision rule δ and the parameter θ. Here the decision rule depends on the outcome of X. The risk function is given by:
Here, θ is a fixed but possibly unknown state of nature, X is a vector of observations stochastically drawn from a population, is the expectation over all population values of X, dP_{θ} is a probability measure over the event space of X (parametrized by θ) and the integral is evaluated over the entire support of X.
Bayesian expected loss
In a Bayesian approach, the expectation is calculated using the posterior distribution π^{*} of the parameter θ:
One then should choose the action a^{*} which minimises the expected loss. Although this will result in choosing the same action as would be chosen using the frequentist risk, the emphasis of the Bayesian approach is that one is only interested in choosing the optimal action under the actual observed data, whereas choosing the actual frequentist optimal decision rule, which is a function of all possible observations, is a much more difficult problem.
Examples in statistics
 For a scalar parameter θ, a decision function whose output is an estimate of θ, and a quadratic loss function (squared error loss)
 the risk function becomes the mean squared error of the estimate,
 In density estimation, the unknown parameter is probability density itself. The loss function is typically chosen to be a norm in an appropriate function space. For example, for L^{2} norm,
 the risk function becomes the mean integrated squared error
Economic choice under uncertainty
In economics, decisionmaking under uncertainty is often modelled using the von NeumannMorgenstern utility function of the uncertain variable of interest, such as endofperiod wealth. Since the value of this variable is uncertain, so is the value of the utility function; it is the expected value of utility that is maximized.
Decision rules
A decision rule makes a choice using an optimality criterion. Some commonly used criteria are:
 Minimax: Choose the decision rule with the lowest worst loss — that is, minimize the worstcase (maximum possible) loss:
 Invariance: Choose the optimal decision rule which satisfies an invariance requirement.
 Choose the decision rule with the lowest average loss (i.e. minimize the expected value of the loss function):
Selecting a loss function
Sound statistical practice requires selecting an estimator consistent with the actual acceptable variation experienced in the context of a particular applied problem. Thus, in the applied use of loss functions, selecting which statistical method to use to model an applied problem depends on knowing the losses that will be experienced from being wrong under the problem's particular circumstances.^{[7]}
A common example involves estimating "location". Under typical statistical assumptions, the mean or average is the statistic for estimating location that minimizes the expected loss experienced under the squarederror loss function, while the median is the estimator that minimizes expected loss experienced under the absolutedifference loss function. Still different estimators would be optimal under other, less common circumstances.
In economics, when an agent is risk neutral, the objective function is simply expressed as the expected value of a monetary quantity, such as profit, income, or endofperiod wealth.
But for riskaverse (or riskloving) agents, loss is measured as the negative of a utility function, and the objective function to be optimized is the expected value of utility.
Other measures of cost are possible, for example mortality or morbidity in the field of public health or safety engineering.
For most optimization algorithms, it is desirable to have a loss function that is globally continuous and differentiable.
Two very commonly used loss functions are the squared loss, , and the absolute loss, . However the absolute loss has the disadvantage that it is not differentiable at . The squared loss has the disadvantage that it has the tendency to be dominated by outliers—when summing over a set of 's (as in ), the final sum tends to be the result of a few particularly large avalues, rather than an expression of the average avalue.
The choice of a loss function is not arbitrary. It is very restrictive and sometimes the loss function may be characterized by its desirable properties.^{[8]} Among the choice principles are, for example, the requirement of completeness of the class of symmetric statistics in the case of i.i.d. observations, the principle of complete information, and some others.
W. Edwards Deming and Nassim Nicholas Taleb argue that empirical reality, not nice mathematical properties, should be the sole basis for selecting loss functions, and real losses often aren’t mathematically nice and aren’t differentiable, continuous, symmetric, etc. For example, a person who arrives before a plane gate closure can still make the plane, but a person who arrives after can not, a discontinuity and asymmetry which makes arriving slightly late much more costly than arriving slightly early. In drug dosing, the cost of too little drug may be lack of efficacy, while the cost of too much may be tolerable toxicity, another example of assymmetry. Traffic, pipes, beams, ecologies, climates, etc. may tolerate increased load or stress with little noticeable change up to a point, then become backed up or break catastrophically. These situations, Deming and Taleb argue, are common in reallife problems, perhaps more common than classical smooth, continuous, symmetric, differentials cases.^{[citation needed]}
Taleb especially has argued that the practice of selecting loss functions based on mathematical niceness rather than actual loss experience is not really different from selecting data based on niceness rather than empirical observation, in other words, Taleb has argued that it should be considered a kind of scientific fraud.^{[citation needed]}
See also
 Bayesian regret
 Loss functions for classification
 Discounted maximum loss
 Hinge loss
 Scoring rule
 Statistical risk
References
 ^ Wald, A. (1950). Statistical Decision Functions. Wiley.
 ^ Cramér, H. (1930). On the mathematical theory of risk. Centraltryckeriet.
 ^ Nikulin, M.S. (2001) [1994], "Risk of a statistical procedure", in Hazewinkel, Michiel (ed.), Encyclopedia of Mathematics, Springer Science+Business Media B.V. / Kluwer Academic Publishers, ISBN 9781556080104
 ^ Berger, James O. (1985). Statistical decision theory and Bayesian Analysis (2nd ed.). New York: SpringerVerlag. ISBN 9780387960982. MR 0804611.
 ^ DeGroot, Morris (2004) [1970]. Optimal Statistical Decisions. Wiley Classics Library. ISBN 9780471680291. MR 2288194.
 ^ Robert, Christian P. (2007). The Bayesian Choice. Springer Texts in Statistics (2nd ed.). New York: Springer. doi:10.1007/0387715991. ISBN 9780387952314. MR 1835885.
 ^ Pfanzagl, J. (1994). Parametric Statistical Theory. Berlin: Walter de Gruyter. ISBN 9783110138634.
 ^ Detailed information on mathematical principles of the loss function choice is given in Chapter 2 of the book Klebanov, B.; Rachev, Svetlozat T.; Fabozzi, Frank J. (2009). Robust and NonRobust Models in Statistics. New York: Nova Scientific Publishers, Inc. (and references there).
Further reading
 Aretz, Kevin; Bartram, Söhnke M.; Pope, Peter F. (April–June 2011). "Asymmetric Loss Functions and the Rationality of Expected Stock Returns". International Journal of Forecasting. 27 (2): 413–437. doi:10.1016/j.ijforecast.2009.10.008. SSRN 889323.
 Berger, James O. (1985). Statistical decision theory and Bayesian Analysis (2nd ed.). New York: SpringerVerlag. ISBN 9780387960982. MR 0804611.
 Cecchetti, Stephen G, 2000. "Making Monetary Policy: Objectives and Rules," Oxford Review of Economic Policy, Oxford University Press, vol. 16(4), pages 4359, Winter.
 Horowitz, Ann R., 1987. "Loss functions and public policy," Journal of Macroeconomics, Elsevier, vol. 9(4), pages 489504.
 Waud, Roger N, 1976. "Asymmetric Policymaker Utility Functions and Optimal Policy Under Uncertainty," Econometrica, Econometric Society, vol. 44(1), pages 5366, January.