by 라인하트 Oct 21. 2020

Andrew Ng's Machine Learning (7-2): Regularizing the Cost Function

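Professor Andrew Ng, co-founder of the online learning platform Coursera, is a leading figure in the AI field. The introductory machine learning course he taught at Stanford University is available for free on Coursera (Coursera.org). It is an essential course for machine learning beginners, and one you naturally run into when studying AI and machine learning on your own.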

Regularization

Solving the Problem of Overfitting

Cost Function

In this video, I'd like to convey to you the main intuitions behind how regularization works. And we'll also write down the cost function that we'll use when we are using regularization. With the hand-drawn examples that we have on these slides, I think I'll be able to convey part of the intuition. But an even better way to see for yourself how regularization works is to implement it and see it work for yourself. And if you do the appropriate exercises after this, you get the chance to see regularization in action for yourself.

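This lecture explains the intuition behind how regularization works, along with the cost function used with regularization. The hand-drawn graphs help build that intuition, but the best way to understand how regularization works is to implement it yourself. Working through the exercises afterward gives you the chance to see regularization in action.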

So, here is the intuition. In the previous video, we saw that if we were to fit a quadratic function to this data, it gives us a pretty good fit to the data. Whereas, if we were to fit an overly high-order polynomial, we end up with a curve that may fit the training set very well, but that really overfits the data and does not generalize well.

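Here is the housing price prediction example, using the figures from the previous lecture. The left figure shows a quadratic hypothesis that fits the data set well; the right figure shows a high-order polynomial hypothesis that fits it too closely. The high-order polynomial hypothesis overfits and does not generalize.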

Consider the following: suppose we were to penalize, and make the parameters theta 3 and theta 4 really small. Here's what I mean. Here is our optimization objective, or our optimization problem, where we minimize our usual squared error cost function. Let's say I take this objective and modify it and add to it plus 1000 theta 3 squared, plus 1000 theta 4 squared. 1000, I am just writing down as some huge number. Now, if we were to minimize this function, the only way to make this new cost function small is if theta 3 and theta 4 are small, right? Because otherwise, if you have a thousand times theta 3 squared, this new cost function is going to be big. So when we minimize this new function we are going to end up with theta 3 close to 0 and theta 4 close to 0, and it's as if we're getting rid of these two terms over there. And if we do that, well then, if theta 3 and theta 4 are close to 0, then we are being left with a quadratic function, and so we end up with a fit to the data that's, you know, a quadratic function plus maybe tiny contributions from the small terms theta 3 and theta 4, which may be very close to 0. And so we end up with essentially a quadratic function, which is good, because this is a much better hypothesis. In this particular example, we looked at the effect of penalizing two of the parameter values for being large.

So, to avoid overfitting, we penalize the parameters θ3 and θ4 so that they become very small. Here is the optimization objective: minimize the usual squared-error cost function.

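We modify the cost function J(θ) by adding 1,000*θ3^2 and 1,000*θ4^2. Here 1,000 simply stands for some very large number. The only way to keep this new cost function small is for θ3 and θ4 to be close to 0; otherwise the terms 1,000*θ3^2 and 1,000*θ4^2 make the cost large, and the minimum cannot be reached. The effect is as if the two terms had been removed from the hypothesis.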

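When θ3 and θ4 in the right-hand figure's hypothesis hθ(x) are close to 0, the hypothesis is approximately hθ(x) = θ0 + θ1x + θ2x^2, a quadratic function, so we obtain a hypothesis that fits the data well. In other words, with θ3 and θ4 near 0, their terms contribute very little to the prediction, and the result is essentially a quadratic. This is a much better hypothesis. That is the effect of heavily penalizing large values of those two parameters.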
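The θ3/θ4 penalty described above can be tried numerically. The following is a minimal sketch, not the course's own code: the toy data, the helper name `fit`, and the closed-form solve are assumptions made for illustration. It fits a quartic hypothesis twice, once with the plain squared-error cost and once with 1000*θ3^2 + 1000*θ4^2 added.

```python
import numpy as np

# Hypothetical toy data: a quadratic trend plus a small bounded wiggle.
x = np.linspace(0.0, 3.0, 12)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.3 * np.sin(5.0 * x)
m = len(x)

# Design matrix for h(x) = θ0 + θ1 x + θ2 x^2 + θ3 x^3 + θ4 x^4.
X = np.vander(x, 5, increasing=True)

def fit(penalty):
    """Minimize (1/2m) * sum((X θ - y)^2) + sum(penalty[j] * θ_j^2).

    Setting the gradient to zero gives the linear system
    (XᵀX / m + 2 * diag(penalty)) θ = Xᵀy / m.
    """
    A = X.T @ X / m + 2.0 * np.diag(penalty)
    b = X.T @ y / m
    return np.linalg.solve(A, b)

# No penalty: the ordinary least-squares quartic fit.
theta_plain = fit(np.zeros(5))

# Penalize only θ3 and θ4, with 1000 standing in for "some huge number".
theta_penalized = fit(np.array([0.0, 0.0, 0.0, 1000.0, 1000.0]))

print("plain    :", np.round(theta_plain, 4))
print("penalized:", np.round(theta_penalized, 4))
```

In the penalized fit, θ3 and θ4 come out very close to 0, so the fitted curve is essentially the quadratic θ0 + θ1x + θ2x^2.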

More generally, here is the idea behind regularization. The idea is that having small values for the parameters will usually correspond to having a simpler hypothesis. So, for our last example, we penalized just theta 3 and theta 4, and when both of these were close to zero, we wound up with a much simpler hypothesis that was essentially a quadratic function. But more broadly, if we penalize all the parameters, we can usually think of that as trying to give us a simpler hypothesis as well, because when, you know, these parameters are close to zero, in this example, that gave us a quadratic function.

But more generally, it is possible to show that having smaller values of the parameters usually corresponds to smoother, simpler functions as well, which are therefore also less prone to overfitting. I realize that the reasoning for why having all the parameters be small corresponds to a simpler hypothesis may not be entirely clear to you right now. And it is kind of hard to explain unless you implement it yourself and see it for yourself. But I hope that the example of having theta 3 and theta 4 be small, and how that gave us a simpler hypothesis, helps to at least give some intuition as to why this might be true.

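In general, regularization simplifies the hypothesis by keeping parameter values small. In the example, only θ3 and θ4 were penalized in the cost function J(θ); with both near 0, the hypothesis hθ(x) became essentially a simple quadratic. More broadly, penalizing all the parameters can likewise produce a simpler hypothesis, just as the quartic was simplified to a quadratic.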

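In general, the smaller the parameter values θ, the smoother and simpler the curve. Keeping all the parameters θ small reduces the likelihood of overfitting. This is hard to appreciate without implementing it and checking for yourself, but the example gives some intuition: when θ3 and θ4 are very small, the quartic hypothesis becomes essentially quadratic, and the complicated curve becomes a simple one.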

Let's look at a specific example. For housing price prediction we may have the hundred features that we talked about, where maybe x1 is the size, x2 is the number of bedrooms, x3 is the number of floors, and so on. And we may have a hundred features. And unlike the polynomial example, we don't know, right, we don't know that theta 3 and theta 4 are the high-order polynomial terms. So, if we have just a bag, if we have just a set of a hundred features, it's hard to pick in advance which are the ones that are less likely to be relevant. So we have a hundred or a hundred and one parameters. And we don't know which ones to pick, we don't know which parameters to try to shrink.

So, in regularization, what we're going to do is take our cost function, here's my cost function for linear regression. And what I'm going to do is modify this cost function to shrink all of my parameters, because, you know, I don't know which one or two to try to shrink. So I am going to modify my cost function to add a term at the end. Like so; we have square brackets here as well. I add an extra regularization term at the end to shrink every single parameter, and so this term will tend to shrink all of my parameters theta 1, theta 2, theta 3, up to theta 100. By the way, by convention the summation here starts from one, so I am not actually going to penalize theta zero for being large. That's sort of the convention, that the sum is from one through n, rather than from zero through n. But in practice, it makes very little difference; whether you include theta zero or not makes very little difference to the results. But by convention, usually, we regularize only theta 1 through theta 100.

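Here is the housing price prediction example. Suppose there are 100 features: x1 is the size of the house, x2 is the number of bedrooms, and so on. It is hard to pick out in advance which features are less relevant to the price. Unlike the polynomial example, we do not know which parameters belong to high-order terms; there are 100 parameters matching the 100 features (101 counting θ0), and we do not know which ones to pick and shrink.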

Here is the cost function J(θ) for linear regression. We modify J(θ) to shrink the parameter values θ. Since we do not know which ones to shrink, we add a regularization term that applies to all of them.

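The new regularization term shrinks every parameter θ1, θ2, θ3, ..., θ100. By convention, the regularization term starts at index 1, so θ0 is not penalized. That is why the summation runs from 1 to n rather than from 0 to n. In practice, including θ0 or not makes very little difference to the result, but by convention only θ1 through θ100 are regularized.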
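The slide with the formula is not reproduced in this post. As a reconstruction under that caveat, the regularized linear regression cost function described here is commonly written as follows, where m is the number of training examples, n is the number of features, and λ is the regularization parameter. The square brackets and the sum starting at j = 1 (so θ0 is not penalized) match the description in the lecture.

```latex
J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta\!\left(x^{(i)}\right) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^{n}\theta_j^2\right]
```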

Writing down our regularized optimization objective, our regularized cost function, again, here it is. Here's J of theta, where this term on the right is a regularization term and lambda here is called the regularization parameter. What lambda does is control a trade-off between two different goals. The first goal, captured by the first term of the objective, is that we would like to fit the training set well. And the second goal is that we want to keep the parameters small, and that's captured by the second term, the regularization term. And what lambda, the regularization parameter, does is control the trade-off between these two goals: between the goal of fitting the training set well and the goal of keeping the parameters small, and therefore keeping the hypothesis relatively simple to avoid overfitting.

Here is the regularized optimization objective, that is, the regularized cost function J(θ).

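The term on the right is the regularization term, and λ (lambda) is the regularization parameter. The regularization parameter λ controls the trade-off between two goals. The first goal, captured by the first term, is to find parameters θ that fit the training set well. The second goal, captured by the second term, is to keep the parameters θ small. λ balances the two: fitting the training set well versus keeping the parameters small. In this way the regularization term keeps the hypothesis simple and helps it avoid overfitting.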
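The two terms of the regularized cost can be written out directly. This is a minimal sketch, assuming the usual course form J(θ) = (1/2m)[Σ(hθ(x) − y)^2 + λ Σ θj^2] with the penalty starting at θ1; the function name `regularized_cost` and the tiny data set are made up for illustration.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized linear regression cost.

    X is the design matrix whose first column is all ones (for θ0),
    y holds the targets, and lam is the regularization parameter λ.
    """
    m = len(y)
    residuals = X @ theta - y
    fit_term = np.sum(residuals ** 2)         # goal 1: fit the training set
    reg_term = lam * np.sum(theta[1:] ** 2)   # goal 2: keep θ1..θn small (θ0 skipped)
    return (fit_term + reg_term) / (2.0 * m)

# Hypothetical three-example data set with a single feature.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([0.5, 1.0])

print(regularized_cost(theta, X, y, lam=0.0))  # 0.125 (fit term only)
print(regularized_cost(theta, X, y, lam=3.0))  # 0.625 (fit term plus penalty on θ1)
```

With λ = 0 only the fit term remains; raising λ makes the same θ more expensive, which is how λ shifts the balance toward small parameters.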

For our housing price prediction example, whereas previously, if we had fit a very high-order polynomial, we may have wound up with a very, sort of, wiggly or curvy function like this. If you still fit a high-order polynomial with all the polynomial features in there, but instead you just make sure to use this sort of regularized objective, then what you can get out is in fact a curve that isn't quite a quadratic function, but is much smoother and much simpler, maybe a curve like the magenta line that, you know, gives a much better hypothesis for this data. Once again, I realize it can be a bit difficult to see why shrinking the parameters can have this effect, but if you implement regularization yourself you will be able to see this effect firsthand.

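Here the hypotheses for the housing price example are plotted. A hypothesis using all the high-order polynomial terms fits the training set but draws a very wiggly curve. With the regularization term added to the cost function J(θ), the fitted curve is not quite a quadratic, but it is smooth and simple, and it gives a better hypothesis for this data. Once again, it can be a little hard to see why shrinking the parameters θ has this effect; implementing regularization yourself lets you see it firsthand.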

In regularized linear regression, if the regularization parameter lambda is set to be very large, then what will happen is we will end up penalizing the parameters theta 1, theta 2, theta 3, theta 4 very highly. That is, if our hypothesis is this one down at the bottom, and if we end up penalizing theta 1, theta 2, theta 3, theta 4 very heavily, then we end up with all of these parameters close to zero, right? Theta 1 will be close to zero; theta 2 will be close to zero. Theta 3 and theta 4 will end up being close to zero. And if we do that, it's as if we're getting rid of these terms in the hypothesis, so that we're just left with a hypothesis that says that housing prices are equal to theta zero, and that is akin to fitting a flat horizontal straight line to the data. And this is an example of underfitting, and in particular this hypothesis, this straight line, just fails to fit the training set well. It's just a flat straight line; it doesn't go anywhere near most of the training examples. And another way of saying this is that this hypothesis has too strong a preconception, or too high a bias, that housing prices are just equal to theta zero, and despite the clear data to the contrary, it chooses to fit a sort of flat line, just a flat horizontal line. I didn't draw that very well. This is just a horizontal flat line fit to the data. So for regularization to work well, some care should be taken to choose a good value for the regularization parameter lambda as well. And when we talk about model selection later in this course, we'll talk about a variety of ways for automatically choosing the regularization parameter lambda as well.

Suppose that, in regularized linear regression, the regularization parameter λ is set to a very large value, so that the parameters θ1, θ2, θ3, θ4 are penalized very heavily.

As a result, the parameters θ1, θ2, θ3, θ4 all end up close to 0, and only the constant θ0 remains in the hypothesis hθ(x). The housing price hypothesis becomes a flat horizontal line at the value θ0. It does not even fit the training set: this is underfitting. The hypothesis has too strong a preconception, or too high a bias, that housing prices simply equal θ0, and it fits a flat line despite clear data to the contrary. So, for regularization to work well, the regularization parameter λ (lambda) must be chosen with care. Later in the course, the discussion of model selection covers a variety of ways to choose λ automatically.
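The effect of a very large λ can be checked with a small experiment. The sketch below is an illustration under assumptions, not the course's exercise code: the toy data, the helper names `fit_regularized` and `training_cost`, and the closed-form solve (XᵀX + λL)θ = Xᵀy, where L is the identity matrix with its θ0 entry zeroed, are all chosen for this example.

```python
import numpy as np

# Hypothetical toy data with inputs scaled to [0, 1].
x = np.linspace(0.0, 1.0, 20)
y = 2.0 + 3.0 * x - 2.0 * x**2 + 0.2 * np.sin(12.0 * x)
m = len(x)

# Quartic hypothesis: columns are 1, x, x^2, x^3, x^4.
X = np.vander(x, 5, increasing=True)

def fit_regularized(lam):
    """Minimize (1/2m) * [sum((X θ - y)^2) + lam * sum_{j>=1} θ_j^2]."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0  # by convention θ0 is not penalized
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

def training_cost(theta):
    """Unregularized squared-error cost on the training set."""
    return np.sum((X @ theta - y) ** 2) / (2.0 * m)

for lam in (0.0, 1.0, 1e8):
    theta = fit_regularized(lam)
    print(f"lambda={lam:g}  theta={np.round(theta, 4)}  cost={training_cost(theta):.4f}")

theta_huge = fit_regularized(1e8)
```

With λ = 1e8, θ1 through θ4 are driven to nearly 0 and θ0 settles at about the mean of y, so the hypothesis is a flat horizontal line and the training cost rises: the underfitting case described above.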

So, that's the idea behind regularization and the cost function we'll use in order to apply it. In the next two videos, let's take these ideas and apply them to linear regression and to logistic regression, so that we can then get them to avoid overfitting.

That covers the cost function for regularization. The next two lectures apply it to linear regression and logistic regression to prevent overfitting.

Andrew Ng's Machine Learning video lecture

Summary

When the hypothesis is built from an excessively high-order polynomial, it produces an irregular curve that fits the training set very closely, which causes overfitting.

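In practice, however, we cannot tell in advance which features' parameters to shrink, so we penalize all the parameters to obtain a simpler hypothesis. The smaller the values of θ, the smoother and simpler the curve, and the lower the risk of overfitting. The way to penalize every term is to add a regularization term to the cost function J(θ).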

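The regularization parameter λ controls the trade-off between two goals. The first goal, captured by the first term, is to find parameters θ that fit the training set well. The second goal, captured by the second term, is to keep the parameters θ small.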