by 라인하트 Nov 02. 2020

Andrew Ng's Machine Learning (10-5): Regularization and Bias/Variance

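Professor Andrew Ng, co-founder of the online course platform Coursera, is a giant of the AI field. The lectures he gave at Stanford University for machine learning beginners are available for free, as is, on Coursera (Coursera.org). The course is a must for anyone starting out in machine learning, and one you naturally run into when studying AI and machine learning on your own.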

Advice for Applying Machine Learning

Bias vs. Variance

Regularization and Bias / Variance

You've seen how regularization can help prevent overfitting. But how does it affect the bias and variance of a learning algorithm? In this video I'd like to go deeper into the issue of bias and variance and talk about how it interacts with and is affected by the regularization of your learning algorithm.

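Regularization is generally used to prevent overfitting. But how does regularization affect the bias and variance of a learning algorithm? This lecture looks more closely at bias and variance and at how they interact with the regularization of a learning algorithm.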

Suppose we're fitting a high-order polynomial, like that shown here, but to prevent overfitting we need to use regularization, like that shown here. So we have this regularization term to try to keep the values of the parameters small. And as usual, the regularization sum runs from j = 1 to m, rather than j = 0 to m. Let's consider three cases.

The first is the case of a very large value of the regularization parameter lambda, such as if lambda were equal to 10,000. Some huge value. In this case, all of these parameters, theta 1, theta 2, theta 3, and so on, would be heavily penalized, and so we end up with most of these parameter values being close to zero. And the hypothesis will be roughly h of x just equal, or approximately equal, to theta zero. So we end up with a hypothesis that more or less looks like that, more or less a flat, constant straight line. And so this hypothesis has high bias and it badly underfits this data set, so the horizontal straight line is just not a very good model for this data set.

At the other extreme is if we have a very small value of lambda, such as if lambda were equal to zero. In that case, given that we're fitting a high-order polynomial, basically without regularization or with very minimal regularization, we end up with our usual high-variance, overfitting setting. This is basically if lambda is equal to zero, we're just fitting without regularization, so the hypothesis overfits.

And it's only if we have some intermediate value of lambda that is neither too large nor too small that we end up with parameters theta that give us a reasonable fit to this data. So, how can we automatically choose a good value for the regularization parameter?

Here is a high-order polynomial hypothesis and its cost function.

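A regularization term is added to the cost function J(θ) to prevent overfitting. The regularization term keeps the values of the parameters θ small. The sum Σ in the regularization term runs from j = 1 to m, not from j = 0 to m. Consider three cases.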
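
The hypothesis and cost function described here can be written out as follows. This is a reconstruction from the surrounding text rather than a copy of the original slide: the fourth-order polynomial is an assumption based on the parameters θ1 through θ4 named in this lecture, the 1/(2m) scaling is assumed from the course's usual convention, and the upper limit m on the regularization sum is kept as stated in the text.

```latex
h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
          + \frac{\lambda}{2m} \sum_{j=1}^{m} \theta_j^2
```

Because the second sum starts at j = 1, the intercept θ0 is not penalized.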

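The left graph is the case where the regularization parameter λ is very large, such as 10,000. All the parameters θ1, θ2, θ3, θ4 of the high-order polynomial hypothesis are heavily penalized, so their values approach 0. The hypothesis is then approximately hθ(x) = θ0, and its graph looks like a nearly flat, horizontal straight line. The hypothesis therefore has high bias and underfits the data set.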

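The right graph is the case where the regularization parameter λ is very small, such as 0. The high-order polynomial hypothesis overfits. With no regularization, or very little, a high-order polynomial has very high variance. When λ equals 0 the regularization term is 0, so the hypothesis overfits the data.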

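The middle graph is the case where the regularization parameter λ takes an intermediate value, neither too large nor too small. Here the parameters of the high-order polynomial take reasonable values and the hypothesis fits the data reasonably well. So how can we automatically choose a good value for the regularization parameter λ?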
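
The effect of λ on the parameters can be checked numerically. The following is a minimal sketch, not code from the lecture: the data set is made up, the helper names `poly_features` and `fit_regularized` are hypothetical, and the fit uses a closed-form solve for regularized linear regression instead of an iterative optimizer.

```python
import numpy as np

def poly_features(x, degree):
    """Return the columns [1, x, x^2, ..., x^degree] (hypothetical helper)."""
    return np.vander(x, degree + 1, increasing=True)

def fit_regularized(X, y, lam):
    """Minimize (1/2m)*sum(err^2) + (lam/2m)*sum(theta_j^2 for j >= 1)."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0  # theta_0 is not penalized
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

# Made-up data: a noisy sine curve, purely for illustration.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 12)
y = np.sin(3.0 * x) + 0.1 * rng.standard_normal(x.size)
X = poly_features(x, degree=4)

# Fit the same fourth-order polynomial with no, moderate, and huge regularization.
thetas = {lam: fit_regularized(X, y, lam) for lam in (0.0, 1.0, 10000.0)}
for lam, theta in thetas.items():
    print(f"lambda={lam:>8}: theta={np.round(theta, 4)}")
```

With λ = 10,000 the parameters θ1 through θ4 come out close to zero and θ0 ends up near the mean of y, which is the flat-line, high-bias hypothesis described above.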

Just to reiterate, here's our model, and here's our learning algorithm's objective. For the setting where we're using regularization, let me define J train(theta) to be something different, to be the optimization objective, but without the regularization term. Previously, in an earlier video, when we were not using regularization, I defined J train of theta to be the same as J of theta, the cost function. But when we're using regularization, with this extra regularization term, we're going to define J train, my training set error, to be just my sum of squared errors on the training set, or my average squared error on the training set, without taking into account that regularization term.

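To restate, here is the model and the learning algorithm's objective, the cost function J(θ). Jtrain(θ) is the optimization objective without the regularization term. In an earlier lecture, where regularization was not used, Jtrain(θ) was the same as J(θ). When regularization is used, Jtrain(θ) is defined as the average squared error (or the sum of squared errors) on the training set, without taking the regularization term into account.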

And similarly, I'm then also going to define the cross validation set error and the test set error, as before, to be the average sum of squared errors on the cross validation and the test sets. So just to summarize, my definitions of J train, J cv, and J test are just the average squared error, or one half of the average squared error, on the training, validation, and test sets, without the extra regularization term. So, this is how we can automatically choose the regularization parameter lambda.

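Similarly, the cross validation error and the test error are defined, as before, as the average squared error on the cross validation set and on the test set. So Jtrain(θ), Jcv(θ), and Jtest(θ) are all average squared errors without the regularization term.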
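
Written out, the three error measures look like this. This is a sketch of the definitions as described in the text; the symbols m_cv and m_test for the sizes of the cross validation and test sets, and the cv and test subscripts on the examples, are notation assumed here for illustration.

```latex
J_{train}(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

J_{cv}(\theta) = \frac{1}{2m_{cv}} \sum_{i=1}^{m_{cv}} \left( h_\theta(x_{cv}^{(i)}) - y_{cv}^{(i)} \right)^2

J_{test}(\theta) = \frac{1}{2m_{test}} \sum_{i=1}^{m_{test}} \left( h_\theta(x_{test}^{(i)}) - y_{test}^{(i)} \right)^2
```

None of the three contains the λ term: regularization shapes how θ is found, but it is left out when measuring how well θ fits each data set.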

So, here is how the regularization parameter λ can be chosen automatically.

So what I usually do is maybe have some range of values of lambda I want to try out. So I might be considering not using regularization, or here are a few values I might try: lambda = 0.01, 0.02, 0.04, and so on. And I usually step these up in multiples of two, until some maybe larger value. If I were to do these in multiples of 2, I'd end up with 10.24. It's not 10 exactly, but this is close enough, and the third or fourth decimal place won't affect your result that much. So, this gives me maybe 12 different models. And I'm trying to select a model corresponding to 12 different values of the regularization parameter lambda. And of course you can also go to values less than 0.01 or values larger than 10, but I've just truncated it here for convenience.

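So set up a range of values for the regularization parameter λ. Start from λ = 0, which means no regularization, and then try λ = 0.01, λ = 0.02, λ = 0.04, and so on. The values are usually set by doubling each time, up to 10.24. That is not exactly λ = 10, but it is close enough; the third or fourth decimal place has almost no effect on the result. This gives 12 models, one for each of 12 values of λ, and one of them will be selected. Values smaller than 0.01 or larger than 10 could also be tried; the list is truncated here for convenience.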
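
The list of candidate values is easy to generate. A small sketch, where the variable name `lambdas` is just an illustrative choice:

```python
# lambda = 0 (no regularization), then 0.01 doubled ten times up to 10.24.
lambdas = [0.0] + [0.01 * 2**k for k in range(11)]

print(len(lambdas))
print(lambdas)
```

This yields 12 candidates: 0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, and 10.24.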

Given these 12 models, what we can do is then the following. We can take this first model with lambda equals zero and minimize my cost function J of theta, and this will give me some parameter vector theta. And similar to the earlier video, let me just denote this as theta superscript (1). And then I can take my second model with lambda set to 0.01 and minimize my cost function, now using lambda equals 0.01 of course, to get some different parameter vector theta. Let me denote that theta(2). And for my third model I end up with theta(3). And so on, until for my final model with lambda set to 10, or 10.24, I end up with this theta(12).

Next, I can take all of these hypotheses, all of these parameters, and use my cross validation set to validate them. So I can look at my first model, my second model, fit with these different values of the regularization parameter, and evaluate them on my cross validation set, basically measuring the average squared error of each of these parameter vectors theta on my cross validation set. And I would then pick whichever one of these 12 models gives me the lowest error on the cross validation set. And let's say, for the sake of this example, that I end up picking theta(5), the fifth model, because that has the lowest cross validation error. Having done that, finally, what I would do if I wanted to report a test set error is to take the parameter theta(5) that I've selected and look at how well it does on my test set. So once again, here it is as if we've fit this parameter, theta, to my cross validation set, which is why I'm setting aside a separate test set that I'm going to use to get a better estimate of how well my parameter vector, theta, will generalize to previously unseen examples. So that's model selection applied to selecting the regularization parameter lambda.

Taking the first model, with λ = 0, and minimizing the cost function J(θ) on the training set gives a parameter vector θ^(1); the superscript 1 is added to tell the vectors apart. Next, taking the second model, with λ = 0.01, and minimizing J(θ) on the training set gives θ^(2). The third model, with λ = 0.02, gives θ^(3) in the same way. This continues through the twelfth model, with λ = 10 (or 10.24), which gives θ^(12).

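Next, every parameter vector θ is evaluated on the cross validation set, giving the cross validation errors Jcv(θ^(1)), Jcv(θ^(2)), Jcv(θ^(3)), ..., Jcv(θ^(12)). That is, the average squared error Jcv(θ) is measured for each parameter vector. Of the 12 models, the one with the lowest cross validation error is selected. Suppose here that the fifth model, θ^(5), is selected because it has the lowest cross validation error. Finally, θ^(5) is checked on the test set to see how well it performs. It is as if the parameter θ had been fit to the cross validation set, which is why a separate test set is set aside: it gives a better estimate of how well the parameter vector θ generalizes to previously unseen examples. This is how the regularization parameter λ is selected.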
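
The selection procedure above can be sketched end to end. This is an illustration under stated assumptions, not the lecture's code: the data is synthetic, the 60/20/20 split and the degree-8 polynomial are arbitrary choices, and `poly_features`, `fit_regularized`, and `squared_error` are hypothetical helpers. The fit uses a closed-form solve for regularized linear regression.

```python
import numpy as np

def poly_features(x, degree):
    """Return the columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def fit_regularized(X, y, lam):
    """Minimize the regularized cost J(theta); theta_0 is not penalized."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

def squared_error(theta, X, y):
    """Unregularized error: (1/2m) * sum of squared errors."""
    residual = X @ theta - y
    return float(residual @ residual) / (2 * y.size)

# Synthetic data, split 60/20/20 into training, cross validation, and test sets.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 60)
y = np.sin(3.0 * x) + 0.2 * rng.standard_normal(60)
X = poly_features(x, degree=8)
X_train, y_train = X[:36], y[:36]
X_cv, y_cv = X[36:48], y[36:48]
X_test, y_test = X[48:], y[48:]

# Step 1: fit one parameter vector per candidate lambda on the training set.
lambdas = [0.0] + [0.01 * 2**k for k in range(11)]
thetas = [fit_regularized(X_train, y_train, lam) for lam in lambdas]

# Step 2: evaluate each theta on the cross validation set, without the lambda term.
cv_errors = [squared_error(theta, X_cv, y_cv) for theta in thetas]

# Step 3: pick the lambda with the lowest cross validation error.
best = int(np.argmin(cv_errors))

# Step 4: report the error of the chosen model on the held-out test set.
test_error = squared_error(thetas[best], X_test, y_test)
print(f"chosen lambda = {lambdas[best]}, J_cv = {cv_errors[best]:.4f}, J_test = {test_error:.4f}")
```

The test set is touched only once, after λ has been chosen, so the reported J_test is not biased by the selection step.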

The last thing I'd like to do in this video is get a better understanding of how the cross validation and training errors vary as we vary the regularization parameter lambda. And so just a reminder, that was our original cost function J of theta. But for this purpose we're going to define the training error without using the regularization term, and the cross validation error without using the regularization term. And what I'd like to do is plot this Jtrain and plot this Jcv, meaning just how well does my hypothesis do on the training set and how well does my hypothesis do on the cross validation set.

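Finally, let us look at how the cross validation error and the training error change as the regularization parameter λ is varied. As a reminder, this is the original cost function J(θ). Here, though, the training error and the cross validation error are defined without the regularization term. Then Jtrain(θ) and Jcv(θ) are plotted, showing how well the hypothesis does on the training set and on the cross validation set as λ is varied.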

As I vary my regularization parameter lambda. So as we saw earlier, if lambda is small then we're not using much regularization and we run a larger risk of overfitting, whereas if lambda is large, that is, if we were on the right part of this horizontal axis, then with a large value of lambda we run the higher risk of having a bias problem. So if you plot J train and J cv, what you find is that, for small values of lambda, you can fit the training set relatively well because you're not regularizing. So, for small values of lambda, the regularization term basically goes away, and you're just minimizing pretty much just the squared errors. So when lambda is small, you end up with a small value for Jtrain, whereas if lambda is large, then you have a high bias problem, and you might not fit your training set that well, so you end up with a value up there. So Jtrain of theta will tend to increase when lambda increases, because a large value of lambda corresponds to high bias, where you might not even fit your training set that well, whereas a small value of lambda corresponds to a setting where you can really fit, let's say, a very high-degree polynomial to your data.

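Now vary the regularization parameter λ. When λ is small, there is little regularization and a greater risk of overfitting. When λ is large, toward the right of the horizontal axis, there is a higher risk of a bias problem. Plotting Jtrain(θ) and Jcv(θ) shows that for small values of λ the training set can be fit relatively well because there is hardly any regularization: the regularization term essentially vanishes and Jtrain(θ) is small. For large values of λ there is a high bias problem, the training set is not fit well, and Jtrain(θ) ends up high. So Jtrain(θ) tends to increase as λ increases, because a large λ corresponds to high bias, where even the training set is fit poorly, while a small λ lets a high-order polynomial fit the data closely.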

As for the cross validation error, we end up with a figure like this, where over here on the right, if we have a large value of lambda, we may end up underfitting, and so this is the bias regime. And so the cross validation error will be high. Let me just label that as Jcv(theta), because with high bias we won't be fitting well, we won't be doing well on the cross validation set. Whereas here on the left, this is the high variance regime, where if we have too small a value of lambda, then we may be overfitting the data. And so by overfitting the data, the cross validation error will also be high. And so, this is what the cross validation error and the training error may look like on a data set as we vary the regularization parameter lambda. And so once again, it will often be some intermediate value of lambda that is just right, or that works best in terms of having a small cross validation error or a small test set error.

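The training error is the blue line and the cross validation error is the pink line. On the right, in the high-bias regime, the hypothesis underfits the data and does not do well on the cross validation set. On the left, in the high-variance regime, the hypothesis overfits the data, and because it overfits, the cross validation error is also high. This is what the training error and the cross validation error look like as the regularization parameter λ varies. In other words, the value of λ that works best, in terms of a small cross validation error or a small test error, is some intermediate value.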
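
The two curves can be tabulated numerically instead of drawn. The sketch below is an illustration on made-up data, with hypothetical helper names and a closed-form solve for regularized linear regression; a real data set may produce noisier numbers than the idealized figure.

```python
import numpy as np

def poly_features(x, degree):
    """Return the columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def fit_regularized(X, y, lam):
    """Minimize the regularized cost J(theta); theta_0 is not penalized."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

def squared_error(theta, X, y):
    """Unregularized error: (1/2m) * sum of squared errors."""
    residual = X @ theta - y
    return float(residual @ residual) / (2 * y.size)

# Made-up training and cross validation sets drawn from the same noisy curve.
rng = np.random.default_rng(2)
x_train = rng.uniform(-1.0, 1.0, 30)
y_train = np.sin(3.0 * x_train) + 0.2 * rng.standard_normal(30)
x_cv = rng.uniform(-1.0, 1.0, 30)
y_cv = np.sin(3.0 * x_cv) + 0.2 * rng.standard_normal(30)
X_train = poly_features(x_train, degree=8)
X_cv = poly_features(x_cv, degree=8)

lambdas = [0.0] + [0.01 * 2**k for k in range(11)]
j_train, j_cv = [], []
for lam in lambdas:
    theta = fit_regularized(X_train, y_train, lam)       # trained with lambda
    j_train.append(squared_error(theta, X_train, y_train))  # measured without it
    j_cv.append(squared_error(theta, X_cv, y_cv))

print(f"{'lambda':>8} {'J_train':>10} {'J_cv':>10}")
for lam, jt, jc in zip(lambdas, j_train, j_cv):
    print(f"{lam:>8.2f} {jt:>10.5f} {jc:>10.5f}")
```

In this setup J_train never decreases as λ grows, which matches the rising training curve described above. Whether J_cv shows a clear dip at an intermediate λ depends on the data, so the table is worth reading rather than assuming.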

And the curves I've drawn here are somewhat cartoonish and somewhat idealized, so on a real data set the curves you get may end up looking a little bit more messy and just a little bit more noisy than this. For some data sets you will really see these sorts of trends, and by looking at a plot of the hold-out cross validation error you can either manually or automatically try to select a point that minimizes the cross validation error and select the value of lambda corresponding to low cross validation error. When I'm trying to pick the regularization parameter lambda for a learning algorithm, often I find that plotting a figure like this one shown here helps me understand better what's going on and helps me verify that I am indeed picking a good value for the regularization parameter lambda.

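The graph drawn here is somewhat cartoonish and idealized. On a real data set the curves are a little messier and noisier, though on some data sets these kinds of trends really do show up. By plotting the cross validation error, you can pick, manually or automatically, the point that minimizes it and select the corresponding value of λ, that is, one with a low cross validation error. When choosing the regularization parameter λ for a learning algorithm, drawing a figure like the one shown here helps in understanding what is going on and in verifying that a good value of λ is indeed being chosen.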

So hopefully that gives you more insight into regularization and its effects on the bias and variance of a learning algorithm. By now you've seen bias and variance from a lot of different perspectives. And what I'd like to do in the next video is take all the insights we've gone through and build on them to put together a diagnostic that's called learning curves, which is a tool that I often use to diagnose if the learning algorithm may be suffering from a bias problem or a variance problem, or a little bit of both.

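I hope this gives you more insight into regularization and its effect on the bias and variance of a learning algorithm. So far, bias and variance have been studied from several different perspectives. The next lecture builds on these insights to put together a diagnostic tool called learning curves, which is often used to diagnose whether a learning algorithm suffers from a bias problem, a variance problem, or both.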

Andrew Ng's Machine Learning video lecture

Summary

This lecture worked with two definitions: the cost function, and the cost function with the regularization term added.

Bias or variance appears depending on the value of the regularization parameter λ.

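1) When the regularization parameter λ is very large, such as 10,000

All parameters θ1, θ2, θ3, θ4 are heavily penalized, so the hypothesis is approximately hθ(x) = θ0

The graph of the hypothesis is a nearly flat, horizontal straight line

High bias; underfits the data set

2) When the regularization parameter λ is very small, such as 0

The regularization term is 0 and the parameter values are left unpenalized

The curve passes through all the training data

High variance; overfits the data set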

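For a linear regression model with regularization, the training error Jtrain(θ) is defined as the average squared error (or the sum of squared errors) on the training set, without taking the regularization term into account.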

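To find a good value of the regularization parameter λ, try λ = 0, λ = 0.01, λ = 0.02, λ = 0.04, ..., λ = 10, minimizing J(θ) on the training set for each one to obtain a parameter vector θ. Then evaluate each θ on the cross validation set and select the value of λ with the smallest error.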

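The training error Jtrain(θ) increases as λ increases. The cross validation error Jcv(θ) is high when λ = 0, because variance is high, and also high when λ is large, because bias is high. In other words, the value of λ that works best, in terms of a small cross validation error or a small test error, is an intermediate value.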