by 라인하트 Oct 21. 2020

Andrew Ng's Machine Learning (7-1): The Problem of Overfitting

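Professor Andrew Ng, co-founder of the online learning platform Coursera, is a leading figure in the artificial intelligence field. The lectures he gave to machine learning beginners at Stanford University are available for free as an online course on Coursera (Coursera.org). The course is essential for anyone starting out in machine learning, and it is one that people studying AI and machine learning on their own naturally come across.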

Regularization

Solving the Problem of Overfitting

The Problem of Overfitting

By now, you've seen a couple of different learning algorithms, linear regression and logistic regression. They work well for many problems, but when you apply them to certain machine learning applications, they can run into a problem called overfitting that can cause them to perform very poorly. What I'd like to do in this video is explain what this overfitting problem is, and in the next few videos after this, we'll talk about a technique called regularization that will allow us to ameliorate or reduce this overfitting problem and get these learning algorithms to work much better. So what is overfitting?

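So far, we have studied a couple of learning algorithms. Linear regression and logistic regression apply to many machine learning problems, but they can run into overfitting, which hurts an algorithm's performance. This lecture covers the concept of overfitting, and the following lectures cover regularization, a technique that reduces overfitting and improves the performance of learning algorithms. So what is overfitting?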

Let's keep using our running example of predicting housing prices with linear regression where we want to predict the price as a function of the size of the house.

Here is the housing price example from linear regression: a function that predicts the price of a house from its size.

One thing we could do is fit a linear function to this data, and if we do that, maybe we get that sort of straight-line fit to the data. But this isn't a very good model. Looking at the data, it seems pretty clear that as the size of the house increases, the housing prices plateau, or kind of flatten out as we move to the right, and so this algorithm does not fit the training set well. We call this problem underfitting, and another term for this is that this algorithm has high bias. Both of these roughly mean that it's just not even fitting the training data very well. The term is kind of a historical or technical one, but the idea is that if we're fitting a straight line to the data, then it's as if the algorithm has a very strong preconception, or a very strong bias, that housing prices are going to vary linearly with their size, despite the data to the contrary. Despite the evidence to the contrary, its preconception, or bias, still causes it to fit a straight line, and this ends up being a poor fit to the data.

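We fit a linear function to the data. The hypothesis in the left figure is a first-degree function, so it draws a straight line through the data. This does not look like a good model. In the data, price rises with size at first but flattens out beyond a certain size, so the algorithm does not fit the training set. This is called underfitting (Underfit) or high bias (High Bias). Both terms mean that the hypothesis does not fit the data well. The terminology is historical. Regardless of how the size and price data are actually distributed, the hypothesis assumes that price simply increases linearly with size, so the algorithm has a strong preconception (Strong Preconception), or strong bias (Strong Bias). Even though the data contradict that assumption, the bias forces a straight line onto the data, and the result is a poor fit.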

Now, in the middle, we could fit a quadratic function instead, and with this data set, if we fit the quadratic function, maybe we get that kind of curve, and that works pretty well. At the other extreme would be if we were to fit, say, a fourth-order polynomial to the data. So here we have five parameters, theta zero through theta four, and with that, we can actually fit a curve that passes through all five of our training examples. You might get a curve that looks like this. That, on the one hand, seems to do a very good job fitting the training set, in that it passes through all of my data, at least. But this is still a very wiggly curve, right? It's going up and down all over the place, and we don't actually think that's such a good model for predicting housing prices. So this problem we call overfitting, and another term for this is that this algorithm has high variance. The term high variance is another historical or technical one. But the intuition is that if we're fitting such a high-order polynomial, then the hypothesis can fit, you know, it's almost as if it can fit almost any function, and this space of possible hypotheses is just too large, too variable. And we don't have enough data to constrain it to give us a good hypothesis, so that's called overfitting.

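The hypothesis in the middle figure is a quadratic function, and it draws a curve that fits the data. This works quite well. The right figure is the other extreme: the hypothesis is a fourth-order polynomial that draws an irregular curve through the data. With five parameters, θ0 through θ4, the curve can pass exactly through all five training examples. The algorithm fits the training set very well, but the curve is so wiggly that it is unlikely to predict housing prices properly. This is called overfitting (Overfit) or high variance (High Variance). High variance is also a historical term. A hypothesis built from a high-order polynomial can fit almost any data set. Overfitting occurs when there is not enough data to constrain such a highly variable hypothesis.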

And in the middle, there isn't really a name, but I'm just going to write, you know, "just right," where a second-degree polynomial, a quadratic function, seems to be just right for fitting this data.

The middle figure has no special name, but we say it fits the data "just right" (Just right). A quadratic curve is appropriate for this data.
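The three fits described above can be reproduced in a small sketch. The five data points are made up for illustration (the lecture gives no numbers), and `training_error` is a hypothetical helper name. It fits polynomials of degree 1, 2, and 4 with NumPy and reports the mean squared error on the training set.

```python
import numpy as np

# Hypothetical training set (an assumption): five houses whose price
# roughly plateaus as size grows, with a little noise.
sizes = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
prices = np.array([1.0, 2.9, 3.0, 4.1, 3.8])

def training_error(degree):
    """Fit a polynomial of the given degree; return its mean squared training error."""
    coeffs = np.polyfit(sizes, prices, degree)
    predictions = np.polyval(coeffs, sizes)
    return float(np.mean((predictions - prices) ** 2))

# Degree 1: straight line, degree 2: quadratic, degree 4: five parameters for five points.
errors = {degree: training_error(degree) for degree in (1, 2, 4)}
for degree, err in errors.items():
    print(f"degree {degree}: training MSE = {err:.6f}")
```

The training error shrinks as the degree grows and is essentially zero for degree 4, which is why a low training error on its own says nothing about whether the fit is "just right" or overfit.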

To recap a bit, the problem of overfitting comes up when we have too many features: the learned hypothesis may fit the training set very well, so your cost function may actually be very close to zero, or maybe even exactly zero, but you may then end up with a curve like this that, you know, tries too hard to fit the training set, so that it fails to generalize to new examples and fails to predict prices on new examples as well. Here the term generalize refers to how well a hypothesis applies even to new examples, that is, to data on houses that it has not seen in the training set.

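To summarize, overfitting (Overfit) occurs when there are too many features and the hypothesis fits the training set too closely. The cost function on the training set is zero or very close to zero, but the hypothesis draws a very complicated curve. As a result, it fits the training set almost perfectly yet fails to predict new examples well. Generalization (Generalize) refers to how well a hypothesis fits new data, where new data means houses, with their sizes and prices, that are not in the training set.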
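A rough way to see what failing to generalize means is to compare the error on the training set with the error on houses the model has never seen. The sketch below assumes made-up numbers for both sets, and `mse` is a hypothetical helper. It fits a quadratic and a fourth-order polynomial to the same five training points and evaluates both on four new points.

```python
import numpy as np

# Hypothetical training set (an assumption): five (size, price) pairs.
train_sizes = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
train_prices = np.array([1.0, 2.9, 3.0, 4.1, 3.8])

# Hypothetical new houses that were not used for fitting.
new_sizes = np.array([1.5, 2.5, 3.5, 4.5])
new_prices = np.array([2.0, 3.0, 3.7, 4.0])

def mse(coeffs, sizes, prices):
    """Mean squared error of a fitted polynomial on the given houses."""
    return float(np.mean((np.polyval(coeffs, sizes) - prices) ** 2))

results = {}
for degree in (2, 4):
    coeffs = np.polyfit(train_sizes, train_prices, degree)
    results[degree] = {
        "train": mse(coeffs, train_sizes, train_prices),
        "new": mse(coeffs, new_sizes, new_prices),
    }
    print(f"degree {degree}: train MSE = {results[degree]['train']:.4f}, "
          f"new-data MSE = {results[degree]['new']:.4f}")
```

With these numbers, the fourth-order fit reaches a training error of essentially zero but does noticeably worse than the quadratic on the new houses, which is the pattern the lecture calls overfitting.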

On this slide, we looked at overfitting for the case of linear regression. A similar thing can apply to logistic regression as well. Here is a logistic regression example with two features, x1 and x2. One thing we could do is fit logistic regression with just a simple hypothesis like this, where, as usual, g is my sigmoid function. And if you do that, you end up with a hypothesis trying to use, maybe, just a straight line to separate the positive and the negative examples. And this doesn't look like a very good fit to the data. So, once again, this is an example of underfitting, or of the hypothesis having high bias.

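Overfitting occurs in logistic regression as well as in linear regression. Here is a logistic regression example with two features, x1 and x2. In the left figure, the argument of the sigmoid function g(z) in the logistic regression hypothesis is a simple first-degree function. The hypothesis separates the positive class from the negative class with a straight line, which does not fit the data well. This is underfitting (Underfit), or high bias (High Bias).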

In contrast, if you were to add to your features these quadratic terms, then, you could get a decision boundary that might look more like this. And, you know, that's a pretty good fit to the data. Probably, about as good as we could get, on this training set.

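In the middle figure, quadratic terms are added to the argument of the sigmoid function g(z), producing a decision boundary shaped like a quadratic curve. It fits the data well and is probably about the best fit for this training set.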

And finally, at the other extreme, if you were to fit a very high-order polynomial, if you were to generate lots of high-order polynomial terms of the features, then logistic regression may contort itself, may try really hard to find a decision boundary that fits your training data, or go to great lengths to contort itself to fit every single training example well. And, you know, if the features x1 and x2 are for predicting, maybe, whether breast tumors are malignant or benign, this really doesn't look like a very good hypothesis for making predictions. And so, once again, this is an instance of overfitting, and of a hypothesis having high variance and being unlikely to generalize well to new examples.

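The right figure is the other extreme. A logistic regression hypothesis built from many high-order polynomial terms produces a decision boundary with many small twists that fits the training data closely. The boundary contorts itself at great length to fit every single training example. If the features x1 and x2 are used to decide whether a tumor is malignant or benign, this hypothesis makes poor predictions. This is overfitting (Overfit), or high variance (High Variance): the hypothesis does not generalize well to new examples.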
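The three logistic regression hypotheses differ only in which polynomial terms of x1 and x2 are passed to the sigmoid g. As a minimal sketch, the helper names `map_features` and `predict_probability` below are hypothetical and not taken from the lecture. They build every monomial up to a chosen degree and show how quickly the number of parameters grows, which is what gives a high-order hypothesis room to contort its decision boundary.

```python
import numpy as np

def sigmoid(z):
    """The logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def map_features(x1, x2, degree):
    """Return all monomials x1^i * x2^j with i + j <= degree (bias term included)."""
    x1 = np.atleast_1d(np.asarray(x1, dtype=float))
    x2 = np.atleast_1d(np.asarray(x2, dtype=float))
    columns = []
    for total in range(degree + 1):
        for j in range(total + 1):
            columns.append((x1 ** (total - j)) * (x2 ** j))
    return np.stack(columns, axis=1)  # shape: (examples, features)

def predict_probability(theta, x1, x2, degree):
    """Hypothesis h(x) = g(theta^T * mapped features)."""
    return sigmoid(map_features(x1, x2, degree) @ theta)

# Number of parameters for a linear, a quadratic, and a sixth-order boundary.
feature_counts = {d: map_features(0.5, -0.2, d).shape[1] for d in (1, 2, 6)}
print(feature_counts)  # {1: 3, 2: 6, 6: 28}
```

Degree 1 gives the straight-line boundary, degree 2 adds the quadratic terms, and a degree such as 6 (chosen here arbitrarily) already needs 28 parameters for just two raw features.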

Later in this course, when we talk about debugging and diagnosing things that can go wrong with learning algorithms, we'll give you specific tools to recognize when overfitting and also when underfitting may be occurring. But for now, let's talk about the problem of, if we think overfitting is occurring, what can we do to address it? In the previous examples, we had one- or two-dimensional data, so we could just plot the hypothesis, see what was going on, and select the appropriate degree of polynomial. So, earlier, for the housing prices example, we could just plot the hypothesis and, you know, maybe see that it was fitting the sort of very wiggly function that goes all over the place to predict housing prices. And we could then use figures like these to select an appropriate degree of polynomial. So plotting the hypothesis could be one way to try to decide what degree of polynomial to use.

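Later in this course, we will cover debugging and diagnostic tools for recognizing whether a learning algorithm is overfitting or underfitting. For now, the question is what to do when we suspect overfitting. The examples so far used one- or two-dimensional data, so we could simply plot the hypothesis, see how it behaves, and choose a polynomial of appropriate degree. In the housing price example, the plot would reveal a very wiggly curve passing through the training set. We can use such figures to choose an appropriate degree. Plotting the hypothesis is therefore one way to decide which degree of polynomial to use.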

But that doesn't always work. In fact, more often we may have learning problems where we just have a lot of features, and then it is not just a matter of selecting what degree of polynomial. In fact, when we have so many features, it also becomes much harder to plot the data and much harder to visualize it to decide what features to keep or not. So concretely, if we're trying to predict housing prices, sometimes we can just have a lot of different features. And all of these features seem, you know, maybe kind of useful. But if we have a lot of features and very little training data, then overfitting can become a problem.

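However, this approach does not always work. When there are many features, the question is no longer just which polynomial degree to choose, and the hypothesis can no longer be drawn as a picture. The more features there are, the harder the data are to visualize, and it is not easy to decide which features to keep. For example, an algorithm for predicting housing prices might have 100 features, all of which seem useful. When there are many features but little training data, as here, overfitting (Overfit) becomes a problem.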
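The effect of many features and very little training data can be illustrated with a deliberately extreme sketch. Everything here is an assumption for illustration: the data are random noise, so the 100 features carry no real information about the target, and plain least squares (NumPy's `np.linalg.lstsq`) stands in for the learning algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

m_train, n_features = 5, 100  # very few examples, many features

# Hypothetical data: features and targets are independent noise.
X_train = rng.normal(size=(m_train, n_features))
y_train = rng.normal(size=m_train)
X_new = rng.normal(size=(200, n_features))
y_new = rng.normal(size=200)

# Least-squares fit; with more features than examples it can match
# every training target exactly (lstsq returns the minimum-norm solution).
theta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_mse = float(np.mean((X_train @ theta - y_train) ** 2))
new_mse = float(np.mean((X_new @ theta - y_new) ** 2))

print(f"training MSE = {train_mse:.2e}")
print(f"new-data MSE = {new_mse:.2f}")
```

Even though there is nothing to learn, the training error is essentially zero, while the error on new examples stays large. With 100 dimensions there is also no plot to inspect, so the problem cannot be spotted visually.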

In order to address overfitting, there are two main options for things that we can do. The first option is to try to reduce the number of features. Concretely, one thing we could do is manually look through the list of features and use that to try to decide which are the more important features, and therefore which features we should keep and which we should throw out. Later in this course, we'll also talk about model selection algorithms, which are algorithms for automatically deciding which features to keep and which to throw out. This idea of reducing the number of features can work well and can reduce overfitting. And when we talk about model selection, we'll go into this in much greater depth. But the disadvantage is that by throwing away some of the features, you are also throwing away some of the information you have about the problem. For example, maybe all of those features are actually useful for predicting the price of a house, so maybe we don't actually want to throw some of our information or some of our features away.

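There are two options for addressing overfitting. The first option is to reduce the number of features. For example, we can go through the features manually, separate the useful ones from the less useful ones, and decide which to keep and which to discard. Later in this course, we will learn about model selection algorithms, which decide automatically which features to keep and which to discard. Reducing the number of features can reduce overfitting. The disadvantage is that discarding features also discards some of the information we have about the problem. For example, if all of the features are useful for predicting housing prices, we may not want to discard any of them.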

The second option, which we'll talk about in the next few videos, is regularization. Here, we're going to keep all the features, but we're going to reduce the magnitude, or the values, of the parameters θj. And this method works well, we'll see, when we have a lot of features, each of which contributes a little bit to predicting the value of y, like we saw in the housing price prediction example, where we could have a lot of features, each of which is, you know, somewhat useful, so maybe we don't want to throw them away. So this describes the idea of regularization at a very high level. And I realize that all of these details probably don't make sense to you yet.

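The second option is regularization (Regularization), which the next lectures cover. We keep all the features but reduce the magnitude of their influence by reducing the values of the parameters θj. This works well when there are many features, each of which contributes a little to the predicted value. In the housing price example, there are many features, and each one is somewhat useful for the prediction, so we do not want to remove any of them. This is where regularization is needed. This lecture gives only the big picture of regularization and leaves out the details.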
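This lecture does not yet define regularization precisely, so the sketch below rests on an assumption: it uses one common form, a squared (L2) penalty on every parameter except the intercept θ0, solved in closed form. The data, the helper name `fit_regularized`, and the penalty weights are all made up for illustration. The point is only to show "keep all the features, but shrink the parameters" on a fourth-order polynomial.

```python
import numpy as np

# Hypothetical training set (an assumption): five (size, price) pairs.
sizes = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
prices = np.array([1.0, 2.9, 3.0, 4.1, 3.8])

# Design matrix with columns 1, x, x^2, x^3, x^4: all five features are kept.
X = np.vander(sizes, N=5, increasing=True)

def fit_regularized(X, y, lam):
    """Minimize the sum of squared errors plus lam * sum(theta_j^2 for j >= 1)."""
    penalty = np.eye(X.shape[1])
    penalty[0, 0] = 0.0  # assumption: the intercept theta_0 is not penalized
    return np.linalg.solve(X.T @ X + lam * penalty, X.T @ y)

magnitudes = {}
for lam in (0.0, 1.0, 100.0):
    theta = fit_regularized(X, prices, lam)
    magnitudes[lam] = float(np.linalg.norm(theta[1:]))
    print(f"lambda = {lam:6.1f}: magnitude of theta_1..theta_4 = {magnitudes[lam]:.4f}")
```

With a penalty weight of zero, the fit is the wiggly curve through all five points. As the weight grows, the parameters θ1 through θ4 shrink while no feature is removed. How to choose the weight is left to the later lectures.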

But in the next video, we'll start to formulate exactly how to apply regularization and exactly what regularization means. And then we'll start to figure out how to use this to make our learning algorithms work well and avoid overfitting.

In the next lecture, we will define regularization mathematically and explain how to apply it. We will then learn how to use regularization to make learning algorithms perform well and avoid overfitting.

Andrew Ng's machine learning video lecture

Summary

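The hypothesis in the left figure is a first-degree function and draws a blue straight line through the data. A hypothesis that fails to fit the training set like this is said to underfit (Underfit). We also say that the algorithm has high bias (High Bias), or a strong preconception (Preconception).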

The hypothesis in the middle figure is a quadratic function and draws a blue curve through the data. We say that it fits the data "just right" (Just right).

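The hypothesis in the right figure is a fourth-order polynomial that draws an irregular curve passing through every data point. High-order polynomial hypotheses can draw curves that fit the training data perfectly. Intuitively, though, the curve does not look reasonable, and it will not predict housing prices well. This is called overfitting. We also say that the algorithm has high variance (High Variance).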

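Overfitting occurs when there are many features and the hypothesis fits the training set too closely. The cost function on the training set is zero or very close to zero, but the hypothesis draws a very complicated curve. As a result, it predicts poorly on new examples that are not in the training set. Generalization refers to how well a hypothesis fits new data.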

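There are several ways to deal with overfitting. The first is to plot the hypothesis. This is of limited use, because the more features there are, the harder the hypothesis is to visualize. The second is to reduce the number of features, but it is hard to choose which features to remove. Regularization therefore addresses overfitting while keeping all the features.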