by 라인하트 Nov 01. 2020

Andrew Ng's Machine Learning (10-3): Model Selection and Validation

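Professor Andrew Ng, co-founder of the online course platform Coursera, is a leading figure in the AI field. The introductory machine learning course he taught at Stanford University is available for free on Coursera (Coursera.org). It is an essential course for machine learning beginners, and one you naturally come across when studying AI and machine learning on your own.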

Advice for Applying Machine Learning

Evaluating a Learning Algorithm

Model Selection and Train/Validation/Test Sets

Suppose you're left to decide what degree of polynomial to fit to a data set, or what features to include in a learning algorithm. Or suppose you'd like to choose the regularization parameter lambda for a learning algorithm. How do you do that? These are called model selection problems. And in our discussion of how to do this, we'll talk about not just how to split your data into train and test sets, but how to split it into what are called the train, validation, and test sets. We'll see in this video just what these things are, and how to use them to do model selection.

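How should you decide the degree of the polynomial to fit to a training set, the value of the regularization parameter, or which features to use? This is the model selection problem. This post explains how to split the data not just into a training set and a test set, but into a training set, a validation set, and a test set. It covers what role each set plays and how to use them for model selection.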

We've already seen many times the problem of overfitting, in which just because a learning algorithm fits a training set well, that doesn't mean it's a good hypothesis. More generally, this is why the training set error is not a good predictor of how well the hypothesis will do on new examples. Concretely, if you fit some set of parameters, theta0, theta1, theta2, and so on, to your training set, then the fact that your hypothesis does well on the training set doesn't mean much in terms of predicting how well your hypothesis will generalize to new examples not seen in the training set. And a more general principle is that once your parameters have been fit to some set of data, maybe the training set, maybe something else, then the error of your hypothesis as measured on that same data set, such as the training error, is unlikely to be a good estimate of your actual generalization error, that is, of how well the hypothesis will generalize to new examples.

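We have covered the overfitting problem several times: a learning algorithm that fits the training set well is not necessarily a good hypothesis. More generally, this is why the training set error is not a good measure of how well the hypothesis will do on new examples. The fact that the parameters θ0, θ1, θ2, θ3, θ4 fit the training set says little about how well the hypothesis will fit new examples. It only shows that the parameters fit the training set. The error of a hypothesis measured on the same data it was fit to is unlikely to be a good estimate of the actual generalization error, that is, of how well the hypothesis generalizes to new examples.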

Now let's consider the model selection problem. Let's say you're trying to choose what degree of polynomial to fit to data. So, should you choose a linear function, a quadratic function, a cubic function, all the way up to a 10th-order polynomial? It's as if there's one extra parameter in this algorithm, which I'm going to denote d, which is what degree of polynomial you want to pick. So in addition to the theta parameters, it's as if there's one more parameter, d, that you're trying to determine using your data set. The first option is d equals one, if you fit a linear function. We can choose d equals two, d equals three, all the way up to d equals 10. So, we'd like to fit this extra sort of parameter which I'm denoting by d. And concretely, let's say that you want to choose a model, that is, choose a degree of polynomial, choose one of these 10 models, fit that model, and also get some estimate of how well your fitted hypothesis will generalize to new examples.

Let's lay out the model selection problem.

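We want to choose the polynomial to fit to the training set. Which should we pick: a linear function, a quadratic, a cubic, and so on up to a 10th-order polynomial? This is like adding one more parameter, d, to the algorithm, where d is the degree of the polynomial. The first option is d = 1, the second is d = 2, the third is d = 3, and the last is d = 10. The task is to find the value of d that suits the data. Selecting a model means selecting the degree of the polynomial, that is, picking one of the 10 degrees. We then want to fit the chosen model and also obtain an estimate of how well the fitted hypothesis generalizes to new examples.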

Here's one thing you could do. You could first take your first model and minimize the training error. This would give you some parameter vector theta. You could then take your second model, the quadratic function, and fit that to your training set, and this will give you some other parameter vector theta. In order to distinguish between these different parameter vectors, I'm going to use a superscript one and a superscript two, where theta superscript one just means the parameters I get by fitting the linear model to my training data, and theta superscript two just means the parameters I get by fitting the quadratic function to my training data, and so on. By fitting a cubic model I get theta superscript three, and so on up to, say, theta superscript 10. And one thing we could do is take these parameters and look at the test error. So I can compute on my test set J test of theta one, J test of theta two, J test of theta three, and so on. So I'm going to take each of my hypotheses with the corresponding parameters and just measure its performance on the test set. Now, one thing I could do then, in order to select one of these models, is see which model has the lowest test set error.

Take the first model and minimize the training error. After training, the first model yields the parameter vector θ^(1). Then take the second model and minimize the training error, which yields the parameter vector θ^(2). Superscripts distinguish the parameter vectors of the different models: θ^(1) is the parameter vector of the first model fit to the training set, θ^(2) that of the second model, θ^(3) that of the third model, and finally θ^(10) that of the tenth model. Apply the 10 parameter vectors to the test set and compute the test error. Measure performance on the test set with Jtest(θ^(1)), Jtest(θ^(2)), Jtest(θ^(3)), ..., Jtest(θ^(10)), and select the model with the lowest test set error.
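The procedure above can be sketched in Python. This is only an illustration under stated assumptions: the data is synthetic, the 70/30 train/test split is an arbitrary choice, the helper names `squared_error_cost` and `fit_polynomials` are made up for this example, and `numpy.polyfit` (ordinary least squares) stands in for minimizing the training error.

```python
import numpy as np


def squared_error_cost(theta, x, y):
    """J(theta) = 1/(2m) * sum((h_theta(x) - y)^2) for a polynomial hypothesis."""
    residuals = np.polyval(theta, x) - y
    return float(np.sum(residuals ** 2) / (2 * len(x)))


def fit_polynomials(x_train, y_train, max_degree=10):
    """Return {d: theta^(d)}: one least-squares polynomial fit per degree d."""
    return {d: np.polyfit(x_train, y_train, d) for d in range(1, max_degree + 1)}


# Synthetic data (an assumption for illustration): a noisy cubic.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 1.0 - 2.0 * x + 0.5 * x ** 3 + rng.normal(scale=0.1, size=x.shape)

# Two-way split only: this is the setup the text goes on to criticize.
x_train, y_train = x[:140], y[:140]
x_test, y_test = x[140:], y[140:]

thetas = fit_polynomials(x_train, y_train)
j_test = {d: squared_error_cost(theta, x_test, y_test) for d, theta in thetas.items()}

# Pick the degree whose hypothesis has the lowest test set error.
best_d = min(j_test, key=j_test.get)
print("degree with lowest test error:", best_d)
```

Note that `best_d` is itself chosen by looking at the test set, so `j_test[best_d]` is not an independent measurement of the selected model.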

And let's just say for this example that I ended up choosing the fifth-order polynomial. This seems reasonable so far. But now let's say I want to take my fifth hypothesis, this fifth-order model, and ask: how well does this model generalize? One thing I could do is look at how well my fifth-order polynomial hypothesis did on my test set. But the problem is that this will not be a fair estimate of how well my hypothesis generalizes. The reason is that we've fit this extra parameter d, the degree of polynomial, and we fit that parameter d using the test set. Namely, we chose the value of d that gave us the best possible performance on the test set. And so the performance of my parameter vector theta superscript five on the test set is likely to be an overly optimistic estimate of generalization error. Because I fit this parameter d to my test set, it is no longer fair to evaluate my hypothesis on this test set: I've chosen the degree d of polynomial using the test set.

And so my hypothesis is likely to do better on this test set than it would on new examples that it hasn't seen before, which is what I really care about. So just to reiterate, on the previous slide we saw that if we fit some set of parameters, say theta0, theta1, and so on, to some training set, then the performance of the fitted model on the training set is not predictive of how well the hypothesis will generalize to new examples. This is because these parameters were fit to the training set, so they're likely to do well on the training set, even if the parameters don't do well on other examples.

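Suppose we select the fifth hypothesis, the fifth-order polynomial, because it has the lowest test error. To check how well the fifth hypothesis generalizes, we could look at its error on the test set. But because the parameter d was chosen using the test set, the result on the test set is not a fair estimate. The performance of the parameter vector θ^(5) on the test set is likely to be an overly optimistic estimate of the generalization error. Because d was chosen to fit the test set, evaluating the hypothesis on that same test set is not fair. This parallels the earlier point about parameters θ0, θ1, and so on that are fit to the training set: performance on the training set does not predict how well the hypothesis generalizes to new examples, because the parameters are likely to fit the training set well even if they do not fit new examples.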
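One way to see why the estimate is optimistic is the short argument below. The symbols G(θ) for the true generalization error and d̂ for the selected degree are introduced here for illustration and do not appear in the lecture. The expectation is taken over the random draw of the test set, with the training set (and therefore each θ^(d)) held fixed, so that each Jtest(θ^(d)) is an unbiased estimate of G(θ^(d)).

```latex
\begin{align*}
\hat{d} &= \arg\min_{d \in \{1,\dots,10\}} J_{\text{test}}\big(\theta^{(d)}\big) \\
\mathbb{E}\Big[ J_{\text{test}}\big(\theta^{(\hat{d})}\big) \Big]
  &= \mathbb{E}\Big[ \min_{d} J_{\text{test}}\big(\theta^{(d)}\big) \Big] \\
  &\le \min_{d} \mathbb{E}\Big[ J_{\text{test}}\big(\theta^{(d)}\big) \Big]
   = \min_{d} G\big(\theta^{(d)}\big) \\
  &\le \mathbb{E}\Big[ G\big(\theta^{(\hat{d})}\big) \Big]
\end{align*}
```

So, on average, the test error reported for the selected model is no larger than its true generalization error, and it is typically smaller.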

And in the procedure I just described, we did the same thing. Specifically, we fit this parameter d to the test set. And having fit the parameter to the test set means that the performance of the hypothesis on that test set may not be a fair estimate of how well the hypothesis is likely to do on examples we haven't seen before. To address this problem in a model selection setting, if we want to evaluate a hypothesis, this is what we usually do instead. Given the data set, instead of just splitting it into a training set and a test set, we're going to split it into three pieces. And the first piece is going to be called the training set, as usual.

So let me call this first part the training set. The second piece of this data I'm going to call the cross validation set, which I'll abbreviate as CV. Sometimes it's also called the validation set instead of the cross validation set. And the last part I'm going to call the usual test set. A pretty typical ratio at which to split these things is to send 60% of your data to your training set, maybe 20% to your cross validation set, and 20% to your test set. These numbers can vary a little bit, but this sort of ratio is pretty typical. And so our training set will now be only maybe 60% of the data, and our cross validation set, or validation set, will have some number of examples. I'm going to denote that m subscript cv. So that's the number of cross validation examples. Following our earlier notational convention, I'm going to use (x^(i)_cv, y^(i)_cv) to denote the i-th cross validation example. And finally we also have a test set, with m subscript test being the number of test examples.

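The procedure just described makes the same mistake. Because we found the parameter d that fits the test set, we cannot use the test set to check how well the hypothesis generalizes to new examples. We cannot get a fair estimate. To solve this problem, instead of splitting the data set into a training set and a test set, we split it into three parts.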

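The first part of the data is the training set. The second part is the cross-validation set, also called the validation set. The rest is the test set. A typical split is 60% training set, 20% cross-validation set, and 20% test set. The training set is now only 60% of the whole data. m_cv, with cv as a subscript, is the total number of examples in the cross-validation set. A cross-validation example is written (x_cv^(i), y_cv^(i)). Finally, the total number of test set examples is m_test.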
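A 60/20/20 split could be sketched as follows. The function name `split_train_cv_test`, its parameters, and the toy data are assumptions made for this example. Shuffling before splitting is also an assumption here; it guards against data that happens to be stored in some meaningful order.

```python
import numpy as np


def split_train_cv_test(x, y, train_frac=0.6, cv_frac=0.2, seed=0):
    """Shuffle the examples, then split them into training, CV, and test sets."""
    m = len(x)
    order = np.random.default_rng(seed).permutation(m)
    n_train = int(round(train_frac * m))
    n_cv = int(round(cv_frac * m))
    train_idx = order[:n_train]
    cv_idx = order[n_train:n_train + n_cv]
    test_idx = order[n_train + n_cv:]  # whatever remains goes to the test set
    return (
        (x[train_idx], y[train_idx]),
        (x[cv_idx], y[cv_idx]),
        (x[test_idx], y[test_idx]),
    )


# Toy data for illustration: 100 examples.
x = np.arange(100, dtype=float)
y = 2.0 * x

(x_train, y_train), (x_cv, y_cv), (x_test, y_test) = split_train_cv_test(x, y)
print(len(x_train), len(x_cv), len(x_test))  # 60 20 20
```

Here `len(x_train)` plays the role of m, `len(x_cv)` of m_cv, and `len(x_test)` of m_test.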

So, now that we've defined the training, validation (or cross validation), and test sets, we can also define the training error, cross validation error, and test error. So here's my training error, and I'm just writing this as J subscript train of theta. This is pretty much the same thing as the J of theta that I've been writing so far; it's just the training set error, as measured on the training set. And then J subscript cv is my cross validation error. This is pretty much what you'd expect: just like the training error, except measured on the cross validation data set. And here's my test set error, same as before.

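We have now defined the training set, the cross-validation set, and the test set. We can also define the training error, the cross-validation error, and the test error.

These are the training error Jtrain(θ), the cross-validation error Jcv(θ), and the test error Jtest(θ). The training error is the same as the J(θ) used so far, measured on the training set. The cross-validation error and the test error are the same quantity measured on the cross-validation set and on the test set.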
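For linear regression with the squared-error cost, the three errors can be written as below. The notation for test examples, x_test^(i) and y_test^(i), is assumed here by analogy with the cross-validation notation.

```latex
\begin{align*}
J_{\text{train}}(\theta) &= \frac{1}{2m} \sum_{i=1}^{m}
  \left( h_\theta\big(x^{(i)}\big) - y^{(i)} \right)^2 \\
J_{\text{cv}}(\theta) &= \frac{1}{2m_{\text{cv}}} \sum_{i=1}^{m_{\text{cv}}}
  \left( h_\theta\big(x_{\text{cv}}^{(i)}\big) - y_{\text{cv}}^{(i)} \right)^2 \\
J_{\text{test}}(\theta) &= \frac{1}{2m_{\text{test}}} \sum_{i=1}^{m_{\text{test}}}
  \left( h_\theta\big(x_{\text{test}}^{(i)}\big) - y_{\text{test}}^{(i)} \right)^2
\end{align*}
```

The three expressions differ only in which set of examples is summed over and in the corresponding count m, m_cv, or m_test.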

So when faced with a model selection problem like this, instead of using the test set to select the model, we're going to use the validation set, or cross validation set, to select the model. Concretely, we're going to first take our first hypothesis, this first model, and minimize the cost function, and this would give me some parameter vector theta for the linear model. As before, I'm going to put a superscript 1, just to denote that this is the parameter for the linear model. We do the same thing for the quadratic model and get some parameter vector theta two, then some parameter vector theta three, and so on, down to theta ten for the tenth-order polynomial. And what I'm going to do is, instead of testing these hypotheses on the test set, I'm going to test them on the cross validation set, and measure J subscript cv to see how well each of these hypotheses does on my cross validation set. And then I'm going to pick the hypothesis with the lowest cross validation error.

So for this example, let's say for the sake of argument that it was my 4th-order polynomial that had the lowest cross validation error. In that case I'm going to pick this fourth-order polynomial model. And finally, what this means is that the parameter d (remember, d was the degree of polynomial, so d equals two, d equals three, all the way up to d equals 10) has been fit: we'll say d equals four. And we did so using the cross validation set. So this degree of polynomial, this parameter, is no longer fit to the test set. We've now saved away the test set, and we can use the test set to measure, or to estimate, the generalization error of the model that was selected. So, that was model selection, and how you can take your data, split it into a training, validation, and test set, use your cross validation data to select the model, and evaluate it on the test set.

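So, for model selection we use the cross-validation set instead of the test set. Taking the first hypothesis, that is, the first model, and minimizing the cost function J(θ) gives the parameter vector θ^(1). The superscript 1 marks it as the result for the first model. Taking the second model and minimizing J(θ) gives the parameter vector θ^(2). Taking the third model and minimizing J(θ) gives θ^(3). Proceeding the same way up to the tenth model gives θ^(10). Next, measure the cross-validation error Jcv(θ): Jcv(θ^(1)), Jcv(θ^(2)), Jcv(θ^(3)), ..., Jcv(θ^(10)). The cross-validation error Jcv(θ) tells us how well a hypothesis does on the cross-validation set. Finally, select the hypothesis with the lowest error.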

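Suppose the fourth model, the 4th-order polynomial, has the lowest cross-validation error. Recall the parameter d, which denotes the degree of the polynomial: across the models, d = 1, d = 2, d = 3, ..., d = 10. The fourth model has d = 4, and we chose this value using the cross-validation set. The parameter d is therefore no longer fit to the test set. Because the test set was not used for selection, it can be used to measure, or estimate, the generalization error of the selected model. That is how to split the data into training, validation, and test sets for model selection: train on the training set, select the model with the cross-validation set, and evaluate on the test set.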
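The whole procedure (train on the training set, select d on the cross-validation set, report the error on the test set) might look like the following sketch. The synthetic data, the helper names `cost` and `select_degree`, and the use of `numpy.polyfit` as the training step are all assumptions made for illustration.

```python
import numpy as np


def cost(theta, x, y):
    """Squared-error cost 1/(2m) * sum((h_theta(x) - y)^2) on the given set."""
    residuals = np.polyval(theta, x) - y
    return float(np.sum(residuals ** 2) / (2 * len(x)))


def select_degree(train, cv, test, degrees=range(1, 11)):
    """Fit each degree on train, pick the degree by J_cv, report J_test once."""
    x_train, y_train = train
    # theta^(d): parameters from fitting the degree-d polynomial to the training set.
    thetas = {d: np.polyfit(x_train, y_train, d) for d in degrees}
    # J_cv(theta^(d)): the cross-validation set is used only to choose d.
    j_cv = {d: cost(thetas[d], *cv) for d in degrees}
    best_d = min(j_cv, key=j_cv.get)
    # J_test(theta^(best_d)): the test set is touched only after d is fixed.
    j_test = cost(thetas[best_d], *test)
    return best_d, j_cv, j_test


# Synthetic data (an assumption for illustration): a noisy cubic, drawn i.i.d.,
# so slicing in order already gives a random 60/20/20 split.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=300)
y = 1.0 - 2.0 * x + 0.5 * x ** 3 + rng.normal(scale=0.1, size=x.shape)

train = (x[:180], y[:180])
cv = (x[180:240], y[180:240])
test = (x[240:], y[240:])

best_d, j_cv, j_test = select_degree(train, cv, test)
print("selected degree d:", best_d)
print("estimated generalization error J_test:", j_test)
```

The design point is that `j_test` is computed exactly once, for the already selected model, so no parameter (neither θ nor d) has been fit to the test set.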

One final note. I should say that in machine learning as it is practiced today, there are many people who will do that earlier thing that I talked about, and said isn't such a good idea: selecting your model using the test set, and then using the same test set to report the error. That is, selecting your degree of polynomial on the test set, and then reporting the error on the test set as though that were a good estimate of generalization error. Unfortunately, many, many people do follow that sort of practice. If you have a massive, massive test set, it is maybe not a terrible thing to do, but most practitioners of machine learning tend to advise against it.

And it's considered better practice to have separate train, validation, and test sets. I just want to warn you that sometimes people do use the same data for the purpose of the validation set and for the purpose of the test set, so that they have only a training set and a test set. That's not considered good practice, though you will see some people do it. But, if possible, I would recommend against doing that yourself.

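Finally, in machine learning as it is practiced today, many people still do the thing described earlier. Selecting a model using the test set is not a good approach: the same test set is used to choose the degree of the polynomial and then to compute the generalization error. Unfortunately, many practitioners do exactly that.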

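In practice, the better approach is to keep separate training, cross-validation, and test sets, using different data for the purpose of the cross-validation set and for the purpose of the test set. Some people split the data into only a training set and a test set. If possible, use a separate cross-validation set as well.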

Andrew Ng's Machine Learning video lecture

Summary

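To choose the degree of the polynomial in the hypothesis, we use a parameter d, which denotes that degree. Splitting the data set into only a training set and a test set causes a problem. After selecting the model with the lowest test error in order to choose the degree, there is no way left to check whether it really works, because the test set cannot be used both to choose the degree of the polynomial and to validate the result.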

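So we split the data set into three parts: 60% training set, 20% cross-validation set, and 20% test set. We train on the training set, select the model with the cross-validation set, and use the test set to evaluate whether the selected model really works.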

For a linear regression model, the errors are computed as follows.