by 라인하트 Nov 03. 2020

Andrew Ng's Machine Learning (10-6): Learning Curves

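Professor Andrew Ng, co-founder of the online course platform Coursera, is a leading figure in the artificial intelligence field. The machine learning course he taught to beginners at Stanford University is available for free as an online course on Coursera (Coursera.org). It is an essential course for machine learning beginners, and one you naturally come across when studying artificial intelligence and machine learning on your own.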

Advice for Applying Machine Learning

Bias vs. Variance

Learning Curves

In this video, I'd like to tell you about learning curves. Learning curves are often a very useful thing to plot, whether you want to sanity check that your algorithm is working correctly or you want to improve the performance of the algorithm. And learning curves are a tool that I actually use very often to try to diagnose whether a particular learning algorithm may be suffering from a bias problem, a variance problem, or a bit of both.

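This lecture explains learning curves. A learning curve is useful as a sanity check that a learning algorithm is working correctly, and for pinpointing problems when you want to improve the algorithm's performance. In practice, it is a tool for diagnosing whether a particular learning algorithm suffers from a bias problem, a variance problem, or both.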

Here's what a learning curve is. To plot a learning curve, what I usually do is plot Jtrain, which is, say, the average squared error on my training set, or Jcv, which is the average squared error on my cross validation set. And I'm going to plot that as a function of m, that is, as a function of the number of training examples I have. Now, m is usually a constant, like maybe I just have 100 training examples, but what I'm going to do is artificially reduce my training set size.

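Here is what a learning curve is. To plot one, we usually draw Jtrain(θ), the average squared error on the training set, and Jcv(θ), the average squared error on the cross-validation set, as functions of m. Here m is the number of training examples. If you currently have 100 training examples, m is normally the constant 100, but for the plot the training set size is reduced artificially.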
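For reference, the two plotted quantities can be written out as below. This is a sketch that assumes the course's usual squared-error convention with a factor of 1/2 and no regularization term inside the reported error. The symbols m_cv (the number of cross-validation examples) and the subscript cv on the cross-validation examples are notation introduced here for illustration.

```latex
J_{\text{train}}(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2
\qquad
J_{\text{cv}}(\theta) = \frac{1}{2m_{cv}} \sum_{i=1}^{m_{cv}} \left( h_\theta\!\left(x_{cv}^{(i)}\right) - y_{cv}^{(i)} \right)^2
```

For a learning curve, θ is fit on only the first m training examples, Jtrain(θ) is summed over those same m examples, and Jcv(θ) is summed over the whole cross-validation set.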

So, I deliberately limit myself to using only, say, 10 or 20 or 30 or 40 training examples, and plot what the training error is and what the cross validation error is for these smaller training set sizes. So let's see what these plots may look like. Suppose I have only one training example, like the one shown in this first example here, and let's say I'm fitting a quadratic function. Well, I have only one training example, so I'm going to be able to fit it perfectly, right? I just fit the quadratic function and I'm going to have 0 error on the one training example. If I have two training examples, well, the quadratic function can also fit that very well. So, even if I am using regularization, I can probably fit this quite well, and if I am using no regularization, I'm going to fit this perfectly. And if I have three training examples, again, I can fit a quadratic function perfectly. So if m equals 1 or m equals 2 or m equals 3, my training error on my training set is going to be 0, assuming I'm not using regularization, or it may be slightly larger than 0 if I'm using regularization. And by the way, if I have a large training set and I'm artificially restricting the size of my training set in order to plot Jtrain, then if I set m equals 3, say, and I train on only three examples, then for this figure I am going to measure my training error only on the three examples that I actually fit my hypothesis to. So even if I have, say, 100 training examples, if I want to plot what my training error is when m equals 3, what I'm going to do is measure the training error on the three examples that I've actually fit my hypothesis to.

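We deliberately limit ourselves to only 10, 20, 30, or 40 training examples and plot the training error and the cross-validation error for these smaller training sets. The hypothesis is the quadratic function hθ(x) = θ0 + θ1x + θ2x^2. The first figure has a single example (m = 1): the hypothesis fits that one point perfectly, so the training error is 0. The second figure has two examples (m = 2): the hypothesis again fits both perfectly without regularization, and still fits them closely with regularization. The third figure has three examples (m = 3), which a quadratic can also fit perfectly. So for m = 1, 2, and 3 the training error is 0 without regularization, or slightly above 0 with regularization. Note that when a large training set is artificially restricted in order to plot Jtrain(θ), the training error is measured only on the examples used for fitting: with 100 examples available and m = 3, the hypothesis is trained on 3 examples and the training error is measured on those same 3 examples.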

And not on all the other examples that I have deliberately omitted from the training process. So just to summarize, what we've seen is that if the training set size is small, then the training error is going to be small as well, because when we have a small training set, it's going to be very easy to fit the training set very well, maybe even perfectly. Now say we have m equals 4, for example. Well, then a quadratic function can no longer fit this data set perfectly, and if I have m equals 5, then maybe the quadratic function will fit the data only so-so. So as my training set gets larger, it becomes harder and harder to ensure that I can find a quadratic function that passes through all my examples perfectly. So in fact, as the training set size grows, what you find is that my average training error actually increases, and so if you plot this figure, what you find is that the training set error, that is, the average error of your hypothesis, grows as m grows. And just to repeat, the intuition is that when m is small, when you have very few training examples, it's pretty easy to fit every single one of your training examples perfectly, and so your error is going to be small, whereas when m is larger, it gets harder to fit all the training examples perfectly, and so your training set error becomes larger.

The examples deliberately left out of training are not included in that measurement. When the training set is small, the training error is small as well.

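When m = 4, however, a quadratic function can no longer fit the data set perfectly, and when m = 5 it fits only roughly. As the training set grows, it becomes harder and harder to find a quadratic function that passes through every example. So the average training error increases as the training set size grows: the average error of the hypothesis rises as m increases. In other words, when m is small it is easy to fit every training example perfectly, so the error is small; when m is large it is hard to fit every training example perfectly, so the error is larger.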
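The effect of m on the training error can be checked with a small experiment. The sketch below is only an illustration under stated assumptions: the six data points are made up, the tiny ridge term LAMBDA (applied to all three coefficients for simplicity) is added here so that the m = 1 and m = 2 fits have a unique solution, and the helper names solve, fit_quadratic, and cost are hypothetical. It uses only the Python standard library.

```python
# Hypothetical toy data, chosen so that no quadratic passes through the first four points.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 3.0, 2.0, 5.0, 4.0, 8.0]
LAMBDA = 1e-6  # tiny ridge term (assumption): keeps the m < 3 systems solvable


def solve(a, b):
    """Solve the linear system a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_quadratic(x_sub, y_sub, lam):
    """Ridge-regularized least squares for theta0 + theta1*x + theta2*x^2."""
    a = [[sum(x ** (i + j) for x in x_sub) + (lam if i == j else 0.0)
          for j in range(3)] for i in range(3)]
    b = [sum(y * x ** i for x, y in zip(x_sub, y_sub)) for i in range(3)]
    return solve(a, b)


def cost(theta, x_sub, y_sub):
    """Average squared error with the 1/(2m) convention."""
    errors = [sum(t * x ** k for k, t in enumerate(theta)) - y
              for x, y in zip(x_sub, y_sub)]
    return sum(e * e for e in errors) / (2 * len(errors))


train_errors = []
for m in range(1, len(xs) + 1):
    theta = fit_quadratic(xs[:m], ys[:m], LAMBDA)
    # Measure the error only on the m examples that were used for fitting.
    train_errors.append(cost(theta, xs[:m], ys[:m]))

for m, err in enumerate(train_errors, start=1):
    print(f"m = {m}: J_train = {err:.6f}")
```

With this data the training error is essentially zero for m = 1, 2, and 3 (slightly above zero because of the small regularization term) and clearly non-zero from m = 4 onward.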

How about the cross validation error? Well, the cross validation error is my error on this cross validation set that I haven't seen, and so when I have a very small training set, I'm not going to generalize well, I'm just not going to do well on that. So, right, this hypothesis here doesn't look like a good one, and it's only when I get a larger training set that I'm starting to get hypotheses that maybe fit the data somewhat better. So your cross validation error and your test set error will tend to decrease as your training set size increases, because the more data you have, the better you do at generalizing to new examples. So, just the more data you have, the better the hypothesis you fit.

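Next is the cross-validation error. This is the error on a cross-validation set that was not used during training. When the training set is very small, the hypothesis does not generalize well; the hypothesis here does not look like a good one. Only with a larger training set do we start to get hypotheses that fit the data better. The cross-validation error and the test error therefore tend to decrease as the training set size increases, because the more data you have, the better the hypothesis generalizes to new examples.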

So if you plot Jtrain and Jcv, this is the sort of thing that you get. Now let's look at what the learning curves may look like if we have either high bias or high variance problems. Suppose your hypothesis has high bias, and to explain this I'm going to use, as an example, fitting a straight line to data that can't really be fit well by a straight line. So we end up with a hypothesis that maybe looks like that. Now let's think what would happen if we were to increase the training set size. So if instead of five examples like what I've drawn there, imagine that we have a lot more training examples. Well, what happens if you fit a straight line to this? What you find is that you end up with pretty much the same straight line. I mean, a straight line just cannot fit this data, and getting a ton more data, well, the straight line isn't going to change that much. This is the best possible straight-line fit to this data, but the straight line just can't fit this data set that well.

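Plotting Jtrain(θ) and Jcv(θ) gives curves like these. Now let's look at what the learning curves look like with a high bias or a high variance problem. To illustrate high bias, fit a straight line to data that a straight line cannot fit well. Now consider what happens when the training set grows. Instead of the five examples drawn a moment ago, imagine many more training examples. What happens when you fit a straight line to them? You end up with almost the same straight line: it is the best possible straight-line fit, but a straight line simply cannot fit this data well.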

So, if you plot the cross validation error, this is what it will look like. Up on the left, if you have a minuscule training set size, like maybe just one training example, it's not going to do well. But by the time you have reached a certain number of training examples, you have almost fit the best possible straight line, and even if you end up with a much larger training set size, a much larger value of m, you're basically getting the same straight line. And so the cross validation error (let me label that) or test set error will plateau out, or flatten out, pretty soon, once you've reached beyond a certain number of training examples and pretty much fit the best possible straight line.

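The cross-validation error looks like this. The left of the horizontal axis is a small training set and the right is a large one. At the far left, with a single training example, the hypothesis does poorly. Once a certain number of training examples is reached, the fit is already close to the best possible straight line, and a much larger training set yields essentially the same line. The cross-validation error therefore drops at first and then flattens out soon after that point.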

And how about training error? Well, the training error will again be small. And what you find in the high bias case is that the training error will end up close to the cross validation error, because you have so few parameters and so much data, at least when m is large. The performance on the training set and the cross validation set will be very similar. And so, this is what your learning curves will look like, if you have an algorithm that has high bias.

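And the training error? It again starts out small. In the high bias case it ends up close to the cross-validation error, because, at least when m is large, there are so few parameters and so much data. Performance on the training set and on the cross-validation set is very similar. This is what the learning curves look like for an algorithm with high bias.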

And finally, the problem with high bias is reflected in the fact that both the cross validation error and the training error are high, and so you end up with a relatively high value of both Jcv and Jtrain. This also implies something very interesting, which is that, if a learning algorithm has high bias, as we get more and more training examples, that is, as we move to the right of this figure, we'll notice that the cross validation error isn't going down much, it's basically flattened out. And so if a learning algorithm is really suffering from high bias, getting more training data by itself will actually not help that much. As an example, in the figure on the right, here we had only five training examples, and we fit a certain straight line. And when we had a ton more training data, we still ended up with roughly the same straight line. And so if the learning algorithm has high bias, giving it a lot more training data doesn't actually help you get a much lower cross validation error or test set error. So knowing if your learning algorithm is suffering from high bias seems like a useful thing to know, because this can prevent you from wasting a lot of time collecting more training data where it might just not end up being helpful.

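Finally, the high bias problem shows up as both the cross-validation error and the training error being high, so Jtrain(θ) and Jcv(θ) both end up relatively high. It also means that if a learning algorithm has high bias, moving to the right in the figure by collecting more training examples leaves the cross-validation error almost unchanged. So when a learning algorithm suffers from high bias, getting more training data by itself does not help much. The upper-right figure has only five training examples and a straight line fit to them; the lower-right figure has many more training examples and still ends up with roughly the same straight line. With high bias, more data does little to lower the cross-validation error or the test error, so knowing that an algorithm has high bias can keep you from wasting time collecting more training data.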
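The high bias picture can be reproduced with a straight line fit to curved data. The following is a minimal sketch under assumptions that are not in the lecture: the ground truth is the noise-free function y = x^2 on [0, 3), the training pool is ordered by a base-2 van der Corput sequence so that every power-of-two prefix is an evenly spaced grid, and the names van_der_corput, target, fit_line, and cost are hypothetical helpers.

```python
def van_der_corput(i):
    """Return the i-th point of the base-2 van der Corput sequence in [0, 1)."""
    value, weight = 0.0, 0.5
    while i:
        value += weight * (i & 1)
        i >>= 1
        weight *= 0.5
    return value


def target(x):
    """Curved ground truth (assumption) that a straight line cannot follow."""
    return x * x


def fit_line(xs, ys):
    """Closed-form least-squares straight line: returns (intercept, slope)."""
    m = len(xs)
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cost(line, xs, ys):
    """Average squared error with the 1/(2m) convention."""
    intercept, slope = line
    return sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))


# Training pool of 256 points over [0, 3); each power-of-two prefix is an even grid.
train_x = [3 * van_der_corput(i) for i in range(256)]
train_y = [target(x) for x in train_x]
# Separate cross-validation set: 50 cell midpoints over the same range.
cv_x = [3 * (j + 0.5) / 50 for j in range(50)]
cv_y = [target(x) for x in cv_x]

sizes = [2, 4, 8, 16, 32, 64, 128, 256]
j_train, j_cv = [], []
for m in sizes:
    line = fit_line(train_x[:m], train_y[:m])
    j_train.append(cost(line, train_x[:m], train_y[:m]))  # only the m fitted examples
    j_cv.append(cost(line, cv_x, cv_y))                   # always the full CV set

for m, jt, jc in zip(sizes, j_train, j_cv):
    print(f"m = {m:3d}: J_train = {jt:.4f}  J_cv = {jc:.4f}")
```

In this setup the training error starts at zero for m = 2, the cross-validation error starts high, and both settle at nearly the same relatively high value. Going from 64 to 256 examples barely changes the cross-validation error, which is the plateau described above.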

Next, let us look at the setting of a learning algorithm that may have high variance. Let us just look at the training error here. If you have a very small training set, like the five training examples shown in the figure on the right, and if we're fitting, say, a very high order polynomial (I've written a hundredth degree polynomial, which really no one uses, but it's just an illustration), and if we're using a fairly small value of lambda, maybe not zero, but a fairly small value of lambda, then we'll end up fitting this data very well, with a function that overfits it. So, if the training set size is small, our training error, that is, Jtrain of theta, will be small. And as this training set size increases a bit, we may still be overfitting this data a little bit, but it also becomes slightly harder to fit this data set perfectly. And so, as the training set size increases, we'll find that Jtrain increases, because it is just a little harder to fit the training set perfectly when we have more examples, but the training set error will still be pretty low.

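Next, consider a learning algorithm with high variance, starting with the training error. The upper-right figure has five training examples, fit with a very high-order polynomial (degree 100) and a very small regularization parameter λ, small but not zero. The hypothesis fits the data very closely, which is overfitting. So when the training set is small, the training error Jtrain(θ) is small. As the training set grows, the hypothesis may still overfit, but fitting the data set perfectly becomes slightly harder, so Jtrain(θ) increases while still remaining low.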

Now, how about the cross validation error? Well, in high variance setting, a hypothesis is overfitting and so the cross validation error will remain high, even as we get you know, a moderate number of training examples and, so maybe, the cross validation error may look like that. And the indicative diagnostic that we have a high variance problem, is the fact that there's this large gap between the training error and the cross validation error.

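Now for the cross-validation error. In the high variance setting the hypothesis overfits, so the cross-validation error stays high even with a moderate number of training examples, as the magenta curve shows. The telltale sign of a high variance problem is a large gap between the training error and the cross-validation error.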

And looking at this figure, if we think about adding more training data, that is, taking this figure and extrapolating to the right, we can kind of tell that the two curves, the blue curve and the magenta curve, are converging to each other. And so, if we were to extrapolate this figure to the right, then it seems likely that the training error will keep on going up and the cross-validation error would keep on going down. And the thing we really care about is the cross-validation error or the test set error, right? So in this sort of figure, we can tell that if we keep on adding training examples and extrapolate to the right, well, our cross validation error will keep on coming down. And so, in the high variance setting, getting more training data is, indeed, likely to help. And so again, this seems like a useful thing to know if your learning algorithm is suffering from a high variance problem, because that tells you, for example, that it may be worth your while to see if you can go and get some more training data.

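Look at the figure on the left. Adding more training data corresponds to extending the graph to the right, where the blue curve and the magenta curve converge toward each other: the training error keeps rising and the cross-validation error keeps falling. For a learning algorithm with high variance, getting more training data is therefore likely to help, so it is worth checking whether more training data can be obtained.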
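The high variance picture can be sketched in the same way. The example below is an illustration under several assumptions that are not in the lecture: a degree-10 polynomial stands in for the degree-100 one, the ground truth is the noise-free Runge function 1 / (1 + 25x^2) on [-1, 1], the training sets are nested evenly spaced grids, LAMBDA is an arbitrary tiny value, and the helper names are hypothetical. The normal equations are solved with plain Gaussian elimination to stay within the standard library; a real project would more likely use a numerical library.

```python
DEGREE = 10     # deliberately high-order polynomial (stand-in for the lecture's degree 100)
LAMBDA = 1e-10  # very small, non-zero regularization (assumed value)


def target(x):
    """Noise-free ground truth (assumption): the Runge function."""
    return 1.0 / (1.0 + 25.0 * x * x)


def grid(m):
    """m evenly spaced points covering [-1, 1], endpoints included."""
    return [-1.0 + 2.0 * i / (m - 1) for i in range(m)]


def solve(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_poly(xs, ys, degree, lam):
    """Ridge-regularized least-squares polynomial fit via the normal equations."""
    p = degree + 1
    a = [[sum(x ** (i + j) for x in xs) + (lam if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(p)]
    return solve(a, b)


def cost(theta, xs, ys):
    """Average squared error with the 1/(2m) convention."""
    errors = [sum(t * x ** k for k, t in enumerate(theta)) - y
              for x, y in zip(xs, ys)]
    return sum(e * e for e in errors) / (2 * len(errors))


# Cross-validation set: 64 cell midpoints, none of which is a training point.
cv_x = [-1.0 + (2 * j + 1) / 64 for j in range(64)]
cv_y = [target(x) for x in cv_x]

sizes = [11, 21, 41, 81, 161]  # each grid contains the previous one
j_train, j_cv = [], []
for m in sizes:
    train_x = grid(m)
    train_y = [target(x) for x in train_x]
    theta = fit_poly(train_x, train_y, DEGREE, LAMBDA)
    j_train.append(cost(theta, train_x, train_y))
    j_cv.append(cost(theta, cv_x, cv_y))

for m, jt, jc in zip(sizes, j_train, j_cv):
    print(f"m = {m:3d}: J_train = {jt:.6f}  J_cv = {jc:.6f}  gap = {jc - jt:.6f}")
```

At m = 11 the polynomial nearly interpolates the training points, so the training error is close to zero while the cross-validation error is large, which gives the wide gap. By m = 161 the gap has largely closed, in line with the claim that more data helps in the high variance case.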

Now, on the previous slide and this slide, I've drawn fairly clean, fairly idealized curves. If you plot these curves for an actual learning algorithm, sometimes you will actually see curves pretty much like what I've drawn here, although sometimes you see curves that are a little bit noisier and a little bit messier than this. But plotting learning curves like these can often help you figure out if your learning algorithm is suffering from bias, or variance, or even a little bit of both. So when I'm trying to improve the performance of a learning algorithm, one thing that I'll almost always do is plot these learning curves, and usually this will give you a better sense of whether there is a bias or variance problem.

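The bias and variance graphs above are drawn as fairly clean, idealized curves. For an actual learning algorithm you sometimes see curves much like these, but sometimes they are a little noisier and messier. Even so, plotting such curves often helps you work out whether the algorithm suffers from bias, variance, or a bit of both. When trying to improve the performance of a learning algorithm, plotting these curves is something to do almost every time, because it usually gives a better sense of where the problem lies.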
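The reading of the curves described above can be turned into a rough rule of thumb. The function below is a hypothetical helper, not something from the lecture: the name diagnose, the acceptable_error level, and the gap_ratio threshold are all assumptions chosen for illustration, and the curve values passed in are made up.

```python
def diagnose(j_train, j_cv, acceptable_error, gap_ratio=0.5):
    """Label the right-hand end of a learning curve.

    j_train, j_cv: errors measured at increasing training set sizes.
    acceptable_error: the error level you would be satisfied with (assumption).
    gap_ratio: fraction of the final J_cv that counts as a large gap (assumption).
    """
    final_train, final_cv = j_train[-1], j_cv[-1]
    large_gap = (final_cv - final_train) > gap_ratio * final_cv
    high_train_error = final_train > acceptable_error
    if high_train_error and large_gap:
        return "high bias and high variance"
    if high_train_error:
        return "high bias"
    if large_gap:
        return "high variance"
    return "no obvious problem"


# Made-up curve values, for illustration only.
flat_and_high = diagnose([0.00, 0.16, 0.22], [1.35, 0.30, 0.23], acceptable_error=0.05)
wide_gap = diagnose([0.00, 0.01, 0.02], [0.90, 0.50, 0.30], acceptable_error=0.05)
print(flat_and_high)  # high bias
print(wide_gap)       # high variance
```

Real curves are noisier than these inputs, so a check like this is best treated as a prompt to look at the plot and not as a replacement for it.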

And in the next video we'll see how this can help suggest specific actions to take, or to not take, in order to try to improve the performance of your learning algorithm.

The next lecture shows how this helps suggest specific actions to take, or not to take, to improve the performance of a learning algorithm.

Andrew Ng's Machine Learning video lecture

Summary

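A learning curve is a tool for diagnosing whether a particular learning algorithm has a bias problem or a variance problem. If you compute the training error Jtrain(θ) while increasing the number of training examples m, the training curve (average squared error on the training examples) rises gradually as the training set grows. If you compute the cross-validation error Jcv(θ) for the same increasing values of m, the cross-validation curve (average squared error on the cross-validation set) falls gradually as the training set grows.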

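If a learning algorithm suffers from high bias, Jtrain(θ) and Jcv(θ) both remain relatively high and end up close to each other as the number of training examples grows. In that case, collecting more examples is largely a waste of time.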