by 라인하트 Nov 02. 2020

Andrew Ng's Machine Learning (10-4): Bias and Variance

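Professor Andrew Ng, co-founder of the online course platform Coursera, is a leading figure in artificial intelligence. The lectures he gave at Stanford University for newcomers to machine learning are available for free as an online course on Coursera (Coursera.org). The course is a standard starting point for beginners, and one that people studying AI and machine learning on their own tend to come across naturally.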

Advice for Applying Machine Learning

Bias vs. Variance

Diagnosing Bias vs. Variance

If you run a learning algorithm and it doesn't do as well as you are hoping, almost all the time it will be because you have either a high bias problem or a high variance problem, in other words, either an underfitting problem or an overfitting problem. In this case, it's very important to figure out which of these two problems, bias or variance or a bit of both, you actually have, because knowing which of these is happening gives a very strong indicator of which ways of improving your algorithm are useful and promising. In this video, I'd like to delve more deeply into this bias and variance issue and understand it better, as well as figure out how to look at a learning algorithm and evaluate or diagnose whether we might have a bias problem or a variance problem, since this will be critical to figuring out how to improve the performance of a learning algorithm that you implement.

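When a learning algorithm underperforms, the cause is almost always high bias or high variance, that is, underfitting or overfitting. The first step in improving the algorithm is to determine whether the problem is bias, variance, or a bit of both, because that is the most reliable indicator of what to try next. This lecture looks more closely at bias and variance and explains how to evaluate and diagnose which one a learning algorithm suffers from.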

So, you've already seen this figure a few times: if you fit too simple a hypothesis, like a straight line, it underfits the data; if you fit too complex a hypothesis, it might fit the training set perfectly but overfit the data; and a hypothesis of some intermediate level of complexity, maybe a degree-two polynomial, not too low and not too high a degree, is just right and gives you the best generalization error of these options.

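We have seen this figure several times. The hypothesis on the left is a straight line that underfits the data, and the hypothesis on the right is a complicated, wiggly curve that overfits it. The hypothesis in the middle is a smooth curve that fits the data well: a second-degree polynomial, neither too simple nor too complex. Of the three options, it has the lowest generalization error.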

Now that we're armed with the notion of training, validation, and test sets, we can understand the concepts of bias and variance a little bit better. Concretely, let's let our training error and cross validation error be defined as in the previous videos: just the squared error, the average squared error, as measured on the training set or as measured on the cross validation set.

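With the notions of training, cross-validation, and test sets in hand, bias and variance become easier to understand. The training error and the cross-validation error are defined as in the previous lectures: each is the average squared error, measured on the training set or on the cross-validation set respectively.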
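For reference, the two errors can be written out as follows. This is a sketch that assumes the course's usual notation, which this passage does not spell out: m and m_cv are the numbers of training and cross-validation examples, h_θ is the hypothesis, and the factor of 1/2 is the convention used for the squared-error cost.

```latex
J_{\text{train}}(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\big(x^{(i)}\big) - y^{(i)} \right)^2
\qquad
J_{\text{cv}}(\theta) = \frac{1}{2m_{\text{cv}}} \sum_{i=1}^{m_{\text{cv}}} \left( h_\theta\big(x_{\text{cv}}^{(i)}\big) - y_{\text{cv}}^{(i)} \right)^2
```

Both are computed with the θ obtained by fitting the training set only; the cross-validation examples play no part in the fit.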

Now, let's plot the following figure. On the horizontal axis I'm going to plot the degree of polynomial. So, as I go to the right I'm going to be fitting higher and higher order polynomials. So at the left of this figure, where maybe d equals one, we're going to be fitting very simple functions, whereas here on the right of the horizontal axis I have much larger values of d, a much higher degree polynomial. So here, that's going to correspond to fitting much more complex functions to your training set. Let's look at the training error and the cross validation error and plot them on this figure. Let's start with the training error. As we increase the degree of the polynomial, we're going to be able to fit our training set better and better, and so if d equals one, then there is high training error; if we have a very high degree of polynomial, our training error is going to be really low, maybe even 0, because we'll fit the training set really well. So, as we increase the degree of polynomial, we find typically that the training error decreases. So I'm going to write J subscript train of theta there, because our training error tends to decrease with the degree of the polynomial that we fit to the data.

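Let's plot this. The horizontal axis shows the degree of the polynomial, and moving to the right means a higher-order polynomial. At the left, d = 1 is a first-degree hypothesis; further right, the hypotheses are higher-degree polynomials. Hypotheses toward the left are simple, and hypotheses toward the right are complex. We plot the training error and the cross-validation error, starting with the training error. Increasing the degree d lets the hypothesis fit the training set better. At d = 1 the training error is high, and as d grows the fit to the training set improves, so the training error can even reach 0. In other words, raising the degree of the polynomial lowers the training error. The pink curve is Jtrain(θ): the training error tends to decrease with the degree of the polynomial fitted to the data.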

Next, let's look at the cross-validation error or, for that matter, the test set error; if we plot the test set error, we'll get a pretty similar result to plotting the cross validation error. So, we know that if d equals one, we're fitting a very simple function and so we may be underfitting the training set, and so we're going to have a very high cross-validation error. If we fit an intermediate degree polynomial, we had d equals two in our example in the previous slide, we're going to have a much lower cross-validation error because we're finding a much better fit to the data. Conversely, if d were too high, so if d took on say a value of four, then we're overfitting, and so we again end up with a high value for cross-validation error. So, if you were to vary this smoothly and plot a curve, you might end up with a curve like that, where that's J cv of theta. Again, if you plot J test of theta you get something very similar.

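Next comes the cross-validation error; the test error behaves in much the same way. At d = 1 we are fitting a very simple function to the training set, which underfits, so the cross-validation error is very high. With an intermediate degree (d = 2) the fit is much better, so the cross-validation error drops. Conversely, when d is too high (d = 4) the model overfits and the cross-validation error rises again. Drawn smoothly, this gives the red curve Jcv(θ). Again, plotting Jtest(θ) gives a curve very similar to the cross-validation error.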
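The two curves described above can be reproduced numerically. The following is a minimal sketch, not the lecture's own code: it assumes NumPy, uses synthetic data drawn from a noisy quadratic, and fits one polynomial per degree with `np.polyfit`. The helper names `half_mse` and `errors_by_degree`, the data-generating function, and the 20/20 split are all hypothetical choices made for illustration.

```python
import numpy as np


def half_mse(theta, x, y):
    """Average squared error with the 1/2 factor: (1/(2m)) * sum((h(x) - y)^2)."""
    residuals = np.polyval(theta, x) - y
    return float(np.mean(residuals ** 2) / 2.0)


def errors_by_degree(x_train, y_train, x_cv, y_cv, degrees):
    """Fit one polynomial per degree on the training set and score it on both sets."""
    results = []
    for d in degrees:
        # Least-squares fit using the training examples only.
        theta = np.polyfit(x_train, y_train, deg=d)
        j_train = half_mse(theta, x_train, y_train)
        j_cv = half_mse(theta, x_cv, y_cv)
        results.append((d, j_train, j_cv))
    return results


# Synthetic data (an assumption for illustration): a quadratic plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=40)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.normal(scale=0.2, size=x.size)

# Hypothetical split: first half for training, second half for cross-validation.
x_train, y_train = x[:20], y[:20]
x_cv, y_cv = x[20:], y[20:]

results = errors_by_degree(x_train, y_train, x_cv, y_cv, degrees=range(1, 9))
for d, j_train, j_cv in results:
    print(f"d={d}  J_train={j_train:.4f}  J_cv={j_cv:.4f}")
```

On data like this, the training error should shrink as d grows, while the cross-validation error should drop sharply from d = 1 to d = 2. How much it climbs again at high degrees depends on the noise level and the number of training examples.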

So, this sort of plot also helps us to better understand the notions of bias and variance.

Concretely, suppose you have applied a learning algorithm and it's not performing as well as you are hoping, so if your cross-validation set error or your test set error is high, how can we figure out if the learning algorithm is suffering from high bias or suffering from high variance?

So, the setting of a cross-validation error being high corresponds to either this regime or this regime. So, this regime on the left corresponds to a high bias problem. That is, if you are fitting an overly low order polynomial such as d equals one when we really needed a higher order polynomial to fit the data, whereas in contrast this regime corresponds to a high variance problem. That is, if d, the degree of polynomial, was too large for the data set that we have, and this figure gives us a clue for how to distinguish between these two cases.

Concretely, for the high bias case, that is the case of underfitting, what we find is that both the cross validation error and the training error are going to be high. So, if your algorithm is suffering from a bias problem, the training set error will be high and you might find that the cross validation error will also be high. It might be close, maybe just slightly higher, than the training error. So, if you see this combination, that's a sign that your algorithm may be suffering from high bias. In contrast, if your algorithm is suffering from high variance, then if you look here, we'll notice that J train, that is the training error, is going to be low. That is, you're fitting the training set very well, whereas your error on the cross validation set, assuming that this is, say, the squared error which we're trying to minimize, that is, your cost function evaluated on the cross validation set, will be much bigger than your training set error. So, this is a double greater-than sign. That's the math symbol for "much greater than", denoted by two greater-than signs. So if you see this combination of values, then that's a clue that your learning algorithm may be suffering from high variance and might be overfitting.

The key that distinguishes these two cases is this: if you have a high bias problem, your training set error will also be high, as your hypothesis is just not fitting the training set well. If you have a high variance problem, your training set error will usually be low, that is, much lower than your cross-validation error.

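So this kind of plot makes the notions of bias and variance easier to understand. Suppose a learning algorithm is not performing well. If the cross-validation error or the test error is high, how can we tell whether the algorithm suffers from high bias or high variance? A high cross-validation error corresponds to the two ends of the plot, where the Error value on the vertical axis is largest. The left end is the high-bias regime: d = 1 is too low a degree, and a higher-order polynomial is needed. The right end is the high-variance regime: d = 4 is too high a degree for the data set, and a lower-order polynomial is needed.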

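If the algorithm has a high-bias problem, it is underfitting, and both the training error and the cross-validation error are high; the cross-validation error may be close to the training error or slightly higher. If the algorithm has a high-variance problem, it is overfitting: the training error is low but the cross-validation error is high. Fitting the training set very well means that the average squared error being minimized is small on the training set, while the cross-validation error is much larger than the training error: Jcv(θ) >> Jtrain(θ). The symbol >> means "much greater than" and is written with two greater-than signs.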

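Telling bias from variance is simple. With a high-bias problem, the training error is high and the cross-validation error is high as well. With a high-variance problem, the training error is low and the cross-validation error is much higher.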
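The rule of thumb above can be turned into a small helper. This is only a sketch under stated assumptions: the lecture gives no numeric thresholds, so `target_error` (what counts as a "high" error for the task) and `gap_ratio` (how many times larger Jcv must be to count as "much greater") are hypothetical parameters that have to be chosen per problem, and the function name `diagnose` is likewise made up for illustration.

```python
def diagnose(j_train, j_cv, target_error, gap_ratio=3.0):
    """Roughly label a model from its training and cross-validation error.

    target_error and gap_ratio are hypothetical, task-specific thresholds;
    the lecture only says "high", "low", and "much greater than".
    """
    # High bias: the hypothesis does not even fit the training set well.
    high_train = j_train > target_error
    # High variance: cross-validation error is high and far above training error.
    large_gap = j_cv > target_error and j_cv > gap_ratio * j_train

    if high_train and large_gap:
        return "high bias and high variance"
    if high_train:
        return "high bias (underfitting)"
    if large_gap:
        return "high variance (overfitting)"
    return "no obvious bias or variance problem"


# Example calls with made-up error values.
print(diagnose(j_train=0.40, j_cv=0.45, target_error=0.05))
print(diagnose(j_train=0.01, j_cv=0.30, target_error=0.05))
```

The first call has both errors high and close together, the pattern described for high bias; the second has a low training error and a much larger cross-validation error, the pattern described for high variance.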

So hopefully that gives you a somewhat better understanding of the two problems of bias and variance. I still have a lot more to say about bias and variance in the next few videos, and I'll show you more details on how to diagnose them. What we'll see is that figuring out whether a learning algorithm may be suffering from high bias, high variance, or a combination of both gives us much better guidance for what might be promising things to try in order to improve the performance of a learning algorithm.

That covers the bias and variance problems. The next lectures say more about both, including more ways to diagnose whether a learning algorithm is working properly. Diagnosis points to better choices for improving the algorithm's performance.