by 라인하트 Oct 10. 2020

Andrew Ng's Machine Learning (4-4): Learning Rate

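Professor Andrew Ng, a founder of the online course platform Coursera, is a leading figure in the artificial intelligence field. The lectures he gave to machine learning beginners at Stanford University are available for free, as is, through Coursera's online course (Coursera.org). The course is essential for anyone starting out in machine learning, and it is one you naturally run into when studying artificial intelligence and machine learning on your own.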

Linear Regression with Multiple Variables

Multivariate Linear Regression

Gradient Descent in Practice II: Learning Rate

In this video, I want to give you more practical tips for getting gradient descent to work. The ideas in this video will center around the learning rate alpha.

This lecture gives practical tips for getting gradient descent to work. Its topic is the learning rate α.

Concretely, here's the gradient descent update rule. And what I want to do in this video is tell you about what I think of as debugging, and some tips for making sure that gradient descent is working correctly. And second, I wanna tell you how to choose the learning rate alpha or at least how I go about choosing it.

Here is the gradient descent update rule.

The lecture covers two things: debugging, to check that gradient descent is working correctly, and how to choose the learning rate α.
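For reference, the update rule mentioned here can be written as follows. This is a sketch that assumes the notation from the earlier lectures in the course (m training examples, n features, hypothesis h_θ, and x_0 = 1), none of which is spelled out in this section.

```latex
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)
         = \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m}
           \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
\qquad \text{(simultaneously for } j = 0, 1, \dots, n\text{)}
```

The second form is specific to linear regression with the squared-error cost; the first form is the general rule, with α controlling the size of each step.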

Here's something that I often do to make sure that gradient descent is working correctly. The job of gradient descent is to find the value of theta for you that hopefully minimizes the cost function J(theta). What I often do is therefore plot the cost function J(theta) as gradient descent runs. So the x axis here is the number of iterations of gradient descent, and as gradient descent runs you hopefully get a plot that maybe looks like this. Notice that the x axis is the number of iterations. Previously we were looking at plots of J(theta) where the x axis, the horizontal axis, was the parameter vector theta, but that is not what this is.

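Here is a way to check that gradient descent is working correctly. The job of gradient descent is to find the parameters θ that minimize the cost function J(θ). So we plot how the value of J(θ) changes as gradient descent runs. The horizontal axis is the number of gradient descent iterations. If all goes well, the plot of J(θ) against the iteration count looks like the blue curve. Remember that the horizontal axis is the number of iterations. In earlier plots of the cost function J(θ), the horizontal axis was the parameter vector, but not here.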

Concretely, what this point is, is I'm going to run gradient descent for 100 iterations. And whatever value I get for theta after 100 iterations, I'm going to evaluate the cost function J(theta) for that value of theta, and this vertical height is the value of J(theta) for the value of theta I got after 100 iterations of gradient descent. And this point here corresponds to the value of J(theta) for the theta that I get after I've run gradient descent for 200 iterations. So what this plot is showing is the value of your cost function after each iteration of gradient descent. And if gradient descent is working properly then J(theta) should decrease after every iteration.

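The first red point is marked at 100 iterations: run gradient descent 100 times, take the resulting parameters θ, and compute the cost function J(θ). Both θ and J(θ) are the values obtained after 100 iterations. The next red point is marked at 200 iterations: run gradient descent 200 times, take the resulting θ, and compute J(θ). The plot therefore shows the value of the cost function after each iteration of gradient descent. When gradient descent is working properly, J(θ) decreases on every iteration.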
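The bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the function names, the toy data (y = 1 + 2x), and the choice of α = 0.1 are all assumptions made for the example.

```python
import numpy as np

def compute_cost(X, y, theta):
    """Squared-error cost J(theta) = 1/(2m) * sum((X @ theta - y)^2)."""
    m = len(y)
    residual = X @ theta - y
    return float(residual @ residual) / (2 * m)

def gradient_descent(X, y, alpha, num_iters):
    """Run batch gradient descent and record J(theta) after every iteration."""
    m, n = X.shape
    theta = np.zeros(n)
    J_history = []
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) / m
        theta = theta - alpha * gradient
        J_history.append(compute_cost(X, y, theta))
    return theta, J_history

# Hypothetical toy data: y = 1 + 2x, with a column of ones for the intercept.
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

theta, J_history = gradient_descent(X, y, alpha=0.1, num_iters=400)
```

Plotting `J_history` against its index (for example with matplotlib's `plt.plot(J_history)`) would give the kind of curve discussed here, with the iteration count on the horizontal axis.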

And one useful thing that this sort of plot can tell you also is that if you look at the specific figure that I've drawn, it looks like by the time you've gotten out to maybe 300 iterations, between 300 and 400 iterations, in this segment it looks like J(theta) hasn't gone down much more. So by the time you get to 400 iterations, it looks like this curve has flattened out here. And so way out here 400 iterations, it looks like gradient descent has more or less converged because your cost function isn't going down much more. So looking at this figure can also help you judge whether or not gradient descent has converged.

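The plot of J(θ) carries another useful piece of information. Between roughly 300 and 400 iterations, J(θ) barely decreases, and after about 400 iterations the curve is flat. After 400 iterations, gradient descent has more or less converged, because the value of J(θ) is no longer going down much. So the plot of J(θ) also helps you judge whether or not gradient descent has converged.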

By the way, the number of iterations that gradient descent takes to converge for a particular application can vary a lot, so maybe for one application, gradient descent may converge after just thirty iterations. For a different application, gradient descent may take 3,000 iterations; for another learning algorithm, it may take 3 million iterations. It turns out to be very difficult to tell in advance how many iterations gradient descent needs to converge. And it is usually by plotting this sort of plot, plotting the cost function as we increase the number of iterations, it is usually by looking at these plots that I try to tell if gradient descent has converged. It's also possible to come up with an automatic convergence test, namely to have an algorithm try to tell you if gradient descent has converged.

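Note that the number of iterations gradient descent needs to converge is not fixed. In some cases it converges in just 30 iterations, while in others it takes 3,000 or even 3 million. In other words, it is hard to know in advance how many iterations gradient descent will need. In practice, you plot the cost function J(θ) as the number of iterations grows and judge from the plot whether gradient descent has converged. There is also the automatic convergence test, an algorithm that tells you whether gradient descent has converged.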

And here's maybe a pretty typical example of an automatic convergence test. Such a test may declare convergence if your cost function J(theta) decreases by less than some small value epsilon, say 10 to the minus 3, in one iteration. But I find that choosing what this threshold should be is usually pretty difficult. And so in order to check whether gradient descent has converged, I actually tend to look at plots like these, like this figure on the left, rather than rely on an automatic convergence test.

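Here is an example of an automatic convergence test. It declares that gradient descent has converged when the decrease in the cost function J(θ) is smaller than some small value epsilon (ε). For example, with ε set to 0.001, gradient descent is considered converged once J(θ) decreases by less than 0.001 in a single iteration. In practice, however, choosing this threshold is quite difficult, so plotting J(θ) is preferred over relying on an automatic convergence test.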
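The test described above could look something like the following sketch. The function name `has_converged` and the decision to treat an increase in J(θ) as "not converged" are assumptions for illustration; the lecture only specifies the threshold idea and the example value ε = 10⁻³.

```python
def has_converged(J_history, epsilon=1e-3):
    """Return True if J decreased by less than epsilon in the last iteration.

    J_history is a list of cost values, one per iteration. An increase in J
    is treated as "not converged" (an assumption of this sketch).
    """
    if len(J_history) < 2:
        return False  # need two values to measure a decrease
    decrease = J_history[-2] - J_history[-1]
    return 0 <= decrease < epsilon
```

Such a check would typically be called inside the gradient descent loop to stop early. As the lecture notes, the result depends heavily on the scale of J(θ), which is part of why a fixed ε is hard to choose.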

Looking at this sort of figure can also tell you, or give you an advance warning, if maybe gradient descent is not working correctly. Concretely, if you plot J(theta) as a function of the number of iterations and you see a figure like this, where J(theta) is actually increasing, then that gives you a clear sign that gradient descent is not working. And a plot like this usually means that you should be using a smaller learning rate alpha. If J(theta) is actually increasing, the most common cause is that you're trying to minimize a function that maybe looks like this, but your learning rate is too big: if you start off there, gradient descent may overshoot the minimum and send you there. And if the learning rate is too big, you may overshoot again and it sends you there, and so on. What you really wanted was for it to start here and to slowly go downhill, right? But if the learning rate is too big, then gradient descent can instead keep on overshooting the minimum, so that you actually end up getting worse and worse, getting to higher values of the cost function J(theta). So you end up with a plot like this, and if you see a plot like this, the fix is usually just to use a smaller value of alpha. Oh, and also, of course, make sure your code doesn't have a bug in it. But usually too large a value of alpha is the common problem.

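Now consider the case where gradient descent is not working correctly. If the cost function J(θ) increases as gradient descent iterates, gradient descent is clearly not working. We have already seen one way this happens: the learning rate α is so large that the algorithm overshoots the minimum and fails to turn back toward it. Because the learning rate is large, gradient descent keeps overshooting the minimum and moves to higher and higher values, so things get progressively worse and the plot of J(θ) goes up. The fix is to use a smaller learning rate α, assuming of course that the code itself has no bugs.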
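The overshooting behaviour is easy to reproduce on a one-dimensional example. The sketch below uses the hypothetical cost J(θ) = θ², which is not from the lecture; it was chosen because each update simply multiplies θ by (1 − 2α), so the effect of α can be read off directly.

```python
def run_gradient_descent_1d(alpha, theta0=1.0, num_iters=5):
    """Minimize J(theta) = theta**2 and return J after each iteration."""
    theta = theta0
    J_history = []
    for _ in range(num_iters):
        gradient = 2.0 * theta            # dJ/dtheta for J = theta^2
        theta = theta - alpha * gradient  # theta becomes (1 - 2*alpha) * theta
        J_history.append(theta ** 2)
    return J_history

# alpha = 0.1: theta shrinks by a factor of 0.8 per step, so J goes down.
J_small = run_gradient_descent_1d(alpha=0.1)

# alpha = 1.1: theta is multiplied by -1.2 per step, so each update jumps
# past the minimum at 0 and lands farther away, and J goes up.
J_large = run_gradient_descent_1d(alpha=1.1)
```

With the larger α the sign of θ flips on every step while its magnitude grows, which is the "overshoot again and again" picture from the lecture.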

Similarly, sometimes you may also see J(theta) do something like this: it may go down for a while, then go up, then go down for a while, then go up, and so on. And a fix for something like this is also to use a smaller value of alpha. I'm not going to prove it here, but under certain assumptions about the cost function J, which do hold true for linear regression, the following result is known.

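Sometimes the plot of J(θ) looks like a wave: it goes down for a while, goes up, then goes down and up again, over and over. In this case too, the fix is a smaller learning rate α. This will not be proved here, but it does hold for the cost function J(θ) of linear regression.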

Mathematicians have shown that if your learning rate alpha is small enough, then J(theta) should decrease on every iteration. So if this doesn't happen probably means the alpha's too big, you should set it smaller. But of course, you also don't want your learning rate to be too small because if you do that then the gradient descent can be slow to converge. And if alpha were too small, you might end up starting out here, say, and end up taking just minuscule baby steps. And just taking a lot of iterations before you finally get to the minimum, and so if alpha is too small, gradient descent can make very slow progress and be slow to converge.

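Mathematicians have shown that when the learning rate α is small enough, the cost function J(θ) decreases on every iteration. So if J(θ) does not decrease, α is probably too large. Conversely, if α is too small, gradient descent converges to the minimum very slowly: the algorithm takes baby steps and needs a huge number of iterations to reach the minimum.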
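One standard way to make "small enough" precise is the following bound. It is a sketch under an assumption the lecture does not state explicitly: the gradient of J is Lipschitz continuous with constant L (the symbol L is introduced here, not in the lecture).

```latex
J\bigl(\theta - \alpha \nabla J(\theta)\bigr)
  \;\le\; J(\theta) - \alpha \left( 1 - \frac{\alpha L}{2} \right)
          \lVert \nabla J(\theta) \rVert^2
```

For 0 < α < 2/L the factor α(1 − αL/2) is positive, so J(θ) strictly decreases on every step unless the gradient is already zero. For linear regression with the squared-error cost J(θ) = (1/(2m))‖Xθ − y‖², the constant L can be taken as the largest eigenvalue of XᵀX/m.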

To summarize, if the learning rate is too small, you can have a slow convergence problem, and if the learning rate is too large, J(theta) may not decrease on every iteration and it may not even converge. In some cases if the learning rate is too large, slow convergence is also possible. But the more common problem you see is just that J(theta) may not decrease on every iteration. And in order to debug all of these things, often plotting that J(theta) as a function of the number of iterations can help you figure out what's going on.

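To summarize: if the learning rate α is too small, J(θ) converges to the minimum slowly; if α is too large, J(θ) may not decrease on every iteration and may not converge at all. Occasionally a learning rate that is too large also causes slow convergence, but the more common symptom is that J(θ) does not decrease on every iteration. A good way to debug these problems is to plot J(θ) against the number of gradient descent iterations.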

Concretely, what I actually do when I run gradient descent is try a range of values. So just try running gradient descent with a range of values for alpha, like 0.001 and 0.01. So these are factor of ten differences. And for these different values of alpha, I just plot J(theta) as a function of the number of iterations, and then pick the value of alpha that seems to be causing J(theta) to decrease rapidly. In fact, what I do actually isn't these steps of ten, where each step up is a scale factor of ten. What I actually do is try this range of values, where this is 0.001. I'll then increase the learning rate threefold to get 0.003. And then this step up is another roughly threefold increase, from 0.003 to 0.01. And so this is, roughly, trying out gradient descent with each value being about 3x bigger than the previous value. So what I'll do is try a range of values until I've found one value that's too small and made sure that I've found one value that's too large. And then I'll sort of try to pick the largest possible value, or just something slightly smaller than the largest reasonable value that I found. And when I do that, usually it just gives me a good learning rate for my problem. And if you do this too, maybe you'll be able to choose a good learning rate for your implementation of gradient descent.

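Let's look at how to choose the range of learning rates α to try when running gradient descent. Try a range of values that differ by a factor of ten, such as 0.001 and 0.01. For each α, plot the cost function J(θ) against the number of iterations, and pick the α for which J(θ) decreases fastest.

In practice, the steps do not have to be exactly tenfold. Start with α = 0.001. The next value is three times larger, 0.003, and roughly three times 0.003 is 0.01. So gradient descent is tested with each α about three times larger than the previous one. Generally, when choosing α, first find one value that is too small and one that is too large, then start from the largest value that works, or one slightly smaller, and lower it step by step until you find the most suitable learning rate α.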
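A sweep over learning rates spaced roughly 3x apart might be sketched as follows. The toy data (y = 1 + 2x), the helper name `final_cost`, the fixed budget of 50 iterations, and the upper end of the range are all assumptions for illustration; the lecture only gives the 0.001, 0.003, 0.01 pattern.

```python
import numpy as np

def final_cost(X, y, alpha, num_iters=50):
    """Run batch gradient descent for linear regression; return J after num_iters."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / m
    residual = X @ theta - y
    return float(residual @ residual) / (2 * m)

# Hypothetical toy data: y = 1 + 2x, with a column of ones for the intercept.
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

# Candidate learning rates, each roughly 3x the previous one.
alphas = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0]
costs = {alpha: final_cost(X, y, alpha) for alpha in alphas}

# The alpha with the lowest cost after the fixed iteration budget.
best_alpha = min(costs, key=costs.get)
```

On this toy problem the smallest values barely move J(θ) and the largest one diverges, which brackets the usable range in the way the lecture describes. Comparing only the final cost is a shortcut; looking at the full J(θ) curve for each α, as recommended above, gives a more reliable picture.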

Andrew Ng's Machine Learning video lecture

Wrapping up: the learning rate α

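The learning rate α determines the size of the steps gradient descent takes on its way to the minimum of the cost function J(θ). The difficulty is that it is hard to know in advance how many iterations gradient descent will need to converge. Usually, you watch how the plot of J(θ) changes as gradient descent iterates and judge from it whether J(θ) has converged.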

If the learning rate α is too small, J(θ) converges to the minimum slowly; if α is too large, J(θ) may fail to decrease or fail to converge. So when J(θ) does not converge, reduce the learning rate α.

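When choosing the learning rate α, first find one value that is too small and one that is too large, then start from the largest value that works, or one slightly smaller. Lower α step by step until you find the most suitable value, with each step either ten times or about three times smaller than the previous one.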