brunch

You can make anything
by writing

C.S.Lewis

by 라인하트 Dec 16. 2020

앤드류 응의 머신러닝(17-4):확률적 경사하강법 수렴

온라인 강의 플랫폼 코세라의 창립자인 앤드류 응 (Andrew Ng) 교수는 인공지능 업계의 거장입니다. 그가 스탠퍼드 대학에서 머신 러닝 입문자에게 한 강의를 그대로 코세라 온라인 강의 (Coursera.org)에서 무료로 배울 수 있습니다. 이 강의는 머신러닝 입문자들의 필수코스입니다. 인공지능과 머신러닝을 혼자 공부하면서 자연스럽게 만나게 되는 강의입니다.

Large Scale Machine Learning

(대규모 머신러닝)

Gradient Descent with Large Datasets

(대규모 데이터셋과 경사 하강법)

Stochastic Gradient Descent Convergence

(확률적 경사 하강법 수렴)

You now know about the stochastic gradient descent algorithm. But when you're running the algorithm, how do you make sure that it's completely debugged and is converging okay? Equally important, how do you tune the learning rate alpha with Stochastic Gradient Descent. In this video we'll talk about some techniques for doing these things, for making sure it's converging and for picking the learning rate alpha.

지금까지 확률적 경사 하강법 알고리즘을 공부했습니다. 하지만, 알고리즘이 완전히 수렴하였는 지를 어떻게 확인할 수 있을까요? 답은 다른 알고리즘과 마찬가지로 학습률 α를 조정하는 것입니다. 이번 강의에서 알고리즘이 제대로 동작하고 수렴하는 지를 확인하기 위해 학습률 α를 결정하는 몇 가지 방법을 설명합니다.

Back when we were using batch gradient descent, our standard way for making sure that gradient descent was converging was we would plot the optimization cost function as a function of the number of iterations. So that was the cost function and we would make sure that this cost function is decreasing on every iteration. When the training set sizes were small, we could do that because we could compute the sum pretty efficiently. But when you have a massive training set size then you don't want to have to pause your algorithm periodically. You don't want to have to pause stochastic gradient descent periodically in order to compute this cost function since it requires a sum of your entire training set size. And the whole point of stochastic gradient was that you wanted to start to make progress after looking at just a single example without needing to occasionally scan through your entire training set right in the middle of the algorithm, just to compute things like the cost function of the entire training set.

배치 경사 하강법을 사용할 때 알고리즘이 수렴하는 지를 확인하는 기본적인 방법은 반복 횟수를 함수로 하는 최적화 비용 함수를 도식화하는 것입니다. 비용 함수 Jtrain(θ)가 매 반복마다 감소하는 지를 확인하는 것입니다.

Jtrain(θ) = 1/(2m) * Σ (hθ(x^(i)) - y^(i))^2

i=1

학습 셋의 크기가 작을 때는 알고리즘이 모든 학습 예제에 대한 합산을 효율적으로 할 수 있습니다. 그러나, 학습 셋의 크기가 3억 개 정도로 엄청나게 클 때는 알고리즘이 모든 학습 예제에 대한 합산을 빠르게 할 수 없습니다. 확률적 경사 하강법은 전체 학습 셋의 합산하기 위해 알고리즘의 속도가 느려지는 것을 방지합니다.

확률적 경사 하강법의 핵심은 매 스텝마다 전체 학습 셋을 계산할 필요 없이 단 하나의 예제만 계산하는 것입니다.

So for stochastic gradient descent, in order to check the algorithm is converging. Here's what we can do instead. So, while the algorithm is looking at the example xi, yi, but before it has updated the parameters theta using that an example, let's compute the cost of that example. Just to say the same thing again, but using slightly different words. Let's take the definition of the cost that we had previously. So the cost of the parameters theta with respect to a single training example is just one half of the square error on that training example. Then, while stochastic gradient descent is learning, right before we train on a specific example. So, in stochastic gradient descent we're going to look at the examples xi, yi, in order, and then sort of take a little update with respect to this example. And we go on to the next example, xi plus 1, yi plus 1, and so on, right? That's what stochastic gradient descent does. A stochastic gradient descent is scanning through our training set right before we have updated theta using a specific training example x(i) comma y(i) let's compute how well our hypothesis is doing on that training example. And we want to do this before updating theta because if we've just updated theta using example, you know, that it might be doing better on that example than what would be representative.

따라서, 확률적 경사 하강법이 수렴하는 지를 확인해야 합니다. 알고리즘이 파라미터 θ를 업데이트하기 전에 학습 예제 (x^(i), y^(i))의 비용 Cost (θ, (x^(i), y^(i)))을 먼저 계산합니다.

Cost (θ, (x^(i), y^(i))) = 1/2 * (hθ(x^(i)) - y^(i))^2

다시 말해서 약간 다른 단어를 사용해 보겠습니다. 여기 전에 사용했던 비용 함수의 정의가 있습니다. 단일 학습 예제에 대한 파라미터 θ의 비용은 학습 예제의 실측치에 실제값의 오차의 제곱에 절반에 불과합니다. 확률적 경사 하강법이 학습 예제 (x^(i), y^(i))를 업데이트하기 전에 Cost (θ, (x^(i), y^(i)))를 먼저 계산하고 파라미터 θ를 업데이트를 할 것입니다 그리고 다음 학습 예제 (x^(i+1), y^(i+1))로 넘어갑니다. 이것이 확률적 경사 하강법의 동작 방식입니다. 따라서, 확률적 경사 하강법은 특정 훈련 예제 (x^(i), y^(i))를 계산한 후 θ를 업데이트하기 직전에 학습 셋에 대해 Cost (θ, (x^(i), y^(i)))를 먼저 계산합니다. 즉, 파라미터 θ를 업데이트하기 전에 가설이 얼마나 잘 수행되는 지를 확인할 수 있습니다.

Finally, in order to check for the convergence of stochastic gradient descent, what we can do is every, say, every thousand iterations, we can plot these costs that we've been computing in the previous step. We can plot those costs average over, say, the last thousand examples processed by the algorithm. And if you do this, it kind of gives you a running estimate of how well the algorithm is doing on, you know, the last 1000 training examples that your algorithm has seen train periodically which needed to scan through the entire training set. So, in contrast to computing Jtrain periodically which needed to scan through the entire training set.With this other procedure, well, as part of stochastic gradient descent, it doesn't cost much to compute these costs as well right before updating to parameter theta. And all we're doing is every thousand integrations or so, we just average the last 1,000 costs that we computed and plot that. And by looking at those plots, this will allow us to check if stochastic gradient descent is converging.

마지막으로 확률적 경사 하강법이 수렴을 확인하는 방법은 매 1,000번의 반복마다 전에 계산했던 Cost() 함수를 도식화하는 것입니다. 예를 들어, 알고리즘이 마지막 1,000 개의 예제에 대해 계산한 Cost ())의 평균을 도식화할 수 있습니다. 알고리즘이 얼마나 잘 작동하는지에 대한 추정치를 알 수 있습니다. 알고리즘이 전체 학습 셋을 스캔하기 위해 계산한 마지막 1,000 개의 학습 예제를 알 수 있습니다. 따라서, 전체 학습 셋을 스캔해서 주기적으로 Jtrain(θ)를 계산하는 것과 달리 확률적 경사 하강법의 일부로 파라미터 θ를 업데이트하기 직전에 Cost ()를 계산하는 것은 계산 비용이 높지 않습니다. 즉, 매 1,000회마다 합산하는 것입니다. 마지막 1,000개 예제의 Cost ()의 평균을 구하고 도식화합니다. 이 그래프를 보면 확률적 경사 하강법이 수렴하는지 확인할 수 있습니다.

So here are a few examples of what these plots might look like. Suppose you have plotted the cost average over the last thousand examples, because these are averaged over just a thousand examples, they are going to be a little bit noisy and so, it may not decrease on every single iteration. Then if you get a figure that looks like this, So the plot is noisy because it's average over, you know, just a small subset, say a thousand training examples. If you get a figure that looks like this, you know that would be a pretty decent run with the algorithm, maybe, where it looks like the cost has gone down and then this plateau that looks kind of flattened out, you know, starting from around that point. look like, this is what your cost looks like then maybe your learning algorithm has converged. If you want to try using a smaller learning rate, something you might see is that the algorithm may initially learn more slowly so the cost goes down more slowly.

But then eventually you have a smaller learning rate is actually possible for the algorithm to end up at a, maybe very slightly better solution. So the red line may represent the behavior of stochastic gradient descent using a slower, using a smaller leaning rate. And the reason this is the case is because, you remember, stochastic gradient descent doesn't just converge to the global minimum, is that what it does is the parameters will oscillate a bit around the global minimum. And so by using a smaller learning rate, you'll end up with smaller oscillations. And sometimes this little difference will be negligible and sometimes with a smaller than you can get a slightly better value for the parameters.

여기 비용 함수를 도식화한 몇 가지 사례가 있습니다. 최근 1,000개의 예제에 대한 Cost() 평균을 도식화합니다. 단지 1,000개의 예제에 대한 평균이므로 진동이 많아서 매 반복마다 감소하지 않는 것처럼 보일 수 있습니다. 파란색 곡선은 평균 이상으로 진동합니다. 1,000개의 학습 예제가 있고, 이와 같은 그래프를 얻는다면 꽤 괜찮은 알고리즘입니다. Cost() 함수는 지속적으로 하강하는 것처럼 보입니다. 평평한 부분에 다다르면 학습 알고리즘이 수렴되었다고 볼 수 있습니다. 더 작은 학습률 α를 사용하면 알고리즘이 더 느리게 학습할 것입니다. 더 작은 학습률 α를 가진 빨간색 곡선은 약간 더 나은 솔루션으로 보입니다. 빨간색 곡선은 더 작은 경사율을 사용하여 더 느린 속도로 하강하는 확률적 경사 하강법의 동작을 나타낸 것입니다. 확률적 경사 하강법은 단순히 전역 최소값으로 수렴하는 것이 아니라 파라미터가 전역 최소값 주변에서 진동한다는 것을 명심하십시오. 저 작은 학습률을 사용할수록 진동은 더 작아집니다. 때때로 이 작은 차이는 무시할 수 있고, 때때로 파라미터의 값은 전역 최소값보다 약간 더 작은 값이 될 수 있습니다.

Here are some other things that might happen. Let's say you run stochastic gradient descent and you average over a thousand examples when plotting these costs. So, you know, here might be the result of another one of these plots.

Then again, it kind of looks like it's converged. If you were to take this number, a thousand, and increase to averaging over 5 thousand examples. Then it's possible that you might get a smoother curve that looks more like this. And by averaging over, say 5,000 examples instead of 1,000, you might be able to get a smoother curve like this. And so that's the effect of increasing the number of examples you average over. The disadvantage of making this too big of course is that now you get one date point only every 5,000 examples. And so the feedback you get on how well your learning learning algorithm is doing is, sort of, maybe it's more delayed

because you get one data point on your plot only every 5,000 examples rather than every 1,000 examples.

우측 상단에는 또 다른 그림이 있습니다. 파란색 곡선은 확률적 경사 하강법을 실행하고 1,000 개의 예제의 Cost()의 평균을 그린 것입니다. 파란색 곡선은 수렴된 것처럼 보입니다. 빨간색 곡선은 1,000개에서 5,000개로 늘린 학습 예제의 평균입니다. 훨씬 더 부드러운 곡선이 만들어집니다. 이것이 평균을 내는 예제의 수를 늘리는 효과입니다. 단점은 학습 알고리즘이 얼마나 잘 수행되는지에 대한 피드백이 늦어집니다. 왜냐하면 1,000개의 예제가 아닌 5,000개의 예제마다 도식화하기 때문에 데이터 포인트가 늦어집니다.

Along a similar vein some times you may run a gradient descent and end up with a plot that looks like this. And with a plot that looks like this, you know, it looks like the cost just is not decreasing at all. It looks like the algorithm is just not learning. It's just, looks like this here a flat curve and the cost is just not decreasing. But again if you were to increase this to averaging over a larger number of examples it is possible that you see something like this red line it looks like the cost actually is decreasing, it's just that the blue line averaging over 2, 3 examples, the blue line was too noisy so you couldn't see the actual trend in the cost actually decreasing and possibly averaging over 5,000 examples instead of 1,000 may help. Of course we averaged over a larger number examples that we've averaged here over 5,000 examples, I'm just using a different color, it is also possible that you that see a learning curve ends up looking like this. That it's still flat even when you average over a larger number of examples. And as you get that, then that's maybe just a more firm verification that unfortunately the algorithm just isn't learning much for whatever reason. And you need to either change the learning rate or change the features or change something else about the algorithm.

비슷하지만 왼쪽 하단의 그림의 파란색 곡선처럼 나타날 수도 있습니다. 그래프는 전혀 Cost()가 감소하지 않는 것 같습니다. 알고리즘이 학습을 제대로 하지 못하는 것 같습니다. 하지만, 예제의 수를 늘려서 평균을 구하면 빨간색 곡선처럼 보일 수 있습니다. Cost()가 감소하는 것처럼 보입니다. 파란색 곡선이 너무 진동이 심해서 실제 추세를 볼 수가 없습니다. 1,000개의 예제가 도움이 될 수도 있고 5,000개의 예제가 도움이 될 수도 있습니다. 분홍색 곡선은 더 많은 예제의 수를 사용할 때는 평평합니다. 불행히도 알고리즘이 어떤 이유로든 학습을 제대로 하지 못한다고 생각할 수 있습니다. 학습률 α를 변경하거나 피처를 변경하거나 알고리즘을 변경해야 합니다.

Finally, one last thing that you might see would be if you were to plot these curves and you see a curve that looks like this, where it actually looks like it's increasing. And if that's the case then this is a sign that the algorithm is diverging. And what you really should do is use a smaller value of the learning rate alpha.

마지막으로 우측 하단의 마지막 그림은 매 1,000개의 예제의 Cost() 평균이 점점 상승하고 있습니다. 알고리즘이 발산한다는 신호입니다. 즉, 학습률 α를 더 작은 값으로 줄여야 합니다.

So hopefully this gives you a sense of the range of phenomena you might see when you plot these cost average over some range of examples as well as suggests the sorts of things you might try to do in response to seeing different plots. So if the plots looks too noisy, or if it wiggles up and down too much, then try increasing the number of examples you're averaging over so you can see the overall trend in the plot better. And if you see that the errors are actually increasing, the costs are actually increasing, try using a smaller value of alpha.

Cost()의 평균을 도식화할 때 볼 수 있는 그래프들을 정리했습니다. 여러분들이 Cost() 그래프에 대한 감각을 가지길 바랍니다. 곡선이 너무 진동하거나 위아래로 너무 많이 흔들리는 경우 평균을 구하는 예제의 수를 늘려서 전체 추세를 더 잘 살펴볼 수 있습니다. 오류가 증가하고 Cost()가 증가한다면 더 작은 학습률 α를 사용합니다.

Finally, it's worth examining the issue of the learning rate just a little bit more.

We saw that when we run stochastic gradient descent, the algorithm will start here and sort of meander towards the minimum. And then it won't really converge, and instead it'll wander around the minimum forever. And so you end up with a parameter value that is hopefully close to the global minimum that won't be exact at the global minimum.

마지막으로 학습률 α에 대한 문제를 조금 더 살펴보겠습니다. 확률적 경사 하강법을 실행할 때 알고리즘이 최소값을 향해 구불구불하게 이동합니다. 그런 다음 수렴하지 않고 최소값 근처를 돌아다닙니다. 따라서, 전역 최소값에 가까운 파라미터 값으로 끝납니다.

In most typical implementations of stochastic gradient descent, the learning rate alpha is typically held constant. And so what you we end up is exactly a picture like this. If you want stochastic gradient descent to actually converge to the global minimum, there's one thing which you can do which is you can slowly decrease the learning rate alpha over time. So, a pretty typical way of doing that would be to set alpha equals some constant 1 divided by iteration number plus constant 2. So, iteration number is the number of iterations you've run of stochastic gradient descent, so it's really the number of training examples you've seen And const 1 and const 2 are additional parameters of the algorithm that you might have to play with a bit in order to get good performance. One of the reasons people tend not to do this is because you end up needing to spend time playing with these 2 extra parameters, constant1 and constant2, and so this makes the algorithm more finicky. You know, it's just more parameters able to fiddle with in order to make the algorithm work well. But if you manage to tune the parameters well, then the picture you can get is that the algorithm will actually around towards the minimum, but as it gets closer because you're decreasing the learning rate the meanderings will get smaller and smaller until it pretty much just to the global minimum. I hope this makes sense, right? And the reason this formula makes sense is because as the algorithm runs, the iteration number becomes large So alpha will slowly become small, and so you take smaller and smaller steps until it hopefully converges to the global minimum. So If you do slowly decrease alpha to zero you can end up with a slightly better hypothesis. But because of the extra work needed to fiddle with the constants and because frankly usually we're pretty happy with any parameter value that is, you know, pretty close to the global minimum. Typically this process of decreasing alpha slowly is usually not done and keeping the learning rate alpha constant is the more common application of stochastic gradient descent although you will see people use either version.

일반적으로 확률적 경사 하강법은 학습률 α 를 일정하게 유지합니다. 하지만, 확률적 경사 하강법이 실제로 전역 최소값으로 수렴하기 위해서 시간이 지남에 따라 학습률 α 를 천천히 줄일 수 있습니다. 학습률 α 를 설정하는 가장 일반적인 방법은 다음과 같습니다.

학습률 α = const1 / (반복 회수 + const2)

따라서, 반복 회수는 확률적 경사 하강법을 실행한 반복 회수입니다. const1과 const2는 성능을 향상하기 위한 추가적인 파라미터입니다. 주로 학습률을 변경하지 않는 이유는 두 개의 추가적인 파라미터를 가지고 조작해야 할 시간이 필요하기 때문입니다. 즉, 알고리즘이 까다로워진다는 것입니다. 알고리즘이 잘 작동하도록 하기 위해 조작해야 할 파라미터가 더 많습니다. 하지만, 파라미터를 잘 조정하면 최소값에 점점 근접할 수 있습니다. 가까워질수록 학습률이 낮아지기 때문에 구불구불한 것은 점점 작아질 것입니다. 이 공식이 타당한 이유는 알고리즘이 실행될수록 반복 회수가 커지기 때문입니다. 학습률 α는 서서히 작아져서 전체가 최소값에 수렴될 때까지 스텝을 반복합니다. 따라서, 학습률 α가 0에 가까워지면 약간 더 나은 가설을 얻을 수 있습니다. 그러나 const1과 const2 상수를 조작하기 위한 추가 작업과 시간이 필요하고 사람들은 파라미터가 전역 최소값에 가까운 값에 충분히 만족하기 때문에 잘 사용하지 않습니다.

To summarize in this video we talk about a way for approximately monitoring

how the stochastic gradient descent is doing in terms for optimizing the cost function. And this is a method that does not require scanning over the entire training set periodically to compute the cost function on the entire training set.

But instead it looks at say only the last thousand examples or so. And you can use this method both to make sure the stochastic gradient descent is okay and is converging or to use it to tune the learning rate alpha.

이번 강의를 요약하면 대략적인 모니터링 방법을 설명했습니다. 비용 함수 J를 최적화하기 위해 확률적 경사 하강법의 동작 방식을 설명했습니다. 확률적 경사 하강법은 전체 학습 셋을 반복적으로 스캔할 필요가 없는 방법으로 마지막 1,000개 정도의 예제만 살펴봅니다. 이 방법으로 확률적 경사 하강법이 정상적으로 수렴하는 지를 확인하고 학습률 α를 조정할 수 있습니다.

앤드류 응의 머신러닝 동영상 강의

정리하며

배치 경사 하강법은 매 경사 하강 스텝마다 전체 학습 셋을 합산하지만, 확률적 경사 하강법은 매 경사 하강 스텝마다 단 하나의 학습 셋을 계산합니다. 미니-배치 경사 하강법은 매 경사 하강 스텝마다 약 10개의 학습 셋을 묶은 미니-배치 단위로 계산합니다. 벡터화 구현이 좋다면 미니-배치 경사 하강법은 가장 효율적인 방법입니다.

확률적 경사 하강법은 전역 최소값으로 수렴하는 것이 아니라 파라미터가 전역 최소값 주변에서 맴돕니다. 알고리즘이 완전히 수렴하였는 지를 확인하기 위한 방법이 필요합니다.

우선, 확률적 경사 하강법은 특정 훈련 예제 (x^(i), y^(i))를 계산한 후 θ를 업데이트하기 직전에 학습 셋에 대해 Cost (θ, (x^(i), y^(i)))를 먼저 계산합니다. 즉, 파라미터 θ를 업데이트하기 전에 가설이 얼마나 잘 수행되는 지를 확인할 수 있습니다.

Cost (θ, (x^(i), y^(i))) = 1/2 * (hθ(x^(i)) - y^(i))^2

두 번째로 확률적 경사 하강법이 수렴하는지 확인하는 방법은 매 1,000번의 반복마다 Cost() 함수를 도식화하여 확인합니다. 그래프의 모양에 따라 학습률 α를 조정하거나 반복하는 학습 예제의 수를 늘리거나 줄입니다.