brunch

You can make anything
by writing

C.S.Lewis

by 라인하트 Dec 15. 2020

앤드류 응의 머신러닝(17-3):미니-배치 경사 하강법

온라인 강의 플랫폼 코세라의 창립자인 앤드류 응 (Andrew Ng) 교수는 인공지능 업계의 거장입니다. 그가 스탠퍼드 대학에서 머신 러닝 입문자에게 한 강의를 그대로 코세라 온라인 강의 (Coursera.org)에서 무료로 배울 수 있습니다. 이 강의는 머신러닝 입문자들의 필수코스입니다. 인공지능과 머신러닝을 혼자 공부하면서 자연스럽게 만나게 되는 강의입니다.

Large Scale Machine Learning

(대규모 머신러닝)

Gradient Descent with Large Datasets

(대규모 데이터셋과 경사 하강법)

Mini-Batch Gradient Descent (미니-배치 경사 하강법)

In the previous video, we talked about Stochastic gradient descent, and how that can be much faster than Batch gradient descent. In this video, let's talk about another variation on these ideas is called Mini-batch gradient descent they can work sometimes even a bit faster than stochastic gradient descent.

지난 강의에서 확률적 경사 하강법의 동작 방식과 배치 경사 하강법보다 훨씬 더 빠른 이유를 설명했습니다. 이번 강의에서 미니 배치 경사 하강법을 설명합니다. 미니-배치 경사 하강법은 때때로 확률적 경사 하강법보다 조금 더 빠르게 동작할 수 있습니다.

To summarize the algorithms we talked about so far. In Batch gradient descent we will use all m examples in each generation. Whereas in Stochastic gradient descent we will use a single example in each generation. What Mini-batch gradient descent does is somewhere in between. Specifically, with this algorithm we're going to use b examples in each iteration where b is a parameter called the "mini batch size" so the idea is that this is somewhat in-between Batch gradient descent and Stochastic gradient descent. This is just like batch gradient descent, except that I'm going to use a much smaller batch size. A typical choice for the value of b might be b equals 10, lets say, and a typical range really might be anywhere from b equals 2 up to b equals 100. So that will be a pretty typical range of values for the Mini-batch size. And the idea is that rather than using one example at a time or m examples at a time we will use b examples at a time. So let me just write this out informally, we're going to get, let's say, b.

지금까지 다룬 알고리즘을 요약합니다. 배치 경사 하강법은 각 스텝마다 모든 m개의 학습 예제를 사용합니다. 확률적 경사 하강법은 각 스텝마다 하나의 학습 예제만을 사용합니다. 미니-배치 경사 하강법은 두 알고리즘 사이 어딘가에 있습니다. 미니-배치 경사 하강법은 각 스텝마다 b개의 예제를 사용합니다. 여기서 파라미터 b는 미니-배치의 크기입니다. 따라서, 이것이 두 알고리즘 사이 어딘가에 있다는 의미입니다. 미니-배차 경사 하강법은 배치 경사 하강법보다 훨씬 더 작은 크기의 배치를 사용한다는 점만 다릅니다. b의 값은 보통 10입니다. 보통 일반적인 b의 범위는 b = 2에서 100 사이입니다. 미니-배치 경사 하강법은 확률적 경사 하강법보다 훨씬 더 많은 b개의 예제를 사용한다는 점만 다릅니다.

For this example, let's say b equals 10. So we're going to get, the next 10 examples from my training set so that may be some set of examples xi, yi. If it's 10 examples then the indexing will be up to x (i+9), y (i+9) so that's 10 examples altogether and then we'll perform essentially a gradient descent update using these 10 examples. So, that's any rate times one tenth times sum over k equals i through i+9 of h subscript theta of x(k) minus y(k) times x(k) j. And so in this expression, where summing the gradient terms over my ten examples. So, that's number ten, that's, you know, my mini batch size and just i+9 again, the 9 comes from the choice of the parameter b, and then after this we will then increase, you know, i by tenth, we will go on to the next ten examples and then keep moving like this. So just to write out the entire algorithm in full.

예를 들면, b = 10이라고 가정합니다. 학습 셋에서 다음 10개의 예제를 추출합니다. 10개의 학습 예제는 (x^(i), y^(i))부터 x^(i+9), y^(i+9)까지입니다. 기본적인 10개의 예제를 사용하여 경사 하강법 업데이트를 합니다.

i+9

θj := θj - α * 1/10* Σ (hθ(x^(k)) - y^(k))*xj^(k)

k=i

i := i + 10;

이것은 10개의 예제 대한 기울기 항을 합산하는 것입니다. 숫자 b = 10는 미니 배치의 크기를 나타냅니다. 그리고, i를 10개씩 증가하면 다음 10개의 예제로 이동합니다.

In order to simplify the indexing for this one at the right top, I'm going to assume we have a mini-batch size of ten and a training set size of a thousand, what we're going to do is have this sort of form, for i equals 1 and that in 21's the stepping, in steps of 10 because we look at 10 examples at a time. And then we perform this sort of gradient descent update using ten examples at a time so this 10 and this i+9 those are consequence of having chosen my mini-batch to be ten. And you know, this ultimate for loop, this ends at 991 here because if I have 1000 training samples then I need 100 steps of size 10 in order to get through my training set. So this is mini-batch gradient descent. Compared to batch gradient descent, this also allows us to make progress much faster. So we have again our running example of, you know, U.S. Census data with 300 million training examples, then what we're saying is after looking at just the first 10 examples we can start to make progress in improving the parameters theta so we don't need to scan through the entire training set. We just need to look at the first 10 examples and this will start letting us make progress and then we can look at the second ten examples and modify the parameters a little bit again and so on. So, that is why Mini-batch gradient descent can be faster than batch gradient descent. Namely, you can start making progress in modifying the parameters after looking at just ten examples rather than needing to wait 'till you've scan through every single training example of 300 million of them.

따라서, 인덱싱을 단순화하기 위해 미니-벤치 b의 크기가 10이고, 전체 학습 셋의 크기가 1,000이라고 가정합니다. 무작위로 1,000개의 데이터를 정렬한 후에 한 번에 10개의 예제씩 계산합니다. 10개씩 학습 예제를 사용하므로 평균 오차의 제곱을 구하기 위해 1/10로 나눕니다. 미니-배치 경사 하강 알고리즘은 i = 1에서 시작하지만, 10개씩 증가하기 때문에 다음은 11, 21,..., 991입니다. 궁극의 For 루프는 991에서 끝납니다. 왜냐하면 1,000개의 학습 예제를 10개씩 100 단계를 움직이기 때문입니다. 이것이 미니-배치 경사 하강법입니다. 미니-배치 경사 하강법은 배치 경사 하강법 보다 훨씬 더 빠르게 진행할 수 있습니다. 예를 들면, 3억 개의 학습 예제가 있는 미국 인구 조사 데이터가 있습니다. 파라미터 θ를 개선하기 위해 처음 10개의 예제씩 합산하고 전체 학습 셋을 계산할 필요가 없습니다. 처음 10개의 예제만으로 진전이 있다면 두 번째 10개의 예제를 계산하고 파라미터를 수정합니다. 그래서 미니-배치 경사 하강법이 배치 경사 경사 하강법보다 빠른 이유입니다. 즉, 3 억 개의 학습 예제를 모두 계산할 때까지 기다리지 않고 10개의 예제만 계산한 후 파라미터 값을 수정합니다.

So, how about Mini-batch gradient descent versus Stochastic gradient descent. So, why do we want to look at b examples at a time rather than look at just a single example at a time as the Stochastic gradient descent? The answer is in vectorization. In particular, Mini-batch gradient descent is likely to outperform Stochastic gradient descent only if you have a good vectorized implementation. In that case, the sum over 10 examples can be performed in a more vectorized way which will allow you to partially parallelize your computation over the ten examples. So, in other words, by using appropriate vectorization to compute the rest of the terms, you can sometimes partially use the good numerical algebra libraries and parallelize your gradient computations over the b examples, whereas if you were looking at just a single example of time with Stochastic gradient descent then, you know, just looking at one example at a time their isn't much to parallelize over. At least there is less to parallelize over. One disadvantage of Mini-batch gradient descent is that there is now this extra parameter b, the Mini-batch size which you may have to fiddle with, and which may therefore take time. But if you have a good vectorized implementation this can sometimes run even faster that Stochastic gradient descent. So that was Mini-batch gradient descent which is an algorithm that in some sense does something that's somewhat in between what Stochastic gradient descent does and what Batch gradient descent does.

따라서, 미니-배치 경사 하강법과 확률적 경사 하강법을 사용하는 이유는 무엇일까요? 한 번에 하나의 예제만 계산하는 것과 한 번에 b개의 예제를 계산하는 이유는 무엇일까요? 정답은 벡터화에 있습니다. 특히, 미니-배치 경사 하강법은 벡터화 구현이 좋은 경우에는 확률적 경사 하강법보다 성능이 우수합니다. 10 개 이상의 예제 합계보다 벡터화 구현으로 수행한다면 10 개 예제에 대해 부분적으로 병렬화하여 계산할 수 있습니다. 즉, 적절한 벡터화를 사용하여 나머지 항을 계산하면 좋은 선형 대수 라이브러리를 부분적으로 사용하면서 b 예제에 대한 기울기 계산을 병렬화할 수 있습니다. 반면에 단일 예제만 계산하는 확률적 경사 하강법은 한 번에 하나의 예제만 보기 때문에 병렬화를 할 수 없습니다. 적어도 병렬화할 것이 적습니다. 미니-배치 경사 하강법의 한 가지 단점은 파라미터 b를 추가하는 것입니다. 미니-배치의 크기는 결정하기 위해 데이터를 조작해야 할 시간이 필요할 수 있습니다. 그러나 벡터화 구현이 좋다면 확률적 경사 하강법보다 더 빠르게 실행할 수 있습니다. 그리서, 미니-배치 경사 하강법은 어떤 의미에서 확률적 경사 하강법과 배치 경사 하강법 사이의 알고리즘입니다.

And if you choose their reasonable value of b. I usually use b equals 10, but, you know, other values, anywhere from say 2 to 100, would be reasonably common. So we choose value of b and if you use a good vectorized implementation, sometimes it can be faster than both Stochastic gradient descent and faster than Batch gradient descent.

그리고, b의 값을 선택할 때, 저는 b = 10을 주로 사용하지만, 2에서 100까지 다른 값을 합리적으로 선택할 수 있습니다. 미니-배치 경사 하강법은 b의 값을 선택하고 벡터화 구현이 잘될 때 확률적 경사 하강법보다 빠르고 배치 경사 하강법 보다 빠를 수 있습니다.

앤드류 응의 머신러닝 동영상 강의

정리하며

배치 경사 하강법은 각 스텝마다 모든 m개의 학습 예제를 합산하고, 확률적 경사 하강법은 각 스텝마다 하나의 학습 예제만을 계산합니다. 미니-배치 경사 하강법은 두 알고리즘 사이 어딘가에 있습니다. 즉, 미니-배치 경사 하강법은 각 스텝마다 b개의 예제를 사용합니다. 여기서 파라미터 b는 미니-배치의 크기입니다.

만일 미니-배치의 크기 b를 10으로 가정한다면, 미니-배치 경사 하강법의 공식은 다음과 같습니다.

i+9

θj := θj - α * 1/10* Σ (hθ(x^(k)) - y^(k))*xj^(k)

k=i

i := i + 10;

따라서, 미국 인구 조사 데이터 3 억 개를 매 경사 하강 스텝마다 3억 개의 예제를 합산하는 배치 경사 하강법 보다 매 경사 하강 스텝마다 b개의 학습 예제를 합산하는 미니-배치 경사 하강법이 훨씬 빠릅니다. 또한, 미니-배치 경사 하강법은 벡터화 구현이 좋은 경우에는 확률적 경사 하강법보다 성능이 우수합니다.

여기서 미니-배치의 크기 b의 합리적인 크기는 2에서 100 사이입니다. 일반적으로 10을 많이 선택합니다.