by 라인하트 Dec 15. 2020

Andrew Ng's Machine Learning (17-2): Stochastic Gradient Descent

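Professor Andrew Ng, co-founder of the online course platform Coursera, is a leading figure in the AI field. The introductory machine learning course he taught at Stanford University is available for free on Coursera (Coursera.org). It is an essential course for machine learning beginners, and one that anyone studying AI and machine learning on their own naturally comes across.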

Large Scale Machine Learning

Gradient Descent with Large Datasets

Stochastic Gradient Descent

For many learning algorithms, among them linear regression, logistic regression and neural networks, the way we derived the algorithm was by coming up with a cost function or an optimization objective, and then using an algorithm like gradient descent to minimize that cost function. When we have a very large training set, gradient descent becomes a computationally very expensive procedure. In this video, we'll talk about a modification to the basic gradient descent algorithm called stochastic gradient descent, which will allow us to scale these algorithms to much bigger training sets.

Algorithms such as linear regression, logistic regression, and neural networks define a cost function or optimization objective and then minimize it with an algorithm such as gradient descent. With a very large training set, gradient descent becomes computationally very expensive. This lecture covers stochastic gradient descent, a modification of gradient descent that scales to large training sets.

Suppose you are training a linear regression model using gradient descent. As a quick recap, the hypothesis will look like this, and the cost function will look like this, which is one half of the average squared error of your hypothesis on your m training examples. The cost function we've already seen looks like this sort of bowl-shaped function: plotted as a function of the parameters theta 0 and theta 1, the cost function J is a sort of bowl-shaped function. And gradient descent looks like this, where in the inner loop of gradient descent you repeatedly update the parameters theta using that expression. In the rest of this video, I'm going to keep using linear regression as the running example. But the ideas of stochastic gradient descent are fully general and also apply to other learning algorithms like logistic regression, neural networks and other algorithms that are based on running gradient descent on a specific training set.

Suppose we train a linear regression model with gradient descent. Jtrain(θ) is one half of the average squared error between the hypothesis hθ(x^(i)) and the actual value y^(i) over the m training examples. Plotted as a function of θ0 and θ1, the cost function J is bowl-shaped. Gradient descent repeats its update to find the minimum. The lecture keeps using linear regression as the running example, but the idea of stochastic gradient descent applies equally to logistic regression, neural networks, and other algorithms.
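As a minimal sketch of the recap above, the hypothesis and cost function for linear regression can be written in plain Python. The function names `hypothesis` and `j_train` and the toy data are illustrative assumptions, not taken from the lecture; each input is assumed to carry a leading bias feature x[0] = 1.

```python
def hypothesis(theta, x):
    """Linear hypothesis h_theta(x) = sum_j theta[j] * x[j], assuming x[0] == 1."""
    return sum(t * xj for t, xj in zip(theta, x))


def j_train(theta, X, y):
    """One half of the average squared error over all m training examples."""
    m = len(X)
    return sum((hypothesis(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * m)


# Hypothetical toy data generated from y = 1 + 2 * x1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]

cost_at_zero = j_train([0.0, 0.0], X, y)    # cost of an all-zero hypothesis
cost_at_truth = j_train([1.0, 2.0], X, y)   # cost of the generating parameters
```

With this toy data the cost is zero at the generating parameters and positive elsewhere, which is the bottom of the bowl the lecture describes.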

So here's a picture of what gradient descent does. If the parameters are initialized to the point there, then as you run gradient descent, different iterations of gradient descent will take the parameters to the global minimum. So it takes a trajectory that looks like that and heads pretty directly to the global minimum.

Here is a picture of what gradient descent does. After the parameters θ0 and θ1 are initialized at the marked point, repeated iterations of gradient descent take them to the global minimum.

Now, the problem with gradient descent is that if m is large, then computing this derivative term can be very expensive, because it requires summing over all m examples. So suppose m is 300 million. In the United States, there are about 300 million people, and so the US census data may have on the order of that many records. If you want to fit a linear regression model to that, then you need to sum over 300 million records, and that's very expensive. To give the algorithm a name, this particular version of gradient descent is also called batch gradient descent. The term batch refers to the fact that we're looking at all of the training examples at a time; we call it sort of a batch of all of the training examples. It really isn't maybe the best name, but this is what machine learning people call this particular version of gradient descent. And if you imagine that you have 300 million census records stored away on disk, the way this algorithm works is that you need to read all 300 million records into your computer in order to compute this derivative term. You need to stream all of these records through the computer because you can't store all your records in computer memory. So you need to read through them and slowly accumulate the sum in order to compute the derivative. Then, having done all that work, that allows you to take one step of gradient descent. And now you need to do the whole thing again: scan through all 300 million records and accumulate these sums. Having done all that work, you can take another little step using gradient descent. And then do that again, and you take yet a third step, and so on. So it's going to take a long time to get the algorithm to converge.

The problem arises when the number of training examples m is very large. The cost of computing the derivative term in the gradient descent update grows with m, because every step sums over all m examples. For example, suppose m = 300,000,000. The United States has a population of about 300 million, so US census data may have on the order of that many records; this is a data size that can realistically occur. To fit a linear regression model, every step of gradient descent must sum over 300 million records, which is very expensive. This version of gradient descent is called batch gradient descent. "Batch" means that all the training examples are handled at once. It may not be the best name, but it is what machine learning people call this particular version of gradient descent. Suppose the 300 million census records are stored on disk. To compute the derivative term, the algorithm must read all 300 million records. Because the whole dataset does not fit in memory, the records have to be streamed through it, accumulating the sum as they are read, which is slow. Only after all of this work can the algorithm take one step. Then the whole process must be repeated: read 300 million records, accumulate the sums, and take a second step. Repeat again and take a third step. As a result, the algorithm takes a very long time to converge to the global minimum.
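The cost described above can be made concrete with a small sketch of a single batch gradient descent step. The name `batch_gradient_step` and the three-example dataset are assumptions for illustration; the point is only that the loop must visit every example before theta changes once.

```python
def batch_gradient_step(theta, X, y, alpha):
    """One batch gradient descent step: every example is read before theta moves."""
    m = len(X)
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):                 # full pass over the training set
        error = sum(t * xj for t, xj in zip(theta, xi)) - yi
        for j, xj in enumerate(xi):
            grad[j] += error * xj            # accumulate the sum in the derivative
    # Simultaneous update of all theta_j using the averaged gradient.
    return [t - alpha * g / m for t, g in zip(theta, grad)]


# Hypothetical toy data; x[0] is the bias feature.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]
theta = batch_gradient_step([0.0, 0.0], X, y, alpha=0.1)
```

With 300 million records in place of three, the inner `for` loop is the part that has to stream the whole dataset from disk for every single step.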

In contrast to batch gradient descent, what we are going to do is come up with a different algorithm that doesn't need to look at all the training examples in every single iteration, but that needs to look at only a single training example in one iteration. Before moving on to the new algorithm, here's the batch gradient descent algorithm written out again, with that being the cost function and that being the update. And of course this term here, which is used in the gradient descent rule, is the partial derivative with respect to the parameter theta j of our optimization objective, Jtrain of theta.

Unlike batch gradient descent, the new algorithm does not need to look at all the training examples in every iteration; it looks at only a single training example per iteration. Before moving on to it, here again is the batch gradient descent algorithm, with its cost function and update. The term used in the update rule is the partial derivative of the optimization objective Jtrain(θ) with respect to the parameter θj.

Now, let's look at the more efficient algorithm that scales better to large data sets. In order to work out the algorithm called stochastic gradient descent, let's write the cost function in a slightly different way: we define the cost of the parameters theta with respect to a training example (x(i), y(i)) to be equal to one half times the squared error that my hypothesis incurs on that example (x(i), y(i)). So this cost function term really measures how well my hypothesis is doing on a single example (x(i), y(i)). Now you notice that the overall cost function Jtrain can be written in this equivalent form. So Jtrain is just the average over my m training examples of the cost of my hypothesis on that example (x(i), y(i)).

Now let's look at stochastic gradient descent, a more efficient algorithm that scales better to large datasets. First, we write the cost function in a slightly different way, defining the cost of the parameters θ with respect to a single training example (x^(i), y^(i)) as follows.

$$\mathrm{cost}(\theta, (x^{(i)}, y^{(i)})) = \frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

Here, cost() measures how well the hypothesis does on the single example (x^(i), y^(i)). The overall cost function Jtrain(θ) can therefore be written as follows.

$$J_{train}(\theta) = \frac{1}{m}\sum_{i=1}^{m} \mathrm{cost}(\theta, (x^{(i)}, y^{(i)}))$$

Jtrain(θ) is the average, over the m training examples, of the cost of the hypothesis on each example.
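To see why this is only a rewriting and not a new objective, the per-example cost can be substituted back into the average. The short derivation below uses only the definitions given above and recovers the usual half mean squared error.

```latex
\begin{aligned}
J_{train}(\theta)
  &= \frac{1}{m}\sum_{i=1}^{m} \mathrm{cost}(\theta, (x^{(i)}, y^{(i)})) \\
  &= \frac{1}{m}\sum_{i=1}^{m} \frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 \\
  &= \frac{1}{2m}\sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
\end{aligned}
```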

Armed with this view of the cost function for linear regression, let me now write out what stochastic gradient descent does. The first step of stochastic gradient descent is to randomly shuffle the data set. By that I just mean randomly shuffle, or randomly reorder, your m training examples. It's sort of a standard pre-processing step; we'll come back to this in a minute. But the main work of stochastic gradient descent is then done in the following. We're going to repeat for i equals 1 through m, so we'll repeatedly scan through the training examples and perform the following update: update the parameter theta j as theta j minus alpha times h of x(i) minus y(i), times x(i) j. And we're going to do this update as usual for all values of j. Now, you notice that this term over here is exactly what we had inside the summation for batch gradient descent. In fact, for those of you who know calculus, it is possible to show that this term is equal to the partial derivative with respect to my parameter theta j of the cost of the parameters theta on (x(i), y(i)), where cost is of course the thing that was defined previously. And just to wrap up the algorithm, let me close my curly braces over there.

With this view of the cost function for linear regression, we can write out stochastic gradient descent. The first step is to randomly shuffle the dataset, that is, to randomly reorder the training examples. This is a kind of data pre-processing step. The main work is done in the next step: for i = 1 through m, scanning through the training examples, perform the update below for every j. The term multiplied by α is exactly what appears inside the Σ of batch gradient descent, and it equals the partial derivative in the second form.

$$\theta_j := \theta_j - \alpha \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} \quad (\text{for } j = 0, 1, \ldots, n)$$

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} \mathrm{cost}(\theta, (x^{(i)}, y^{(i)}))$$
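The calculus step mentioned in the lecture is short enough to write out. Assuming the linear hypothesis with one parameter per feature, so that the derivative of the hypothesis with respect to θj is simply the j-th feature, the chain rule gives:

```latex
\begin{aligned}
\frac{\partial}{\partial \theta_j} \mathrm{cost}(\theta, (x^{(i)}, y^{(i)}))
  &= \frac{\partial}{\partial \theta_j} \frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 \\
  &= \left(h_\theta(x^{(i)}) - y^{(i)}\right) \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)}) \\
  &= \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}
\end{aligned}
```

This is why the two forms of the update are the same rule.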

So what stochastic gradient descent is doing is actually scanning through the training examples. First it's going to look at my first training example (x(1), y(1)). Then, looking at only this first example, it's going to take basically a little gradient descent step with respect to the cost of just this first training example. In other words, we're going to look at the first example and modify the parameters a little bit to fit just the first training example a little bit better. Having done this, inside this inner for-loop it is then going to go on to the second training example. What it's going to do there is take another little step in parameter space, so modify the parameters just a little bit to try to fit just the second training example a little bit better. Having done that, it is then going to go on to my third training example and modify the parameters to try to fit just the third training example a little bit better, and so on until you get through the entire training set. And then this outer repeat loop may cause it to take multiple passes over the entire training set. This view of stochastic gradient descent also motivates why we wanted to start by randomly shuffling the data set. This ensures that when we scan through the training set here, we end up visiting the training examples in some sort of randomly sorted order.

So stochastic gradient descent scans through the training examples. It first looks at the first training example (x^(1), y^(1)) and, using only that example, takes a small gradient descent step with respect to its cost. In other words, it modifies the parameters θ slightly so that they fit the first training example a little better. Having done this inside the for loop, it moves on to the second training example and again modifies θ slightly to fit that example a little better. Then it moves on to the third training example, and so on through the last one. Stochastic gradient descent starts by randomly shuffling the dataset, which ensures that the training examples are visited in a randomly sorted order during the scan.
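The scan described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the lecture's own code: the helper name `sgd_epoch`, the learning rate, the seed, and the noise-free toy data generated from y = 1 + 2 * x1 are all hypothetical choices.

```python
import random


def sgd_epoch(theta, X, y, alpha, rng):
    """One pass of stochastic gradient descent; theta changes after every example."""
    order = list(range(len(X)))
    rng.shuffle(order)                       # step 1: randomly reorder the examples
    theta = list(theta)
    for i in order:                          # step 2: one small update per example
        error = sum(t * xj for t, xj in zip(theta, X[i])) - y[i]
        # Simultaneous update of every theta_j using only example i.
        theta = [t - alpha * error * xj for t, xj in zip(theta, X[i])]
    return theta


rng = random.Random(0)                       # fixed seed so the run is repeatable
X = [[1.0, x1 / 10] for x1 in range(20)]     # x[0] is the bias feature
y = [1.0 + 2.0 * xi[1] for xi in X]          # hypothetical noise-free targets

theta = [0.0, 0.0]
for _ in range(200):                         # outer loop: repeated passes over the data
    theta = sgd_epoch(theta, X, y, alpha=0.05, rng=rng)
```

Because this toy data has no noise, the parameters settle close to the generating values (1, 2); with noisy data they would keep moving around near the minimum instead.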

Depending on whether your data already came randomly sorted or whether it came originally sorted in some strange order, in practice this would just speed up the convergence of stochastic gradient descent a little bit. So in the interest of safety, it's usually better to randomly shuffle the data set if you aren't sure whether it came to you in randomly sorted order. But more importantly, another view of stochastic gradient descent is that it's a lot like batch gradient descent, but rather than waiting to sum up these gradient terms over all m training examples, we're taking this gradient term using just one single training example and we're starting to make progress in improving the parameters already. So rather than waiting until we have taken a pass through all 300,000,000 United States census records, say, rather than needing to scan through all of the training examples before we can modify the parameters a little bit and make progress towards a global minimum, for stochastic gradient descent we instead just need to look at a single training example and we're already starting to make progress in moving the parameters towards the global minimum.

Depending on whether the data already came in random order or in some strange sorted order, shuffling can speed up the convergence of stochastic gradient descent a little. So if you are not sure about the order of the data, it is better to shuffle it randomly. More importantly, stochastic gradient descent is a lot like batch gradient descent, with one difference: batch gradient descent sums the gradient terms over all m training examples before updating, while stochastic gradient descent uses the gradient term of a single training example. In other words, instead of summing over every training example before moving towards the global minimum, it adjusts the parameters a little using just one example. For instance, it does not need to scan all 300 million census records before making progress; it starts moving towards the global minimum after looking at a single training example.

So, here's the algorithm written out again, where the first step is to randomly shuffle the data and the second step is where the real work is done, where that's the update with respect to a single training example (x(i), y(i)). So, let's see what this algorithm does to the parameters. Previously, we saw that when we are using batch gradient descent, that is, the algorithm that looks at all the training examples at a time, batch gradient descent will tend to take a reasonably straight line trajectory to get to the global minimum like that. In contrast, with stochastic gradient descent every iteration is going to be much faster because we don't need to sum up over all the training examples. But every iteration is just trying to fit a single training example better.

So here is stochastic gradient descent again. The first step is to randomly shuffle the data. The second step does the real work: the update with respect to a single training example (x^(i), y^(i)). Let's see what this algorithm does to the parameters. Batch gradient descent, which looks at all the training examples in every step, tends to follow a reasonably straight trajectory to the global minimum. In stochastic gradient descent every iteration is much faster because it does not sum over all the training examples, but each iteration only tries to fit a single example better.

So, let's start stochastic gradient descent at a point like that. The first iteration may take the parameters in that direction, and maybe in the second iteration, looking at just the second example, just by chance we get more unlucky and actually head in a bad direction with the parameters, like that. In the third iteration, where we try to modify the parameters to fit just the third training example better, maybe we'll end up heading in that direction. And then we'll look at the fourth training example and we will do that. The fifth example, sixth example, seventh and so on. And as you run stochastic gradient descent, what you find is that it will generally move the parameters in the direction of the global minimum, but not always. And so it takes some more random-looking, circuitous path towards the global minimum.

So let's plot stochastic gradient descent starting from some initial point. Each iteration moves the parameters. The first iteration may head in a good direction, but the second, by chance, heads in a bad direction. The third iteration heads in a better direction again. This repeats for the fourth training example, the fifth, the sixth, the seventh, and so on. In general, stochastic gradient descent moves the parameters in the direction of the global minimum, but not always. It heads towards the global minimum along a more random-looking, circuitous path.

And in fact, as you run stochastic gradient descent it doesn't actually converge in the same sense as batch gradient descent does. What it ends up doing is wandering around continuously in some region close to the global minimum, but it doesn't just get to the global minimum and stay there. In practice this isn't a problem because, so long as the parameters end up in some region pretty close to the global minimum, that will be a pretty good hypothesis. So usually, running stochastic gradient descent, we get a parameter near the global minimum, and that's good enough for most practical purposes.

Unlike batch gradient descent, stochastic gradient descent does not converge to a single point; it keeps wandering around in a region close to the global minimum. In other words, it does not reach the global minimum and stay there. In practice this is not a problem: as long as the parameters end up in a region close to the global minimum, the resulting hypothesis is pretty good. Running stochastic gradient descent usually yields parameters near the global minimum, which is good enough for most practical purposes.
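One way to see why stochastic gradient descent does not stay at the minimum is a deliberately tiny, hypothetical example: two training examples that disagree, with a single bias parameter. At the minimizer of Jtrain the averaged gradient is zero, but the gradient of each individual example is not, so a single-example step moves the parameter away again. The data and variable names below are assumptions chosen for illustration.

```python
alpha = 0.1
X = [[1.0], [1.0]]          # only the bias feature x[0] = 1
y = [0.0, 2.0]              # two examples that cannot both be fit exactly
theta = [1.0]               # minimizer of J_train for this data (the mean of y)

# Per-example errors h_theta(x) - y at the minimizer.
errors = [theta[0] * xi[0] - yi for xi, yi in zip(X, y)]

# Batch gradient: the average over both examples cancels out.
batch_gradient = sum(e * xi[0] for e, xi in zip(errors, X)) / len(X)
theta_after_batch_step = theta[0] - alpha * batch_gradient

# Stochastic step on the first example alone: the parameter leaves the minimum.
theta_after_sgd_step = theta[0] - alpha * errors[0] * X[0][0]
```

The batch step leaves theta where it is, while the stochastic step shifts it towards the first example; the next example then pulls it back, which is the wandering the lecture describes.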

Just one final detail. In stochastic gradient descent, we had this outer loop, repeat, which says to do this inner loop multiple times. So, how many times do we repeat this outer loop? Depending on the size of the training set, doing this loop just a single time may be enough, and up to maybe 10 times may be typical, so we may end up repeating this inner loop anywhere from once to ten times. So if we have a truly massive data set, like the US census example that I've been talking about with 300 million examples, it is possible that by the time you've taken just a single pass through your training set, so this is for i equals 1 through 300 million, you might already have a perfectly good hypothesis. In which case, this inner loop you might need to do only once if m is very, very large. But in general, taking anywhere from 1 through 10 passes through your data set may be fairly common. It really depends on the size of your training set.

One final detail. Stochastic gradient descent has an outer loop that runs the inner for loop, which performs the updates, multiple times. How many times should the outer loop repeat? Depending on the size of the training set, a single time may be enough, and up to about 10 times is typical, so the inner loop ends up being repeated anywhere from 1 to 10 times. For example, a truly massive dataset like the US census has 300 million training examples, so i runs from 1 to 300 million. By the time a single pass through the dataset is complete, the hypothesis may already be perfectly good. So if m is very large, the inner for loop may need to run only once. In general, 1 to 10 passes through the dataset are common, but it really depends on the size of the training set.

And contrast this with batch gradient descent. With batch gradient descent, after taking a pass through your entire training set, you would have taken just one single gradient descent step, one of these little baby steps of gradient descent, and this is why stochastic gradient descent can be much faster. So, that was the stochastic gradient descent algorithm. If you implement it, hopefully that will allow you to scale up many of your learning algorithms to much bigger data sets and get much better performance that way.

Contrast this with batch gradient descent. Batch gradient descent takes only one gradient descent step after a pass through the entire training set, whereas stochastic gradient descent takes a step after every single training example. This is why stochastic gradient descent can be much faster. That is how the stochastic gradient descent algorithm works; it lets learning algorithms scale to much bigger datasets.
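The comparison comes down to simple counting, sketched below with a hypothetical helper `parameter_updates`. Both methods read the same number of records per pass; what differs is how many times the parameters move.

```python
def parameter_updates(m, passes, method):
    """Number of times theta changes after `passes` full scans of m examples."""
    if method == "batch":
        return passes            # one update per full pass over the data
    if method == "stochastic":
        return passes * m        # one update per training example
    raise ValueError(f"unknown method: {method}")


m = 300_000_000                  # census-sized training set from the lecture
batch_updates = parameter_updates(m, passes=1, method="batch")
sgd_updates = parameter_updates(m, passes=1, method="stochastic")
```

After one pass over the census-sized dataset, batch gradient descent has taken a single step, while stochastic gradient descent has taken 300 million small ones.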

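Andrew Ng's Machine Learning video lecture

Summary

Algorithms such as linear regression, logistic regression, and neural networks define a cost function or optimization objective and then minimize it with an algorithm such as gradient descent. For linear regression:

Hypothesis (j indexes the features, with x_0 = 1 and n features):

$$h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j$$

Cost function (i indexes the m training examples):

$$J_{train}(\theta) = \frac{1}{2m}\sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

Gradient descent update:

$$\theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} \quad (\text{for } j = 0, 1, \ldots, n)$$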

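The biggest problem with this gradient descent update, which searches for the parameters θ that minimize the cost, arises when m is very large. With data on the 300 million people of the United States, every step of gradient descent sums over 300 million examples, so each step is very expensive and the algorithm takes a very long time to converge to the global minimum. Gradient descent that handles all the training examples at once is called batch gradient descent.

To solve this problem, we use stochastic gradient descent. It randomly reorders the training examples and then performs the gradient descent update one example at a time.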

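To describe the stochastic gradient update, we rewrite the cost function slightly.

$$\mathrm{cost}(\theta, (x^{(i)}, y^{(i)})) = \frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

Here, cost() measures how well the hypothesis does on the single example (x^(i), y^(i)). The overall cost function Jtrain(θ) can therefore be written as follows.

$$J_{train}(\theta) = \frac{1}{m}\sum_{i=1}^{m} \mathrm{cost}(\theta, (x^{(i)}, y^{(i)}))$$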

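The stochastic gradient descent update is as follows. The difference from batch gradient descent is that there is no summation; the update uses a single training example.

$$\theta_j := \theta_j - \alpha \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} \quad (\text{for } j = 0, 1, \ldots, n)$$

The update above and the one below are the same.

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} \mathrm{cost}(\theta, (x^{(i)}, y^{(i)}))$$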

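So stochastic gradient descent first shuffles the dataset randomly and then starts the gradient descent updates. Unlike batch gradient descent, it does not converge to a single point; it keeps wandering around in a region close to the global minimum rather than reaching the global minimum and staying there.

How many times should the pass over the data be repeated? If m is very large, the inner for loop may need to run only once; in general, 1 to 10 passes through the dataset are common.

Batch gradient descent takes only one gradient descent step after a pass through the entire training set, whereas stochastic gradient descent takes a step after every single training example. This is why stochastic gradient descent can be much faster.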