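by 라인하트, Sep 30, 2020

Andrew Ng's Machine Learning (2-6): Understanding Gradient Descent

Professor Andrew Ng, co-founder of the online learning platform Coursera, is a leading figure in the AI field. The course he taught to machine learning beginners at Stanford University is available for free as an online course on Coursera (Coursera.org). It is an essential course for machine learning beginners, and one you naturally come across when studying AI and machine learning on your own.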

Linear Regression with One Variable

Parameter Learning

Gradient Descent Intuition

In the previous video, we gave a mathematical definition of gradient descent. Let's go deeper, and in this video get better intuition about what the algorithm is doing and why the steps of the gradient descent algorithm might make sense.

In the previous lecture we learned the mathematical definition of gradient descent. This lecture builds a deeper intuition for what the algorithm does and why each of its steps makes sense.

Here's the gradient descent algorithm that we saw last time, and just to remind you, this parameter, or this term, alpha is called the learning rate. And it controls how big a step we take when updating my parameter theta j. And this second term here is the derivative term. And what I wanna do in this video is give you intuition about what each of these two terms is doing and why, when put together, this entire update makes sense.

This is the gradient descent algorithm we saw last time. As a reminder, α is the learning rate. It determines how big a step we take when updating the parameter θj. The second term is the derivative term. This lecture explains what each of the two terms does and why the update as a whole makes sense.
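For reference, the update rule described above can be written out as follows. This is only a sketch: the two-parameter form J(θ0, θ1) is assumed from the earlier lectures in this series, and the underbrace labels are added here just to name the two terms.

```latex
\theta_j := \theta_j
  - \underbrace{\alpha}_{\text{learning rate}}
    \,\underbrace{\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)}_{\text{derivative term}}
```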

In order to convey these intuitions, what I want to do is use a slightly simpler example, where we want to minimize a function of just one parameter. So say we have a cost function J of just one parameter, theta one, like we did a few videos back, where theta one is a real number. So we can have 1D plots, which are a little bit simpler to look at. Let's try to understand what gradient descent would do on this function. So let's say, here's my function, J of theta 1, where theta 1 is a real number. All right?

To build intuition, take a simpler example: minimizing a function of a single parameter, as in an earlier lecture. The parameter θ1 is a real number, so the cost function J(θ1) can be drawn as a simple one-dimensional plot. This plot lets us see what gradient descent does.

Now, let's say I initialize gradient descent with theta one at this location. So imagine that we start off at that point on my function. What gradient descent would do is it will update: theta one gets updated as theta one minus alpha times d/d theta one of J of theta one, right? And as an aside, this derivative term, right, if you're wondering why I changed the notation from these partial derivative symbols: if you don't know what the difference is between these partial derivative symbols and the d/d theta, don't worry about it.

Technically in mathematics you call this a partial derivative and call this a derivative, depending on the number of parameters in the function J. But that's a mathematical technicality. And so for the purpose of this lecture, think of these partial symbols and d/d theta 1 as exactly the same thing. And don't worry about what the real difference is. I'm gonna try to use the mathematically precise notation, but for our purposes these two notations are really the same thing.

And so let's see what this equation will do. So we're going to compute this derivative. Not sure if you've seen derivatives in calculus before, but what the derivative at this point does is basically say, let's take the tangent at that point, like that straight line, that red line, that is just touching this function, and let's look at the slope of this red line. That's what the derivative is: it's the slope of the line that is just tangent to the function. Okay, the slope of a line is just this height divided by this horizontal thing. Now, this line has a positive slope, so it has a positive derivative. And so my update to theta is going to be: theta 1 gets updated as theta 1 minus alpha times some positive number. Okay. Alpha, the learning rate, is always a positive number. And so theta one is updated as theta one minus something. So I'm gonna end up moving theta one to the left. I'm gonna decrease theta one, and we can see this is the right thing to do, cuz I actually wanna head in this direction, you know, to get me closer to the minimum over there. So gradient descent so far seems to be doing the right thing.

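Mark the point where θ1 meets the cost function J with a red dot. From here, the gradient descent update is:

θ1 := θ1 - α * d/dθ1 J(θ1)

The update contains a derivative term. You do not need to know why the partial derivative symbol ∂ was suddenly replaced with the ordinary derivative symbol d. The symbol simply depends on the number of parameters of the cost function J: with a single parameter θ it is a derivative, and with two or more parameters it is a partial derivative. The lecture uses the mathematically precise notation, but for our purposes the two symbols mean the same thing and can be treated as identical.

Now consider the update of θ1. The derivative term is the slope of the line tangent to J at the red dot, drawn as the red straight line. The slope of that line is its height divided by its horizontal length, shown with dotted lines. The red line has a positive slope, so the derivative term is positive. The update of θ1 therefore reduces to:

θ1 := θ1 - α * (positive number)

θ1 := θ1 - (some positive value)

The learning rate α is always positive, so the update subtracts a positive number from θ1. θ1 therefore gets smaller, which on the plot of J(θ1) means gradient descent moves to the left, toward the minimum. That covers a starting point on the right side of the cost function J(θ1).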
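To make "the derivative is the slope of the tangent line" concrete, here is a minimal Python sketch. The cost function J(θ1) = θ1², the starting point, and the value of α are assumptions chosen only for illustration; the lecture does not specify them. The slope is estimated numerically with a central finite difference rather than derived by hand.

```python
def J(theta1):
    # Hypothetical cost function (an assumption): minimum at theta1 = 0.
    return theta1 ** 2


def slope(f, x, h=1e-6):
    # Central finite difference: approximates the slope of the tangent at x.
    return (f(x + h) - f(x - h)) / (2 * h)


alpha = 0.1      # learning rate, always positive
theta1 = 2.0     # a starting point to the right of the minimum

d = slope(J, theta1)               # positive slope on the right side
new_theta1 = theta1 - alpha * d    # theta1 := theta1 - alpha * (positive number)

print("slope:", d)
print("theta1 before:", theta1, "after:", new_theta1)
```

Because the slope is positive, the updated θ1 is smaller than the original, which is the move to the left described above.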

Let's look at another example. So let's take my same function J, let's try to draw the same function, J of theta 1. And now, let's say I had instead initialized my parameter over there on the left. So theta 1 is here. I'm gonna add that point on the surface. Now my derivative term, d/d theta one of J of theta one, when evaluated at this point, we're gonna look at the slope of that line, so this derivative term is the slope of this line. But this line is slanting down, so this line has negative slope. Right? Or alternatively, I say that this function has negative derivative, which just means negative slope at that point. So this is less than or equal to 0, so when I update theta, I'm gonna have theta updated as theta minus alpha times a negative number. And so I have theta 1 minus a negative number, which means I'm actually going to increase theta, because minus a negative number means I'm adding something to theta. And what that means is that I'm going to end up increasing theta until it's over here, which again seems like the thing I wanted to do to try to get me closer to the minimum. So this hopefully explains the intuition behind what the derivative term is doing.

Now consider a starting point on the left side of the cost function J(θ1). Mark the point where θ1 meets the surface of J(θ1) in red. The gradient descent update is the same as before:

θ1 := θ1 - α * d/dθ1 J(θ1)

Here too the derivative term is the slope of the red tangent line, computed as its height divided by its horizontal length, shown with dotted lines. This time the red line slopes downward, so the derivative term is negative. The update of θ1 therefore reduces to:

θ1 := θ1 - α * (negative number)

θ1 := θ1 + (some positive value)

The learning rate α is always positive, so a positive number times a negative number is negative, and the update subtracts a negative number from θ1. θ1 therefore gets larger, which on the plot of J(θ1) means gradient descent moves to the right, toward the minimum. That covers a starting point on the left side of the cost function J(θ1), and it is what the derivative term does.
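The two cases (starting right of the minimum and starting left of it) can be compared side by side in a short Python sketch. The cost function J(θ1) = (θ1 - 1)², its derivative, the helper name `gradient_step`, and the starting points are all hypothetical choices made for illustration.

```python
def dJ(theta1):
    # Derivative of the hypothetical cost J(theta1) = (theta1 - 1) ** 2,
    # whose minimum is at theta1 = 1 (an assumption for this sketch).
    return 2 * (theta1 - 1)


def gradient_step(theta1, alpha=0.1):
    # One update: theta1 := theta1 - alpha * dJ/dtheta1
    return theta1 - alpha * dJ(theta1)


right_start = 3.0    # right of the minimum: slope is positive
left_start = -1.0    # left of the minimum: slope is negative

after_right = gradient_step(right_start)   # subtracts a positive number
after_left = gradient_step(left_start)     # subtracts a negative number

print("from the right:", right_start, "->", after_right)
print("from the left: ", left_start, "->", after_left)
```

In both cases the sign of the derivative alone decides the direction, and both updates end up closer to the minimum than where they started.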

Let's take a look at the learning rate term alpha and see what that's doing. So here's my gradient descent update rule, that's this equation. And let's look at what could happen if alpha is either too small or if alpha is too large. So this first example, what happens if alpha is too small? So here's my function J, J of theta. Let's start here. If alpha is too small, then what I'm gonna do is multiply my update by some small number, so I end up taking a baby step like that. Okay, so that's one step. Then from this new point, I'm gonna have to take another step. But if alpha's too small, I take another little baby step. And so if my learning rate is too small, I'm gonna end up taking these tiny, tiny baby steps as I try to get to the minimum. And I'm gonna need a lot of steps to get to the minimum, and so if alpha is too small, gradient descent can be slow, because it's gonna take these tiny, tiny baby steps and so it's gonna need a lot of steps before it gets anywhere close to the global minimum.

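Next, look at the role of the learning rate α. Here is the gradient descent update rule. Let's see what happens if α is too small or too large.

First, suppose the learning rate α is too small. Here is the cost function J(θ1), and we start from a high point on the left of the plot. Because α is very small, θ1 moves to the right only a tiny bit at a time. After each tiny move the update runs again and moves another tiny bit, so a very large number of steps is needed. In short, when α is too small, the descent toward the minimum is very slow.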
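A rough way to see the cost of a very small α is to count updates. The sketch below assumes the cost J(θ1) = θ1² (derivative 2·θ1), a starting point on the left, and a tolerance for "close enough to the minimum". None of these values come from the lecture; the helper name `steps_to_converge` is also made up for this example.

```python
def steps_to_converge(alpha, theta1=-4.0, tol=1e-3, max_steps=100_000):
    # Counts updates until theta1 is within `tol` of the minimum at 0.
    # Assumes the hypothetical cost J(theta1) = theta1 ** 2.
    steps = 0
    while abs(theta1) > tol and steps < max_steps:
        theta1 = theta1 - alpha * 2 * theta1   # derivative of J is 2 * theta1
        steps += 1
    return steps


slow = steps_to_converge(alpha=0.001)   # baby steps
fast = steps_to_converge(alpha=0.1)     # moderate steps

print("alpha=0.001 needs", slow, "steps")
print("alpha=0.1   needs", fast, "steps")
```

Under these assumptions the smaller learning rate needs far more updates to reach the same tolerance, which is the slowness described above.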

Now how about if alpha is too large? So, here's my function J of theta. It turns out that if alpha is too large, then gradient descent can overshoot the minimum and may even fail to converge or even diverge, so here's what I mean. Let's say we start off with theta there; it's actually close to the minimum. So the derivative points to the right, but if alpha is too big, I'm gonna take a huge step. Maybe I'm gonna take a huge step like that. So it ends up taking a huge step, and now my cost function has gotten worse. Cuz it started off with this value, and now my value has gotten worse. Now my derivative points to the left, it says I should decrease theta. But if my learning rate is too big, I may take a huge step going from here all the way to out there. So we end up being over there, right? And if my learning rate is too big, we can take another huge step on the next iteration and kind of overshoot and overshoot and so on, until you notice I'm actually getting further and further away from the minimum. So if alpha is too large, it can fail to converge or even diverge.

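Next, suppose the learning rate α is too large. In that case gradient descent can overshoot the minimum and may fail to converge, or even diverge. Here is the cost function J(θ1), and this time we start at a point already quite close to the minimum, on its left. There the derivative term points to the right, but because α is too large the step is huge and lands on the other side of the minimum, where the cost is higher than before. On that side the derivative term points to the left, and the large α again produces a huge step. Each following update overshoots in the same way by an even larger amount, so θ1 keeps moving further away from the minimum. In the end, when α is too large, gradient descent can fail to converge to the global minimum and can diverge.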
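Overshooting can be reproduced with a few lines of Python. This is a sketch under assumptions: the cost is taken to be J(θ1) = θ1², the start is close to the minimum, and α = 1.1 is picked deliberately too large for this particular function. The helper name `run` is hypothetical.

```python
def run(alpha, theta1=0.5, steps=5):
    # Records theta1 after each update for the hypothetical cost
    # J(theta1) = theta1 ** 2, whose derivative is 2 * theta1.
    path = [theta1]
    for _ in range(steps):
        theta1 = theta1 - alpha * 2 * theta1
        path.append(theta1)
    return path


path = run(alpha=1.1)   # deliberately too large for this cost function

for i, value in enumerate(path):
    print("step", i, "theta1 =", value)
```

Each update jumps to the opposite side of the minimum and lands further from it than before, so the iterates grow instead of settling down.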

Now, I have another question for you. So this is a tricky one, and when I was first learning this stuff it actually took me a long time to figure this out. What if your parameter theta 1 is already at a local minimum, what do you think one step of gradient descent will do?

Here is a tricky question, one that took a long time to figure out when first learning this material. If the parameter θ1 is already at a local minimum, what will one step of gradient descent do?

So let's suppose you initialize theta 1 at a local minimum. So, suppose this is your initial value of theta 1 over here, and it is already at a local optimum, or the local minimum. It turns out that at the local optimum, your derivative will be equal to zero. So for that slope, that tangent point, the slope of this line will be equal to zero, and thus this derivative term is equal to zero. And so in your gradient descent update, you have theta one gets updated as theta one minus alpha times zero. And so what this means is that if you're already at the local optimum, it leaves theta 1 unchanged, cause it updates as theta 1 equals theta 1. So if your parameters are already at a local minimum, one step of gradient descent does absolutely nothing; it doesn't change your parameter, which is what you want because it keeps your solution at the local optimum.

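Suppose the parameter θ1 is initialized at a local minimum. θ1 is here, at a local minimum, also called a local optimum. The derivative term, the slope of the tangent line, is 0 there. The gradient descent update of θ1 is therefore:

θ1 := θ1 - α * 0

θ1 := θ1

So if θ1 is already at a local minimum, the update leaves θ1 unchanged, because θ1 := θ1. One step of gradient descent has no effect on the parameter, which is what we want: the parameter stays at the local optimum.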
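The same conclusion can be checked numerically. The sketch below assumes a cost J(θ1) = θ1⁴ - 2·θ1², chosen only because it has a minimum exactly at θ1 = 1; the value of α is arbitrary, since it is multiplied by zero.

```python
def dJ(theta1):
    # Derivative of the hypothetical cost J(theta1) = theta1**4 - 2 * theta1**2,
    # which has a minimum at theta1 = 1 (an assumption for this sketch).
    return 4 * theta1 ** 3 - 4 * theta1


alpha = 0.3     # any positive learning rate gives the same result here
theta1 = 1.0    # already sitting at the minimum

updated = theta1 - alpha * dJ(theta1)   # theta1 := theta1 - alpha * 0

print("derivative:", dJ(theta1))
print("theta1 before:", theta1, "after:", updated)
```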

This also explains why gradient descent can converge to a local minimum even with the learning rate alpha fixed. Here's what I mean by that; let's look at an example. So here's a cost function J of theta that maybe I want to minimize, and let's say I initialize my algorithm, my gradient descent algorithm, out there at that magenta point. If I take one step of gradient descent, maybe it will take me to that point, because my derivative's pretty steep out there.

Right? Now, I'm at this green point, and if I take another step of gradient descent, you notice that my derivative, meaning the slope, is less steep at the green point compared to the magenta point out there. Because as I approach the minimum, my derivative gets closer and closer to zero. So after one step of gradient descent, my new derivative is a little bit smaller. So when I take another step of gradient descent, I will naturally take a somewhat smaller step from this green point than I did from the magenta point. Now I'm at a new point, a red point, and I'm even closer to the global minimum, so the derivative here will be even smaller than it was at the green point. So I'm gonna take another step of gradient descent. Now my derivative term is even smaller, and so the magnitude of the update to theta one is even smaller, so I take a small step like so. And as gradient descent runs, you will automatically take smaller and smaller steps, until eventually you're taking very small steps, you know, and you finally converge to the local minimum.

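This is why gradient descent can converge to a local minimum even when the learning rate α is fixed. Here is a cost function J(θ). To find its minimum, initialize gradient descent at the magenta point. One step of gradient descent moves a fairly large distance to the next point, the green point, because the slope at the magenta point is steep. Even though α stays the same, the size of each step is α times the derivative term, so it is determined by the slope. The slope at the green point is less steep than at the magenta point: the closer to the minimum, the gentler the slope and the closer the derivative is to zero. The next step is therefore smaller than the move from the magenta point to the green point. The new point is even closer to the global minimum, and its derivative term is smaller than at the green point, so the update to θ1 is smaller still. As gradient descent runs, it automatically takes smaller and smaller steps, and by moving a little at a time it eventually converges to the local minimum.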
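The shrinking steps can be observed directly by recording the size of each update while α stays fixed. As before, this is a sketch with an assumed cost J(θ1) = θ1² and assumed values for α and the starting point; the helper name `step_sizes` is made up for this example.

```python
def step_sizes(alpha=0.1, theta1=4.0, steps=6):
    # Returns the magnitude of each update for the hypothetical cost
    # J(theta1) = theta1 ** 2 (derivative 2 * theta1), with alpha held fixed.
    sizes = []
    for _ in range(steps):
        update = alpha * 2 * theta1   # alpha times the derivative term
        theta1 = theta1 - update
        sizes.append(abs(update))
    return sizes


sizes = step_sizes()

for i, size in enumerate(sizes, start=1):
    print("step", i, "moved", size)
```

α never changes inside the loop, yet every step is smaller than the one before it, because the derivative term shrinks as θ1 approaches the minimum.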

So just to recap, in gradient descent, as we approach a local minimum, gradient descent will automatically take smaller steps. And that's because as we approach the local minimum, by definition the local minimum is when the derivative is equal to zero. As we approach the local minimum, this derivative term will automatically get smaller, and so gradient descent will automatically take smaller steps. So there's no need to decrease alpha over time.

In summary, gradient descent automatically takes smaller steps as it approaches a local minimum. That is because the derivative term, the slope of the tangent line, is zero at a local minimum, so it shrinks as the minimum gets closer, and the steps shrink with it. There is therefore no need to decrease the learning rate α over time.

So that's the gradient descent algorithm, and you can use it to try to minimize any cost function J, not just the cost function J that we defined for linear regression. In the next video, we're going to take the function J and set that back to be exactly linear regression's cost function, the squared cost function that we came up with earlier. And taking gradient descent and this squared cost function and putting them together, that will give us our first learning algorithm, that'll give us a linear regression algorithm.

That is the gradient descent algorithm. It can be used to minimize any cost function J, not only the one defined for linear regression. In the next lecture we set J back to the cost function of linear regression, the squared-error cost function from earlier, and combine it with gradient descent. The result is our first learning algorithm, linear regression.

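Wrapping up

The derivative term is the slope of the tangent line. If the derivative term is positive, the update is 'θ1 := θ1 - α * (positive number)', so updating θ1 moves it one step to the left, toward the minimum. Conversely, if the derivative term is negative, the update is 'θ1 := θ1 - α * (negative number)', so updating θ1 moves it one step to the right, toward the minimum.

Gradient descent repeats this process until it reaches a minimum of the cost function J.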
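Putting the pieces together, the repeated update might be sketched as a small loop. The function name `gradient_descent`, the stopping rule (stop once the update is smaller than a tolerance), the iteration cap, and the example cost J(θ1) = (θ1 - 3)² are all assumptions made for illustration; the lecture only says to repeat the update until convergence.

```python
def gradient_descent(dJ, theta1, alpha=0.1, tol=1e-8, max_iters=10_000):
    # Repeats theta1 := theta1 - alpha * dJ(theta1) until the update is tiny.
    # The tolerance-based stopping rule is an assumption of this sketch.
    for i in range(max_iters):
        update = alpha * dJ(theta1)
        if abs(update) < tol:
            return theta1, i
        theta1 = theta1 - update
    return theta1, max_iters


# Hypothetical cost J(theta1) = (theta1 - 3) ** 2, so dJ = 2 * (theta1 - 3).
theta_min, iters = gradient_descent(lambda t: 2 * (t - 3), theta1=0.0)

print("estimated minimum at theta1 =", theta_min, "after", iters, "updates")
```

The derivative is passed in as a function so the same loop can be reused with any differentiable single-parameter cost.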

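The learning rate α scales the size of each step gradient descent takes toward the minimum. If the learning rate is too small, the steps are tiny, so reaching the minimum takes many computations and a long time. If the learning rate is too large, gradient descent can overshoot the minimum and may fail to converge, or even diverge.

Even with a fixed learning rate, the steps in θ1 are not all the same size: each step is α times the derivative term. The function is steeper far from the minimum and flatter near it, so gradient descent moves θ1 quickly where the slope is steep and slowly where the slope is gentle, taking smaller and smaller steps as it approaches the minimum.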

