Andrew Ng's Machine Learning (2-5): Gradient Descent

by 라인하트 Sep 29. 2020

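Professor Andrew Ng, a founder of the online learning platform Coursera, is a leading figure in the artificial intelligence field. The introductory machine learning course he taught at Stanford University is available for free on Coursera (Coursera.org). It is an essential course for machine learning beginners, and one that anyone studying AI and machine learning on their own will naturally come across.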

Linear Regression with One Variable

Parameter Learning

Gradient Descent

We previously defined the cost function J. In this video, I want to tell you about an algorithm called gradient descent for minimizing the cost function J. It turns out gradient descent is a more general algorithm, and is used not only in linear regression. It's actually used all over the place in machine learning. And later in the class, we'll use gradient descent to minimize other functions as well, not just the cost function J for linear regression. So in this video, we'll talk about gradient descent for minimizing some arbitrary function J, and then in later videos, we'll take this algorithm and apply it specifically to the cost function J that we have defined for linear regression.

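In the previous lecture we defined the cost function J. This lecture explains the gradient descent algorithm, which minimizes the value of the cost function J. Gradient descent is used not only in linear regression but in almost every area of machine learning. Later in the course we will use gradient descent to minimize other functions as well, not just the linear regression cost. In this lecture we learn gradient descent for minimizing an arbitrary function J, and later we apply it to the cost function J of linear regression.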

So here's the problem setup. I'm going to assume that we have some function J(theta 0, theta 1); maybe it's the cost function from linear regression, maybe it's some other function we wanna minimize. And we want to come up with an algorithm for minimizing that function J(theta 0, theta 1).

Here is the problem setup. We have some function J(θ0, θ1), for example the cost function from linear regression, and we want to find an algorithm that minimizes J(θ0, θ1).

Just as an aside, it turns out that gradient descent actually applies to more general functions. So imagine you have a function J of theta 0, theta 1, theta 2, up to, say, some theta n, and you want to minimize over theta 0 up to theta n this J of theta 0 up to theta n. And it turns out gradient descent is an algorithm for solving this more general problem. But for the sake of brevity, for the sake of succinctness of notation, I'm just going to pretend I have only two parameters throughout the rest of this video.

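As an aside, gradient descent applies to more general functions. Suppose a function J has parameters θ0, θ1, θ2, ..., θn, and we want to minimize J(θ0, θ1, ..., θn) over all of them, that is, find the parameter values that make J as small as possible. Gradient descent is an algorithm for solving this general problem. To keep the notation short, however, the rest of this lecture uses a cost function J(θ0, θ1) with only two parameters.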
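The n-parameter case can be sketched in a few lines. This is only an illustration under stated assumptions: the cost J(θ) = Σ (θi - ci)², the targets `c`, the learning rate, and the fixed step count are all hypothetical choices, not something given in the lecture.

```python
def gradient_descent(grad, theta, alpha=0.1, steps=200):
    """Minimize a function of n parameters, given a function that returns its gradient."""
    theta = list(theta)
    for _ in range(steps):
        g = grad(theta)  # gradient evaluated at the current (old) theta
        # Every parameter is updated from the same old theta.
        theta = [t - alpha * gi for t, gi in zip(theta, g)]
    return theta


# Hypothetical cost J(theta) = sum((theta_i - c_i)**2), minimized at theta = c.
c = [1.0, -2.0, 3.0]


def grad_J(theta):
    return [2 * (t - ci) for t, ci in zip(theta, c)]


theta_min = gradient_descent(grad_J, [0.0, 0.0, 0.0])
```

The same loop works for any number of parameters, which is why the lecture can restrict itself to two without losing generality.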

Here's the idea for gradient descent. What we're going to do is we're going to start off with some initial guesses for theta 0 and theta 1. Doesn't really matter what they are, but a common choice would be we set theta 0 to 0, and set theta 1 to 0, just initialize them to 0. What we're going to do in gradient descent is we'll keep changing theta 0 and theta 1 a little bit to try to reduce J(theta 0, theta 1), until hopefully, we wind up at a minimum, or maybe at a local minimum.

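Here is the idea behind gradient descent. First, make initial guesses for the parameters θ0 and θ1. The particular choice does not matter much; a common choice is θ0 = 0 and θ1 = 0, that is, initializing both to 0. Gradient descent then keeps changing θ0 and θ1 a little at a time to reduce J(θ0, θ1) until it reaches a minimum, or possibly a local minimum.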

So let's see in pictures what gradient descent does. Let's say you're trying to minimize this function. So notice the axes, this is theta 0, theta 1 on the horizontal axes and J is the vertical axis and so the height of the surface shows J and we want to minimize this function.

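Here is a 3D plot of the function J(θ0, θ1), which we use to show what gradient descent does. Suppose we want to minimize J(θ0, θ1). The parameters θ0 and θ1 are on the horizontal axes and J(θ0, θ1) is on the vertical axis, so the height of the surface above the θ0-θ1 plane is the value of J.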

So we're going to start off with theta 0, theta 1 at some point. So imagine picking some value for theta 0, theta 1, and that corresponds to starting at some point on the surface of this function. So whatever value of theta 0, theta 1 gives you some point here. I did initialize them to (0, 0) but sometimes you initialize it to other values as well. Now, I want you to imagine that this figure shows a hill. Imagine this is like the landscape of some grassy park, with two hills like so, and I want us to imagine that you are physically standing at that point on the hill, on this little red hill in your park. In gradient descent, what we're going to do is we're going to spin 360 degrees around, just look all around us, and ask, if I were to take a little baby step in some direction, and I want to go downhill as quickly as possible, what direction do I take that little baby step in? If I wanna go down, so I wanna physically walk down this hill as rapidly as possible. Turns out, that if you're standing at that point on the hill, you look all around and you find that the best direction to take a little step downhill is roughly that direction. Okay, and now you're at this new point on your hill. You're gonna, again, look all around and say what direction should I step in order to take a little baby step downhill? And if you do that and take another step, you take a step in that direction. And then you keep going. From this new point you look around, decide what direction would take you downhill most quickly. Take another step, another step, and so on until you converge to this local minimum down here.

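We start at an arbitrary point (θ0, θ1) on the surface of this function. The values of θ0 and θ1 can be anything. The point (θ0, θ1) is usually initialized to (0, 0), but sometimes other values are used. Imagine a park covered in green grass with two hills. You start at the black X mark near the top of the smaller hill and plan to walk down the grassy slope. You look around 360 degrees and want to get down the hill as fast as possible, taking small baby steps. Which way is fastest? Normally you would simply run straight down, but let's picture it step by step. You are standing near the top of the small hill. You look around, find the fastest way down, and take a step in that direction. Now you are standing at a new point. You look around again, decide which way to go, and take another step. From each new point you look around and ask which way leads down most quickly. You take a small step, then another, and keep repeating until you reach a local minimum.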

Gradient descent has an interesting property. This first time we ran gradient descent we were starting at this point over here, right? Started at that point over here. Now imagine we had initialized gradient descent just a couple steps to the right. Imagine we'd initialized gradient descent with that point on the upper right. If you were to repeat this process, so start from that point, look all around, take a little step in the direction of steepest descent, you would do that. Then look around, take another step, and so on. And if you started just a couple of steps to the right, gradient descent would've taken you to this second local optimum over on the right. So if you had started at this first point, you would've wound up at this local optimum, but if you started just at a slightly different location, you would've wound up at a very different local optimum. And this is a property of gradient descent that we'll say a little bit more about later.

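Gradient descent has an interesting property. Now start from a point different from the first starting point, initializing gradient descent at this new location, and repeat the same process. Look around and take one step in the direction of steepest descent. Look around again and take another step. After a few more steps, gradient descent reaches a second local minimum. The minimum you reach depends on where you start: the local minimum found by the first run differs from the one found by the second run. This property of gradient descent is covered in more detail later.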
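The dependence on the starting point can be reproduced with a small sketch. It assumes a hypothetical one-parameter cost J(θ) = θ⁴ - 2θ², which has two local minima at θ = -1 and θ = 1 (the lecture's surface has two parameters; one is used here only to keep the example short). The learning rate and step count are likewise illustrative.

```python
def dJ(theta):
    """Derivative of the hypothetical cost J(theta) = theta**4 - 2 * theta**2."""
    return 4 * theta**3 - 4 * theta


def descend(theta, alpha=0.05, steps=500):
    """Run plain gradient descent from the given starting point."""
    for _ in range(steps):
        theta = theta - alpha * dJ(theta)
    return theta


# Two starting points that are close together, on either side of theta = 0.
left = descend(-0.1)   # ends near the minimum at theta = -1
right = descend(0.1)   # ends near the minimum at theta = +1
```

The two runs differ only in their initial value, yet they settle in different minima, which is the behavior the hill picture describes.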

So that's the intuition in pictures. Let's look at the math. This is the definition of the gradient descent algorithm. We're going to just repeatedly do this until convergence, we're going to update my parameter theta j by taking theta j and subtracting from it alpha times this term over here, okay? So let's see, there's lot of details in this equation so let me unpack some of it.

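That is the intuition behind gradient descent. To understand it mathematically, we now define the gradient descent algorithm. The update is repeated until convergence. It is the formula that updates the parameter θj, and we will unpack the details packed into it.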

First, this notation here, :=, gonna use := to denote assignment, so it's the assignment operator. So briefly, if I write a := b, what this means is, it means in a computer, this means take the value in b and use it to overwrite whatever value is in a. So this means set a to be equal to the value of b, which is assignment. And I can also do a := a + 1. This means take a and increase its value by one. Whereas in contrast, if I use the equal sign and I write a equals b, then this is a truth assertion. Okay? So if I write a equals b, then I'm asserting that the value of a equals the value of b, right? So the left hand side, that's the computer operation, where we set the value of a to a new value. The right hand side, this is asserting, I'm just making a claim that the values of a and b are the same, and so whereas you can write a := a + 1, that means increment a by 1, hopefully I won't ever write a = a + 1 because that's just wrong. a and a + 1 can never be equal to the same values. Okay? So this is the first part of the definition.

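The gradient descent formula is as follows.

repeat until convergence {
    θj := θj - α · ∂/∂θj J(θ0, θ1)    (for j = 0 and j = 1)
}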

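Here the := symbol denotes assignment; it is the assignment operator. In a computer, 'a := b' means take the value of b and store it in a, so a is set to the value of b. Likewise, 'a := a + 1' means increase the value of a by 1. In contrast, 'a = b' with an equals sign is a truth assertion: a claim that the values of a and b are the same, which is either true or false. So 'a := b' is a computer operation, with the variable that receives the new value on the left of ':=' and the value to assign on the right. 'a = a + 1', on the other hand, is a false assertion and should not be written, because a can never equal a + 1.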
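For readers who think in code, the distinction maps onto two different operators in most programming languages. The sketch below uses Python, where `=` plays the role of the lecture's `:=` (assignment) and `==` plays the role of the lecture's `=` (truth assertion); the variable names are purely illustrative.

```python
a = 3
b = 5

a = b                        # assignment: the lecture's "a := b", a now holds 5
is_equal = (a == b)          # truth assertion: the lecture's "a = b"

a = a + 1                    # assignment: the lecture's "a := a + 1", a now holds 6
never_equal = (a == a + 1)   # the assertion "a = a + 1" is false for this integer
```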

This alpha here is a number that is called the learning rate. And what alpha does is it basically controls how big a step we take downhill with gradient descent. So if alpha is very large, then that corresponds to a very aggressive gradient descent procedure where we're trying to take huge steps downhill, and if alpha is very small, then we're taking little, little baby steps downhill. And I'll come back and say more about this later, about how to set alpha and so on.

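The α here is the learning rate. The learning rate α determines how big a step we take when going down the hill. If α is very large, gradient descent is quite aggressive and the steps downhill are very large. If α is very small, the steps downhill are very small. How to choose the learning rate α is explained in detail later.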
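The effect of α can be seen on a toy problem. The sketch below assumes a hypothetical cost J(θ) = θ², whose derivative is 2θ and whose minimum is at θ = 0; the three learning rates and the step count are illustrative values, not recommendations from the lecture.

```python
def run(alpha, theta=1.0, steps=20):
    """Run gradient descent on the hypothetical cost J(theta) = theta**2."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta  # dJ/dtheta = 2 * theta
    return theta


small = run(alpha=0.01)   # baby steps: still far from 0 after 20 steps
medium = run(alpha=0.1)   # larger steps: much closer to 0 after 20 steps
large = run(alpha=1.1)    # huge steps: each step overshoots 0 and lands farther away
```

On this particular function, a learning rate that is too large makes each step jump past the minimum, so the value moves away from 0 instead of toward it.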

And finally, this term here, that's a derivative term. I don't wanna talk about it right now, but I will derive this derivative term and tell you exactly what this is later, okay? And some of you will be more familiar with calculus than others, but even if you aren't familiar with calculus, don't worry about it. I'll tell you what you need to know about this term here.

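Finally, there is the derivative term in the gradient descent formula. It is not explained right now; it will be derived and explained precisely later. Some of you are familiar with calculus, but if you are not, don't worry. Everything you need to know about this term will be explained.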

Now, there's one more subtlety about gradient descent, which is in gradient descent we're going to update, you know, theta 0 and theta 1, right? So this update takes place for j = 0 and j = 1, so you're gonna update theta 0 and update theta 1. And the subtlety of how you implement gradient descent is for this expression, for this update equation, you want to simultaneously update theta 0 and theta 1. What I mean by that is that in this equation, we're gonna update theta 0 := theta 0 minus something, and update theta 1 := theta 1 minus something. And the way to implement it is you should compute the right hand side, right? Compute that thing for theta 0 and theta 1 and then simultaneously, at the same time, update theta 0 and theta 1, okay? So let me say what I mean by that. This is a correct implementation of gradient descent, meaning simultaneous update. So I'm gonna set temp0 equals that, set temp1 equals that, so basically compute the right-hand sides, and then having computed the right-hand sides and stored them into variables temp0 and temp1, I'm gonna update theta 0 and theta 1 simultaneously, because that's the correct implementation.

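Now there is one more important detail about gradient descent. Gradient descent updates the parameters θ0 and θ1; more precisely, the update is applied for j = 0 and j = 1. What to remember when implementing it is that the update equation must change θ0 and θ1 simultaneously. The update can therefore be written compactly as follows.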

temp0 := θ0 - (some value)

temp1 := θ1 - (some value)

θ0 := temp0

θ1 := temp1

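To carry this out, first evaluate the right-hand sides. Compute the new values for θ0 and θ1 together, and only then change θ0 and θ1 at the same time. This is the correct approach: compute the right-hand sides, store the results in the variables temp0 and temp1, and then update θ0 and θ1 simultaneously. That is the correct implementation.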
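The temp0/temp1 pattern translates almost line for line into code. The sketch below is a minimal illustration, assuming a hypothetical cost J(θ0, θ1) = (θ0 + θ1 - 3)² + (θ0 - 1)² with its minimum at (1, 2); the cost, the learning rate, and the fixed step count (used in place of a convergence test) are assumptions made for the example.

```python
def dJ_dtheta0(theta0, theta1):
    """Partial derivative of the hypothetical J with respect to theta0."""
    return 2 * (theta0 + theta1 - 3) + 2 * (theta0 - 1)


def dJ_dtheta1(theta0, theta1):
    """Partial derivative of the hypothetical J with respect to theta1."""
    return 2 * (theta0 + theta1 - 3)


def gradient_descent(theta0=0.0, theta1=0.0, alpha=0.1, steps=1000):
    for _ in range(steps):
        # Compute both right-hand sides from the OLD theta0 and theta1 ...
        temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
        temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
        # ... and only then overwrite both parameters together.
        theta0, theta1 = temp0, temp1
    return theta0, theta1


theta0, theta1 = gradient_descent()  # approaches the minimum at (1, 2)
```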

In contrast, here's an incorrect implementation that does not do a simultaneous update. So in this incorrect implementation, we compute temp0, and then we update theta 0, and then we compute temp1, and then we update theta 1. And the difference between the right hand side and the left hand side implementations is that if you look down here, you look at this step, if by this time you've already updated theta 0, then you would be using the new value of theta 0 to compute this derivative term. And so this gives you a different value of temp1 than the left-hand side, right? Because you've now plugged in the new value of theta 0 into this equation. And so, this on the right-hand side is not a correct implementation of gradient descent, okay?

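In contrast, here is an incorrect implementation that does not perform a simultaneous update. In the incorrect implementation we compute temp0 and update θ0, and then compute temp1 and update θ1. The difference between the implementation on the right and the one on the left is that the already updated θ0 is reused when computing the update for θ1, so temp1 ends up with a different value. The gradient descent implementation on the right is incorrect.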
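A single step is enough to show that the two orderings give different numbers. The sketch below reuses an assumed cost J(θ0, θ1) = (θ0 + θ1 - 3)² + (θ0 - 1)², chosen only because its θ1 derivative depends on θ0; the cost, the learning rate, and the starting point are hypothetical.

```python
ALPHA = 0.1  # illustrative learning rate


def dJ_dtheta0(theta0, theta1):
    return 2 * (theta0 + theta1 - 3) + 2 * (theta0 - 1)


def dJ_dtheta1(theta0, theta1):
    return 2 * (theta0 + theta1 - 3)


def simultaneous_step(theta0, theta1):
    """Correct: both right-hand sides use the old parameter values."""
    temp0 = theta0 - ALPHA * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - ALPHA * dJ_dtheta1(theta0, theta1)
    return temp0, temp1


def sequential_step(theta0, theta1):
    """Incorrect: theta0 is overwritten before theta1's update is computed."""
    theta0 = theta0 - ALPHA * dJ_dtheta0(theta0, theta1)
    theta1 = theta1 - ALPHA * dJ_dtheta1(theta0, theta1)  # sees the NEW theta0
    return theta0, theta1


correct = simultaneous_step(0.0, 0.0)    # approximately (0.8, 0.6)
incorrect = sequential_step(0.0, 0.0)    # approximately (0.8, 0.44)
```

Both versions agree on θ0, but the second one computes θ1 from the already updated θ0 and so takes a different step.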

So I don't wanna say why you need to do the simultaneous updates. It turns out that the way gradient descent is usually implemented, which I'll say more about later, it actually turns out to be more natural to implement the simultaneous updates. And when people talk about gradient descent, they always mean simultaneous update. If you implement the non simultaneous update, it turns out it will probably work anyway. But this algorithm wasn't right. It's not what people refer to as gradient descent, and this is some other algorithm with different properties. And for various reasons this can behave in slightly stranger ways, and so what you should do is really implement the simultaneous update of gradient descent. So, that's the outline of the gradient descent algorithm.

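Why the update must be simultaneous is not explained in detail here. It is how gradient descent is usually implemented, and more will be said about it later; the simultaneous update is in fact the more natural one to implement. When people talk about gradient descent, they always mean the simultaneous update. An implementation without the simultaneous update will probably still work, but it is not correct: it is a different algorithm with different properties, and for various reasons it can behave in slightly strange ways. So the key point is to implement gradient descent with the simultaneous update.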

In the next video, we're going to go into the details of the derivative term, which I wrote up but didn't really define. And if you've taken a calculus class before and if you're familiar with partial derivatives and derivatives, it turns out that's exactly what that derivative term is, but in case you aren't familiar with calculus, don't worry about it. The next video will give you all the intuitions and will tell you everything you need to know to compute that derivative term, even if you haven't seen calculus, or even if you haven't seen partial derivatives before. And with that, with the next video, hopefully we'll be able to give you all the intuitions you need to apply gradient descent.

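The next lecture explains the derivative term in more detail. It helps if you are familiar with calculus, but there is no need to worry if you are not. The next lecture covers all the intuition needed to compute the derivative term, even if you are not comfortable with calculus or have never seen partial derivatives. With that, you will have the intuition needed to apply gradient descent.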

Summary

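Gradient descent is an algorithm for minimizing the cost function J. It is like finding the fastest way down from the top of a mountain. At every step, gradient descent looks for the direction of steepest descent around the current point, uses it to decide the next step, and repeats.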

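The mathematical definition of gradient descent is as follows.

repeat until convergence {
    θj := θj - α · ∂/∂θj J(θ0, θ1)    (for j = 0 and j = 1)
}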

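α (learning rate): sets the step size when moving toward the minimum (the stride length when walking down the hill)

:= (assignment operator): assigns the value on the right to the variable on the left (for 'a := b', the value of b is assigned to a)

∂ (partial derivative symbol): marks the partial derivative term, which gives the slope of the tangent line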

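At every step of the descent, gradient descent computes and updates the parameters θ0 and θ1. At each point (θ0, θ1), both must be updated simultaneously: compute temp0 and temp1 first, then assign θ0 := temp0 and θ1 := temp1. If θ0 is overwritten as soon as temp0 is computed, the result is not correct, because the new θ0 then affects the computation of temp1.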

Exercise

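Suppose θ0 = 1 and θ1 = 2, and θ0 and θ1 are updated simultaneously using the rule θj := θj + √(θ0 θ1) (for j = 0 and j = 1). What are the resulting values of θ0 and θ1?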

The answer is option 2.

Here θ0 = 1 and θ1 = 2, and the rule, which has no derivative term, is θj := θj + √(θ0 θ1). The key to a simultaneous update is to compute temp0 and temp1 first and then assign θ0 := temp0 and θ1 := temp1. Therefore temp0 = 1 + √(1·2) = 1 + √2 and temp1 = 2 + √(1·2) = 2 + √2, so the answer is the second option.
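The arithmetic in this exercise can be checked with a short script. It follows the rule stated in the solution, θj := θj + √(θ0 θ1); the variable names and the extra non-simultaneous variant are added only for comparison and are not part of the original exercise.

```python
import math

# Simultaneous update: both temps are computed from the old values (1, 2).
theta0, theta1 = 1.0, 2.0
temp0 = theta0 + math.sqrt(theta0 * theta1)
temp1 = theta1 + math.sqrt(theta0 * theta1)
theta0, theta1 = temp0, temp1  # (1 + sqrt(2), 2 + sqrt(2))

# Non-simultaneous variant, shown only for contrast: theta0 changes first,
# so the square root for theta1 is taken over the new theta0.
wrong_theta0, wrong_theta1 = 1.0, 2.0
wrong_theta0 = wrong_theta0 + math.sqrt(wrong_theta0 * wrong_theta1)
wrong_theta1 = wrong_theta1 + math.sqrt(wrong_theta0 * wrong_theta1)
```

The simultaneous version gives 1 + √2 and 2 + √2, matching the solution, while the non-simultaneous variant produces a larger value for θ1.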
