Andrew Ng's Machine Learning (2-7): Gradient Descent for Linear Regression

by 라인하트 Oct 02. 2020

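Professor Andrew Ng, co-founder of the online learning platform Coursera, is a leading figure in the artificial intelligence field. The machine learning course he taught to beginners at Stanford University is available for free on Coursera (Coursera.org). It is an essential course for newcomers to machine learning, and one you naturally run into when studying AI and machine learning on your own.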

Linear Regression with One Variable

Parameter Learning

Gradient Descent For Linear Regression

In previous videos, we talked about the gradient descent algorithm and we talked about the linear regression model and the squared error cost function. In this video we're gonna put together gradient descent with our cost function, and that will give us an algorithm for linear regression, or for fitting a straight line to our data.

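In the previous lectures we covered the gradient descent algorithm, the linear regression model, and the squared error cost function. In this lecture we combine linear regression with gradient descent so that we can fit a straight line to the data.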

So this was what we worked out in the previous videos. This is the gradient descent algorithm, which you should be familiar with, and here's the linear regression model with our linear hypothesis and our squared error cost function. What we're going to do is apply gradient descent to minimize our squared error cost function.

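This is the gradient descent algorithm from the previous lectures.

And this is the linear regression model, which consists of a linear hypothesis and a squared error cost function.

We apply gradient descent to minimize the squared error cost function.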
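The slide with these formulas is not reproduced in the text, so here they are written out in the standard notation of the course. The symbols (learning rate α, m training examples, hypothesis hθ) are the ones used throughout this post; the layout is a reconstruction, not a copy of the slide.

```latex
\begin{align*}
&\text{Gradient descent: repeat until convergence} \\
&\qquad \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)
  \qquad \text{(for } j = 0 \text{ and } j = 1\text{)} \\[6pt]
&\text{Linear regression model:} \\
&\qquad h_\theta(x) = \theta_0 + \theta_1 x \\
&\qquad J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
\end{align*}
```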

Now in order to apply gradient descent, in order to, you know, write this piece of code, the key term we need is this derivative term over here. So we need to figure out what this partial derivative term is, plugging in the definition of the cost function J. This turns out to be this: 1 over 2m times the sum from i equals 1 through m of this squared error cost function term. And all I did here was I just, you know, plugged in the definition of the cost function there. And simplifying a little bit more, this turns out to be equal to this: sigma i equals one through m of theta zero plus theta one x(i) minus y(i), squared. And all I did there was I took the definition of my hypothesis and plugged it in there. And it turns out we need to figure out what this partial derivative is for two cases, for j equals 0 and j equals 1. So we want to figure out what this partial derivative is for both the theta 0 case and the theta 1 case. And I'm just going to write out the answers. It turns out this first term simplifies to 1/m times the sum over my training set of just h(x(i)) minus y(i), and for this term, the partial derivative with respect to theta 1, it turns out I get this term: 1/m times the sum of h(x(i)) minus y(i), times x(i). Okay, and computing these partial derivatives, so we're going from this equation, right, going from this equation to either of the equations down there, computing those partial derivative terms requires some multivariate calculus. If you know calculus, feel free to work through the derivations yourself and check that if you take the derivatives, you actually get the answers that I got. But if you're less familiar with calculus, don't worry about it. It's fine to just take these equations that were worked out, and you won't need to know calculus or anything like that in order to do the homework. So let's implement gradient descent and get back to work.

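To apply gradient descent, we need to compute the derivative term. That means understanding the partial derivative and how the definition of the cost function J(θ0, θ1) enters it. The first step is to substitute the definition of the cost function J(θ0, θ1) into the derivative term.

The Σ denotes the sum from i = 1 to m. The second step substitutes the hypothesis hθ(x^(i)) = θ0 + θ1*x^(i) into the cost function. Because the gradient descent update rule has a j = 0 case and a j = 1 case, the derivative term is worked out separately for each.

Computing these partial derivatives requires multivariate calculus. If you know calculus you can derive the formulas yourself, but if you do not, there is no need to worry: you can simply take the resulting formulas and use them. Here is a brief sketch of the idea. Differentiating 'ax^2 + b' with respect to x gives 2ax, because the constant term differentiates to 0, and this value is the slope of the tangent line. This is why the 1/2 is included in the cost function from the start: it cancels the 2 produced by differentiating the square.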
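For readers who want to check the calculus, the derivation can be written out as follows. This is a sketch using the chain rule on the squared error cost function with the hypothesis hθ(x) = θ0 + θ1·x; the results match the two terms stated in the lecture.

```latex
\begin{align*}
\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)
  &= \frac{\partial}{\partial \theta_j} \, \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \\
  &= \frac{\partial}{\partial \theta_j} \, \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2 \\[6pt]
j = 0: \quad \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)
  &= \frac{1}{2m} \sum_{i=1}^{m} 2 \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right) \cdot 1
   = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \\[6pt]
j = 1: \quad \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)
  &= \frac{1}{2m} \sum_{i=1}^{m} 2 \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right) \cdot x^{(i)}
   = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}
\end{align*}
```

The factor 2 from the square cancels the 1/2 in the cost function, which leaves the 1/m in both results.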

So armed with these definitions, or armed with what we worked out to be the derivatives, which is really just the slope of the cost function J, we can now plug them back into our gradient descent algorithm. So here's gradient descent for linear regression, which is gonna repeat until convergence: theta 0 and theta 1 get updated as, you know, this thing minus alpha times the derivative term. So this term here. So here's our linear regression algorithm. This first term here, that term is of course just the partial derivative with respect to theta zero that we worked out on the previous slide. And this second term here, that term is just the partial derivative with respect to theta 1 that we worked out on the previous slide. And just as a quick reminder, when implementing gradient descent, there's actually this detail that you should be implementing it so that you update theta 0 and theta 1 simultaneously.

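This is the gradient descent update rule with the derivative terms worked out.

The derivatives, which are the slopes of the cost function J(θ0, θ1), are plugged into the gradient descent algorithm. The result is the gradient descent algorithm used for linear regression. The update repeatedly subtracts 'α * derivative term' from the parameters θ0 and θ1 until they converge to the minimum. When implementing the update, the values of θ0 and θ1 must be updated simultaneously.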
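The update rule can be sketched in plain Python as follows. The function names (`gradient_descent_step`, `fit`), the learning rate, the step count, and the toy data are all assumptions made for illustration; they do not come from the lecture. The point to notice is that both gradients are computed from the old parameter values before either parameter is overwritten, which is what "simultaneous update" means.

```python
def gradient_descent_step(theta0, theta1, xs, ys, alpha):
    """One batch gradient descent step for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    # Prediction errors h(x_i) - y_i, computed with the *old* parameters.
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    grad0 = sum(errors) / m                               # dJ/dtheta0
    grad1 = sum(e * x for e, x in zip(errors, xs)) / m    # dJ/dtheta1
    # Simultaneous update: both new values use gradients of the old values.
    return theta0 - alpha * grad0, theta1 - alpha * grad1


def fit(xs, ys, alpha=0.1, steps=5000):
    """Repeat the update a fixed number of times, starting from (0, 0)."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(steps):
        theta0, theta1 = gradient_descent_step(theta0, theta1, xs, ys, alpha)
    return theta0, theta1


# Hypothetical toy data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = fit(xs, ys)
print(theta0, theta1)  # close to 1 and 2
```

A fixed step count stands in for "repeat until convergence" here; a real implementation might instead stop when the change in cost falls below a tolerance.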

So let's see how gradient descent works. One of the issues we saw with gradient descent is that it can be susceptible to local optima. So when I first explained gradient descent I showed you this picture of it going downhill on the surface, and we saw how, depending on where you initialize it, you can end up at different local optima. You will either wind up here or here.

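Let's look at how gradient descent behaves. Gradient descent can get stuck in a local minimum. When gradient descent was first introduced, it was described as walking downhill. Depending on the starting point, it ended up at a different local optimum: it could reach either the minimum on the left or the one on the right.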

But it turns out that the cost function for linear regression is always going to be a bowl-shaped function like this. The technical term for this is that this is called a convex function. And I'm not gonna give the formal definition for what is a convex function, C, O, N, V, E, X. But informally a convex function means a bowl-shaped function, and so this function doesn't have any local optima except for the one global optimum. And so gradient descent on this type of cost function, which you get whenever you're using linear regression, will always converge to the global optimum, because there are no other local optima besides the global optimum.

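The cost function for linear regression, however, is always bowl-shaped. The technical term is a convex function. The formal definition is not needed here. Informally, a convex function is a bowl-shaped function that has a single global minimum (global optimum) and no other local minima. On the cost function of linear regression, gradient descent always heads toward the global optimum, because the global minimum is the only minimum there is.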
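A small experiment can make the convexity claim concrete. The sketch below runs batch gradient descent from three different starting points on a made-up data set and checks that every run lands on the same parameters. The helper name `run`, the learning rate, the step count, the starting points, and the data are assumptions chosen for illustration.

```python
def run(theta0, theta1, xs, ys, alpha=0.1, steps=5000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x from a given start."""
    m = len(xs)
    for _ in range(steps):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Hypothetical noisy data roughly following y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.2, 2.9, 5.1, 6.8]

# Three very different initializations.
starts = [(0.0, 0.0), (10.0, -5.0), (-20.0, 8.0)]
results = [run(t0, t1, xs, ys) for t0, t1 in starts]
for start, (t0, t1) in zip(starts, results):
    print(f"start {start} -> theta0 = {t0:.4f}, theta1 = {t1:.4f}")
```

For this data the least-squares line is h(x) = 1.15 + 1.9x, and all three runs should agree with it to several decimal places.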

So now let's see this algorithm in action. As usual, here are plots of the hypothesis function and of my cost function J. And so let's say I've initialized my parameters at this value. Let's say, usually you initialize your parameters at zero, zero: theta zero and theta one equal zero. But for the demonstration, in this particular illustration, I've initialized, you know, theta zero at 900 and theta one at about -0.1, okay. And so this corresponds to h(x) = 900 - 0.1x, which is this line, and to this point out here on the cost function.

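Now let's walk through how gradient descent behaves in detail. The plot on the left shows the hypothesis function, and the plot on the right is a contour plot of the cost function J(θ0,θ1). The parameters are initialized at the red X in the right-hand plot. Parameters are usually initialized at (0,0), but to make the explanation easier, θ0 is set to 900 and θ1 to -0.1. The hypothesis is therefore hθ(x) = 900 - 0.1x, which is the blue line in the left-hand plot.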

Now, if we take one step in gradient descent, we end up going from this point out here, over to the down and left, to that second point over there. And you notice that my line changed a little bit.

If we take one step of gradient descent, the point in the right-hand plot moves to the red point on its left, and the blue hypothesis line in the left-hand plot shifts down.

And as I take another step of gradient descent, my line on the left will change. Right?

After one more step in the right-hand plot, the blue hypothesis line in the left-hand plot drops again.

And I've also moved to a new point on my cost function. And as I take further steps of gradient descent, I'm going down in cost. So my parameters and such are following this trajectory.

Gradient descent moves the point again, closer to the center of the ellipses. It keeps moving in the direction that lowers the value of the cost function.

And if you look on the left, this corresponds to hypotheses that seem to be getting to be better and better fits to the data

The closer the red X gets to the center of the ellipses, the better the blue line fits the data.

until eventually I've now wound up at the global minimum and this global minimum corresponds to this hypothesis, which gets me a good fit to the data.

Eventually, when the point in the right-hand plot reaches the global minimum at the center of the ellipses, the blue hypothesis line fits the data very well.
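The trajectory described above can be imitated numerically by recording the cost after every step. The sketch below uses a small made-up data set rather than the housing data from the lecture, and a starting point with a large intercept and a negative slope only to echo the demonstration; the function names, learning rate, and step count are assumptions.

```python
def cost(theta0, theta1, xs, ys):
    """Squared error cost J = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def descend(theta0, theta1, xs, ys, alpha, steps):
    """Run batch gradient descent and record the cost after every step."""
    m = len(xs)
    history = [cost(theta0, theta1, xs, ys)]
    for _ in range(steps):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
        history.append(cost(theta0, theta1, xs, ys))
    return theta0, theta1, history


# Hypothetical data; the start (9, -1) is far from the best fit on purpose.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.2, 2.9, 5.1, 6.8]
theta0, theta1, history = descend(9.0, -1.0, xs, ys, alpha=0.1, steps=50)
for step in (0, 1, 2, 10, 50):
    print(f"step {step:2d}: J = {history[step]:.4f}")
```

With a learning rate this small relative to the data, the recorded cost should go down at every step, which corresponds to the point moving toward the center of the contour plot.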

And so that's gradient descent, and we've just run it and gotten a good fit to my data set of housing prices. And you can now use it to predict, you know, if your friend has a house of size 1250 square feet, you can now read off the value and tell them that, I don't know, maybe they could get $250,000 for their house.

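That is the gradient descent algorithm. We have found a hypothesis that fits the data on housing prices by house size, so we can now predict housing prices. If your friend's house is 1250 square feet, the predicted price is about $250,000.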

Finally, just to give this another name, it turns out that the algorithm that we just went over is sometimes called batch gradient descent. And it turns out in machine learning, I don't know, I feel like us machine learning people were not always great at giving names to algorithms. But the term batch gradient descent refers to the fact that in every step of gradient descent, we're looking at all of the training examples. So in gradient descent, when computing the derivatives, we're computing the sums [INAUDIBLE]. So every step of gradient descent we end up computing something like this that sums over our m training examples, and so the term batch gradient descent refers to the fact that we're looking at the entire batch of training examples. And again, it's really not a great name, but this is what machine learning people call it. And it turns out that there are sometimes other versions of gradient descent that are not batch versions, but instead do not look at the entire training set but look at small subsets of the training set at a time. And we'll talk about those versions later in this course as well. But for now, using the algorithm we just learned about, or using batch gradient descent, you now know how to implement gradient descent for linear regression. So that's linear regression with gradient descent. If you've seen advanced linear algebra before, so some of you may have taken a class in advanced linear algebra, you might know that there exists a solution for numerically solving for the minimum of the cost function J without needing to use an iterative algorithm like gradient descent. Later in this course we'll talk about that method as well, that just solves for the minimum of the cost function J without needing these multiple steps of gradient descent. That other method is called the normal equations method. But in case you've heard of that method, it turns out that gradient descent will scale better to larger data sets than that normal equations method.

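Finally, the gradient descent algorithm described here also goes by another name: batch gradient descent. It is a term used in machine learning, where people are not always good at naming algorithms. Batch gradient descent means that every step of the algorithm uses the entire training set.

When gradient descent computes the derivative term, it sums over all m training examples. "Batch" refers to the fact that each step looks at the entire batch of training examples. There are other versions of gradient descent that look at small subsets of the training set at a time instead of the whole set; they are covered later in the course. For now, batch gradient descent is how we implement gradient descent for linear regression. Those who have studied advanced linear algebra may know that the minimum of the cost function J can also be found directly, without an iterative algorithm like gradient descent. That method is called the normal equations method and is also covered later in the course. Gradient descent, however, scales better to large data sets than the normal equations method.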
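To illustrate the non-iterative alternative mentioned in the lecture: with a single input variable, setting both partial derivatives of J to zero and solving gives the usual closed-form least-squares formulas, which is what the normal equations reduce to in this case. The sketch below compares that direct solution with batch gradient descent. The function names, the data, the learning rate, and the step count are assumptions for illustration, and the general matrix form of the normal equations is not shown here.

```python
def closed_form(xs, ys):
    """Direct least-squares solution for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = sxy / sxx
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1


def batch_gradient_descent(xs, ys, alpha=0.1, steps=5000):
    """Iterative solution: every step sums over all m training examples."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(steps):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Hypothetical noisy data roughly following y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.2, 2.9, 5.1, 6.8]
direct = closed_form(xs, ys)
iterative = batch_gradient_descent(xs, ys)
print("closed form:     ", direct)
print("gradient descent:", iterative)
```

Both approaches should agree closely on a problem this small. The lecture's point about scaling concerns much larger problems than this toy example.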

And now that we know about gradient descent, we'll be able to use it in lots of different contexts and we'll use it in lots of different machine learning problems as well. So congrats on learning about your first machine learning algorithm. We'll later have exercises in which we'll ask you to implement gradient descent and hopefully see these algorithms work for yourselves. But before that, in the next set of videos, I first want to tell you about a generalization of the gradient descent algorithm that will make it much more powerful. And I guess I'll tell you about that in the next video.

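Gradient descent can be used in many different machine learning problems. Congratulations on learning your first machine learning algorithm. Later exercises will ask you to implement gradient descent yourself. Before that, the next lecture introduces a generalization of gradient descent that makes it much more powerful.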

Summary

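The gradient descent algorithm can be combined with the linear regression model. For the hypothesis 'h(x) = θ0 + θ1*x', there is a cost function J(θ0,θ1) whose value changes whenever θ0 and θ1 change. The cost function and the gradient descent algorithm can be combined into a single mathematical expression.

For the hypothesis 'h(x) = θ1*x', there is a cost function J(θ1) whose value changes whenever θ1 changes. Here too, the cost function and the gradient descent algorithm can be combined into a single mathematical expression.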

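The gradient descent algorithm described here is also called batch gradient descent, because every step sums over all of the training examples. In general, gradient descent can end up at a local minimum (local optimum) depending on where it starts. The cost function of linear regression, however, is convex and always has exactly one minimum, so gradient descent on it reaches the global minimum (global optimum).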
