by 라인하트 Oct 19. 2020

Andrew Ng's Machine Learning (6-5): Gradient Descent for Logistic Regression

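Professor Andrew Ng, a co-founder of the online learning platform Coursera, is a leading figure in the artificial intelligence field. The machine learning course he taught to beginners at Stanford University is available for free on Coursera (Coursera.org). It is an essential course for machine learning beginners, and one you will naturally come across when studying AI and machine learning on your own.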

Logistic Regression

Logistic Regression Model

Simplified Cost Function and Gradient Descent

In this video, we'll figure out a slightly simpler way to write the cost function than we have been using so far. And we'll also figure out how to apply gradient descent to fit the parameters of logistic regression. So, by the end of this video you'll know how to implement a fully working version of logistic regression.

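This lecture shows a slightly simpler way to write the cost function used so far, and explains how to apply gradient descent to find parameters θ that fit the data in logistic regression. By the end of the lecture, you will know how to implement a fully working logistic regression.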

Here's our cost function for logistic regression. Our overall cost function is 1 over m times the sum over the training set of the cost of making different predictions on the different examples with labels y^(i). And this is the cost of a single example that we worked out earlier. And I just want to remind you that for classification problems, in our training set, and in fact even for examples not in our training set, y is always equal to zero or one, right?

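Here are the hypothesis and the cost function for logistic regression.

hθ(x) = 1 / (1 + e^(-θ^T x))
J(θ) = (1/m) Σ Cost(hθ(x^(i)), y^(i)), summed over i = 1, ..., m
Cost(hθ(x), y) = -log(hθ(x)) if y = 1
Cost(hθ(x), y) = -log(1 - hθ(x)) if y = 0

The overall cost function J(θ) is the sum of Cost() over the training examples, each comparing the prediction hθ(x^(i)) with the actual label y^(i), divided by the number of training examples m. Cost(hθ(x), y) is the cost of a single training example, covered in the previous lecture. In a classification problem, y is always 0 or 1 for every example in the training set.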

That's sort of part of the mathematical definition of y. Because y is either zero or one, we'll be able to come up with a simpler way to write this cost function. And in particular, rather than writing out this cost function on two separate lines with two separate cases, for y equals one and y equals zero, I'm going to show you a way to take these two lines and compress them into one equation. And this will make it more convenient to write out the cost function and derive gradient descent. Concretely, we can write out the cost function as follows. We say that Cost(h(x), y), I'm going to write this as -y times log h(x) - (1 - y) times log(1 - h(x)). And I'll show you in a second that this expression is an equivalent, more compact way of writing out the definition of the cost function that we have up here. Let's see why that's the case.

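This is part of the mathematical definition of y. Because y is always 0 or 1, the Cost() function can be written more simply: instead of writing it on two lines, one for y = 1 and one for y = 0, we compress it into one line. A single equation is more convenient for writing out the cost function and deriving gradient descent. Concretely, it can be written as follows.

Cost(hθ(x), y) = -y log(hθ(x)) - (1 - y) log(1 - hθ(x))

In short, the one-line Cost() function is equivalent to the two-line version, just more compact. Let's see why.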

We know that there are only two possible cases. y must be zero or one. So let's suppose y equals one. If y is equal to 1, then this equation is saying that the cost is equal to, well, if y is equal to 1, then this thing here is equal to 1. And 1 minus y is going to be equal to 0, right? So if y is equal to 1, then 1 minus y is 1 minus 1, which is therefore 0. So the second term gets multiplied by 0 and goes away. And we're left with only this first term, which is -y times log(h(x)). y is 1, so that's equal to -log h(x). And this equation is exactly what we have up here for y = 1.

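Only two cases are possible: y = 0 or y = 1. First, consider y = 1.

Substitute y = 1 into the Cost() function. In the second term, (1 - y) becomes (1 - 1) = 0, so the whole term vanishes and only -log(hθ(x)) remains. This matches the y = 1 case of the two-line definition exactly.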

The other case is if y = 0. And if that's the case, then our writing of the cost function is saying that, well, if y is equal to 0, then this term here would be equal to zero. Whereas 1 minus y, if y is equal to zero, would be equal to 1, because 1 minus y becomes 1 minus zero, which is just equal to 1. And so the cost function simplifies to just this last term here, right? Because the first term over here gets multiplied by zero, and so it disappears, and so it's just left with this last term, which is -log(1 - h(x)). And you can verify that this term here is just exactly what we had for when y is equal to 0.

So this shows that this definition for the cost is just a more compact way of taking both of these expressions, the cases y =1 and y = 0, and writing them in a more convenient form with just one line.

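Next, consider y = 0.

Substitute y = 0 into the Cost() function. In the first term, y is 0, so the whole term vanishes and only -log(1 - hθ(x)) remains. This matches the y = 0 case of the two-line definition exactly.

The new one-line Cost() function combines the y = 1 and y = 0 cases. The unified single-line equation is more convenient to use.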
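The equivalence of the two forms can also be checked numerically. The following is a minimal sketch in Python; the function names `cost_piecewise` and `cost_compact` are made up for this illustration and do not come from the lecture.

```python
import math

def cost_piecewise(h, y):
    """Two-line definition: -log(h) if y == 1, -log(1 - h) if y == 0."""
    if y == 1:
        return -math.log(h)
    return -math.log(1.0 - h)

def cost_compact(h, y):
    """One-line definition: -y*log(h) - (1 - y)*log(1 - h)."""
    return -y * math.log(h) - (1 - y) * math.log(1.0 - h)

# Compare both forms for a few hypothesis outputs strictly between 0 and 1.
for h in (0.1, 0.5, 0.9):
    for y in (0, 1):
        assert math.isclose(cost_piecewise(h, y), cost_compact(h, y))
```

The check only covers hypothesis outputs strictly between 0 and 1, since log(0) is undefined.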

We can therefore write our overall cost function for logistic regression as follows. It is this 1 over m of the sum of these cost functions. And plugging in the definition for the cost that we worked out earlier, we end up with this. And we just put the minus sign outside. And why do we choose this particular function, when it looks like there could be other cost functions we could have chosen? Although I won't have time to go into great detail of this in this course, this cost function can be derived from statistics using the principle of maximum likelihood estimation, which is an idea in statistics for how to efficiently find parameters theta for different models. And it also has a nice property that it is convex. So this is the cost function that essentially everyone uses when fitting logistic regression models. If you don't understand the terms that I just said, if you don't know what the principle of maximum likelihood estimation is, don't worry about it. But it's just a deeper rationale and justification behind this particular cost function than I have time to go into in this class.

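The cost function J(θ) for logistic regression can be written as follows.

J(θ) = -(1/m) Σ [ y^(i) log(hθ(x^(i))) + (1 - y^(i)) log(1 - hθ(x^(i))) ], summed over i = 1, ..., m

We replace the Cost() part of J(θ) with the one-line equation derived above and move the minus sign to the front. Other cost functions could have been chosen, but there is a reason for using this one, although the lecture does not go into detail. This cost function can be derived from statistics using the principle of maximum likelihood estimation, an idea from statistics for efficiently finding the parameters θ of a model. It also has the nice property of being convex. For these reasons, it is the cost function that essentially everyone uses when fitting logistic regression models. You do not need to know the principle of maximum likelihood estimation to follow along; it simply provides a deeper rationale and justification for this cost function.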
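As a rough sketch of how J(θ) might be computed, the following Python code evaluates the cost over a small training set. The helper names (`sigmoid`, `hypothesis`, `cost_J`) and the toy data are assumptions made for this example, and each example x is assumed to start with the bias feature x_0 = 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = sigmoid(theta^T x); x[0] is assumed to be the bias feature 1."""
    return sigmoid(sum(t * xj for t, xj in zip(theta, x)))

def cost_J(theta, X, y):
    """J(theta) = -(1/m) * sum(y*log(h) + (1 - y)*log(1 - h))."""
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = hypothesis(theta, x_i)
        total += y_i * math.log(h) + (1 - y_i) * math.log(1.0 - h)
    return -total / m

# Made-up toy data: one feature plus the bias term x_0 = 1.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
print(cost_J([0.0, 0.0], X, y))  # with theta = 0 every h is 0.5, so J = log(2)
```

With all parameters at zero, the hypothesis outputs 0.5 for every example, so the cost equals log 2 regardless of the labels. That makes a convenient sanity check for an implementation.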

Given this cost function, in order to fit the parameters, what we're going to do then is try to find the parameters theta that minimize J of theta. So if we try to minimize this, this would give us some set of parameters theta. Finally, if we're given a new example with some set of features x, we can then take the thetas that we fit to our training set and output our prediction as this. And just to remind you, the output of my hypothesis I'm going to interpret as the probability that y is equal to one, given the input x and parameterized by theta. You can think of this as just my hypothesis estimating the probability that y is equal to one.

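To find the parameters θ that fit the training data, we minimize the cost function J(θ). Minimizing J(θ) gives us a set of parameters θ fitted to the training set.

Then, given a new example with some set of features x, we can plug it into the hypothesis hθ(x) with the fitted θ to output a prediction. To recap, given the input x and the parameters θ, the output of the hypothesis hθ(x) is the probability that y = 1, that is, P(y = 1 | x; θ). In other words, the hypothesis hθ(x) estimates the probability that y = 1.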
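To make the prediction step concrete, here is a small sketch that evaluates the hypothesis on a new example. The parameter values are hypothetical, not fitted to any real data, and the 0.5 threshold used to turn the probability into a label is an assumption carried over from the usual decision-boundary convention.

```python
import math

def predict_proba(theta, x):
    """Estimate P(y = 1 | x; theta) with the sigmoid hypothesis."""
    z = sum(t * xj for t, xj in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

theta = [-4.0, 2.0]  # hypothetical fitted parameters
x_new = [1.0, 3.0]   # bias feature x_0 = 1 and a single feature value 3.0

p = predict_proba(theta, x_new)  # theta^T x = -4 + 2 * 3 = 2
label = 1 if p >= 0.5 else 0     # assumed threshold of 0.5
print(p, label)
```

Here θ^T x = 2, so the estimated probability is above 0.5 and the example is labeled 1.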

So all that remains to be done is figure out how to actually minimize J of theta as a function of theta so that we can actually fit the parameters to our training set. The way we're going to minimize the cost function is using gradient descent. Here's our cost function, and if we want to minimize it as a function of theta, here's our usual template for gradient descent, where we repeatedly update each parameter by updating it as itself minus the learning rate alpha times this derivative term. If you know some calculus, feel free to take this term and try to compute the derivative yourself and see if you can simplify it to the same answer that I get. But even if you don't know calculus, don't worry about it.

If you actually compute this, what you get is this equation, and I'll just write it out here. It's the sum from i equals one through m of essentially the error times x_j^(i). So if you take this partial derivative term and plug it back in here, we can then write out our gradient descent algorithm as follows.

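So what remains is to find a way to minimize J(θ) as a function of θ, so that the parameters fit the training set. We minimize the cost function with gradient descent.

Gradient descent repeatedly updates each parameter θj as itself minus the learning rate α times the partial derivative of J(θ) with respect to θj. If you are comfortable with calculus, you can compute this derivative yourself; if not, don't worry, because the result is the same either way. Substituting the computed derivative into the gradient descent rule gives the following update, applied simultaneously to every θj.

θj := θj - α (1/m) Σ (hθ(x^(i)) - y^(i)) x_j^(i), summed over i = 1, ..., m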
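For readers who want to check the derivative, here is one way to work it out. It assumes the sigmoid hypothesis and the cost function J(θ) from this lecture; the intermediate steps are a sketch of the standard calculation, not taken from the lecture slides.

```latex
\begin{aligned}
h_\theta(x) &= \frac{1}{1 + e^{-\theta^T x}}, \qquad
\frac{\partial h_\theta(x)}{\partial \theta_j}
  = h_\theta(x)\,\bigl(1 - h_\theta(x)\bigr)\,x_j \\
\frac{\partial J(\theta)}{\partial \theta_j}
&= -\frac{1}{m}\sum_{i=1}^{m}
   \left[\frac{y^{(i)}}{h_\theta(x^{(i)})}
       - \frac{1 - y^{(i)}}{1 - h_\theta(x^{(i)})}\right]
   h_\theta(x^{(i)})\bigl(1 - h_\theta(x^{(i)})\bigr)\,x_j^{(i)} \\
&= -\frac{1}{m}\sum_{i=1}^{m}
   \left[y^{(i)}\bigl(1 - h_\theta(x^{(i)})\bigr)
       - \bigl(1 - y^{(i)}\bigr)\,h_\theta(x^{(i)})\right] x_j^{(i)} \\
&= \frac{1}{m}\sum_{i=1}^{m}
   \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_j^{(i)}
\end{aligned}
```

The bracket in the third line simplifies to y^(i) - hθ(x^(i)), which is why the final expression is just the prediction error times the feature value, averaged over the training set.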

And all I've done is I took the derivative term for the previous slide and plugged it in there.

So if you have n features, you would have a parameter vector theta, with parameters theta 0, theta 1, theta 2, down to theta n. And you will use this update to simultaneously update all of your values of theta. Now, if you take this update rule and compare it to what we were doing for linear regression, you might be surprised to realize that, well, this equation was exactly what we had for linear regression. In fact, if you look at the earlier videos, and look at the update rule, the gradient descent rule for linear regression, it looked exactly like what I drew here inside the blue box. So are linear regression and logistic regression different algorithms or not?

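To recap:

With n features, we have a parameter vector θ with parameters θ0, θ1, θ2, ..., θn. Gradient descent updates all the parameters θ simultaneously. The gradient descent update formula for logistic regression looks exactly the same as the one for linear regression, which may come as a surprise. So are linear regression and logistic regression the same algorithm?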

Well, this is resolved by observing that for logistic regression, what has changed is that the definition for this hypothesis has changed. So whereas for linear regression we had h(x) equals theta transpose x, now this definition of h(x) has changed, and is instead now one over one plus e to the negative theta transpose x. So even though the update rule looks cosmetically identical, because the definition of the hypothesis has changed, this is actually not the same thing as gradient descent for linear regression.

In an earlier video, when we were talking about gradient descent for linear regression, we had talked about how to monitor gradient descent to make sure that it is converging. I usually apply that same method to logistic regression too, to monitor gradient descent and make sure it's converging correctly.

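Only the form of the gradient descent update is the same.

The definitions of the hypothesis in logistic regression and linear regression are different: linear regression uses hθ(x) = θ^T x, while logistic regression uses hθ(x) = 1 / (1 + e^(-θ^T x)). Even though the simultaneous update rule looks cosmetically identical, the hypothesis inside it differs, so gradient descent for logistic regression is not the same as gradient descent for linear regression.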
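One way to see this is to write a single update function and pass the hypothesis in as an argument. The sketch below is illustrative only: `gradient_step`, `linear_h`, and `logistic_h` are names invented here, and the two-example data set is made up.

```python
import math

def linear_h(theta, x):
    """Linear regression hypothesis: theta^T x."""
    return sum(t * xj for t, xj in zip(theta, x))

def logistic_h(theta, x):
    """Logistic regression hypothesis: 1 / (1 + e^(-theta^T x))."""
    return 1.0 / (1.0 + math.exp(-linear_h(theta, x)))

def gradient_step(theta, X, y, alpha, hypothesis):
    """One simultaneous update: theta_j -= alpha * (1/m) * sum((h - y) * x_j)."""
    m = len(X)
    errors = [hypothesis(theta, xi) - yi for xi, yi in zip(X, y)]
    return [t - alpha * sum(e * xi[j] for e, xi in zip(errors, X)) / m
            for j, t in enumerate(theta)]

X = [[1.0, 1.0], [1.0, 2.0]]  # bias feature x_0 = 1 plus one feature
y = [0, 1]
theta0 = [0.0, 0.0]

step_linear = gradient_step(theta0, X, y, 0.1, linear_h)
step_logistic = gradient_step(theta0, X, y, 0.1, logistic_h)
print(step_linear, step_logistic)  # same rule, different resulting parameters
```

The update code is shared, yet the two calls produce different parameters from the same starting point, because the errors hθ(x) - y are computed with different hypotheses.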

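An earlier lecture on gradient descent for linear regression explained how to monitor whether gradient descent is converging. The same method applies to checking that gradient descent converges properly for logistic regression.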
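A minimal sketch of that monitoring idea is to record J(θ) after every iteration and confirm that it keeps decreasing. The function names, the learning rate of 0.1, the iteration count, and the toy data below are all assumptions chosen for illustration.

```python
import math

def h(theta, x):
    """Sigmoid hypothesis; x[0] is assumed to be the bias feature 1."""
    z = sum(t * xj for t, xj in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def cost_J(theta, X, y):
    """J(theta) = -(1/m) * sum(y*log(h) + (1 - y)*log(1 - h))."""
    total = sum(yi * math.log(h(theta, xi)) + (1 - yi) * math.log(1.0 - h(theta, xi))
                for xi, yi in zip(X, y))
    return -total / len(X)

def gradient_descent(X, y, alpha=0.1, iterations=200):
    """Run batch gradient descent and record J(theta) after every iteration."""
    m, n_params = len(X), len(X[0])
    theta = [0.0] * n_params
    history = [cost_J(theta, X, y)]
    for _ in range(iterations):
        errors = [h(theta, xi) - yi for xi, yi in zip(X, y)]
        # Build the whole gradient from the old theta, then update every
        # parameter at once (simultaneous update).
        grad = [sum(e * xi[j] for e, xi in zip(errors, X)) / m
                for j in range(n_params)]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
        history.append(cost_J(theta, X, y))
    return theta, history

# Made-up toy data: one feature plus the bias term x_0 = 1.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
theta, history = gradient_descent(X, y)
print(history[0], history[-1])  # J should shrink if alpha is small enough
```

Plotting `history` against the iteration number gives the convergence curve described in the earlier lecture. If the values increase or oscillate, the learning rate is probably too large.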

And hopefully, you can figure out how to apply that technique to logistic regression yourself. When implementing logistic regression with gradient descent, we have all of these different parameter values, theta zero down to theta n, that we need to update using this expression. And one thing we could do is have a for loop. So for i equals zero to n, or for i equals one to n plus one. So update each of these parameter values in turn. But of course rather than using a for loop, ideally we would also use a vectorized implementation. So that a vectorized implementation can update all of these n plus one parameters all in one fell swoop. And to check your own understanding, you might see if you can figure out how to do the vectorized implementation with this algorithm as well.

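After working through this, you should be able to apply that technique to logistic regression yourself. When implementing logistic regression with gradient descent, every parameter from θ0 to θn must be updated simultaneously using this rule. One option is a for loop, for i = 0:n or for i = 1:n+1, that updates each parameter in turn. Ideally, though, you would use a vectorized implementation instead of a for loop, which updates all n + 1 parameters in one step.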
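The two approaches can be compared side by side. This sketch uses NumPy, which is an assumption of this example (the course itself uses Octave); `step_loop` and `step_vectorized` are hypothetical names, and the data is made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step_loop(theta, X, y, alpha):
    """Update each parameter in a for loop, using predictions from the old theta."""
    m, n_params = X.shape
    h = sigmoid(X @ theta)  # computed once so the update stays simultaneous
    new_theta = theta.copy()
    for j in range(n_params):
        new_theta[j] = theta[j] - alpha * np.sum((h - y) * X[:, j]) / m
    return new_theta

def step_vectorized(theta, X, y, alpha):
    """Update all parameters at once: theta - (alpha/m) * X^T (h - y)."""
    m = X.shape[0]
    return theta - (alpha / m) * (X.T @ (sigmoid(X @ theta) - y))

# Made-up toy data: first column is the bias feature x_0 = 1.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.zeros(2)

assert np.allclose(step_loop(theta, X, y, 0.1), step_vectorized(theta, X, y, 0.1))
```

In the loop version, the predictions are computed once before the loop and the results go into a copy of θ. Recomputing the hypothesis inside the loop with partially updated parameters would no longer be a simultaneous update.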

So, now you know how to implement gradient descent for logistic regression. There was one last idea that we had talked about earlier, for linear regression, which was feature scaling. We saw how feature scaling can help gradient descent converge faster for linear regression. The idea of feature scaling also applies to gradient descent for logistic regression. If we have features that are on very different scales, then applying feature scaling can also make gradient descent run faster for logistic regression.

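That covers the gradient descent algorithm for logistic regression. In linear regression, feature scaling helps gradient descent converge to the global minimum faster. The same idea applies to logistic regression: if the features have very different ranges of values, applying feature scaling can make gradient descent run faster.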
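As a reminder of what feature scaling can look like in practice, here is a small sketch that standardizes one feature column by subtracting its mean and dividing by its standard deviation. Dividing by the standard deviation rather than the range is a choice made for this example, and the feature values are invented.

```python
import statistics

def scale_features(column):
    """Mean-normalize a feature column: (value - mean) / standard deviation."""
    mu = statistics.mean(column)
    sigma = statistics.pstdev(column)
    return [(v - mu) / sigma for v in column]

sizes = [850.0, 1200.0, 1600.0, 2350.0]  # hypothetical feature on a large scale
scaled = scale_features(sizes)
print(scaled)  # values now centered on 0 with unit spread
```

The bias feature x_0 = 1 should be left unscaled, since its standard deviation is zero. The mean and standard deviation computed on the training set would also need to be reused when scaling new examples.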

So that's it, you now know how to implement logistic regression, and this is a very powerful, and probably the most widely used, classification algorithm in the world. And you now know how to get it to work for yourself.

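This lecture covered the gradient descent algorithm for logistic regression. Logistic regression is a very powerful classification algorithm, and probably the most widely used one in the world.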