by 라인하트 Oct 19. 2020

Andrew Ng's Machine Learning (6-4): Logistic Regression Cost Function

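Professor Andrew Ng, co-founder of the online learning platform Coursera, is a leading figure in artificial intelligence. The introductory machine learning course he taught at Stanford University is available for free as an online course on Coursera (Coursera.org). It is an essential course for newcomers to machine learning, and one that anyone studying AI and machine learning on their own will naturally come across.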

Logistic Regression

Logistic Regression Model

Cost Function

In this video, we'll talk about how to fit the parameters theta for logistic regression. In particular, I'd like to define the optimization objective, or the cost function that we'll use to fit the parameters.

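This lecture explains how to find the parameters θ that best fit the logistic regression hypothesis to the training set. We define the cost function, that is, the optimization objective, used to fit θ.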

Here's the supervised learning problem of fitting a logistic regression model. We have a training set of m training examples and, as usual, each of our examples is represented by a feature vector that's n plus one dimensional, and as usual we have x0 equals one. The first feature, or zeroth feature, is always equal to one. And because this is a classification problem, our training set has the property that every label y is either 0 or 1. This is the hypothesis, and the parameters of the hypothesis are this theta over here. And the question that I want to talk about is, given this training set, how do we choose, or how do we fit, the parameters theta?

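Here is the supervised learning problem of fitting a logistic regression model.

The training set has m examples, and each example x is an (n+1)-dimensional feature vector with the intercept term x0 = 1. Every label y is either 0 or 1. The hypothesis is hθ(x) = 1/(1 + e^(-θ^T x)), with parameters θ. How do we choose the parameters θ?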
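The setup above can be sketched in a few lines of Python. This is only an illustration: the function names `sigmoid` and `hypothesis` and the example values of `theta` and `x` are assumptions chosen for this sketch, not something given in the lecture.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x must already include x0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# Hypothetical example with n = 2 features, so x has n + 1 = 3 entries.
theta = [0.0, 1.0, -1.0]
x = [1.0, 2.0, 2.0]        # x0 = 1 is the intercept term
h = hypothesis(theta, x)   # theta^T x = 0 here, so h = 0.5
```

Whatever θ is chosen, the output stays strictly between 0 and 1, which is why it can be read as the estimated probability that y = 1.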

Back when we were developing the linear regression model, we used the following cost function. I've written this slightly differently where instead of 1 over 2m, I've taken a one-half and put it inside the summation instead. Now I want to use an alternative way of writing out this cost function. Which is that instead of writing out this squared error term here, let's write in here cost of h of x, y, and I'm going to define that term, cost of h of x, y, to be equal to this. Just equal to this one-half of the squared error. So now we can see more clearly that the cost function is a sum over my training set, which is 1 over m times the sum over my training set of this cost term here.

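For the linear regression hypothesis, we used the cost function J(θ) to find the parameters θ. Logistic regression also uses a cost function J(θ), but with a slightly modified form.

First, move the 1/2 in the linear regression cost function J(θ) inside the Σ (summation). Then replace the expression after the Σ with Cost(hθ(x^(i)), y^(i)). In other words, the cost function J(θ) is the sum of the cost over every training example, divided by the number of training examples m.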

And to simplify this equation a little bit more, it's going to be convenient to get rid of those superscripts. So just define cost of h of x comma y to be equal to one half of this squared error. And the interpretation of this cost function is that this is the cost I want my learning algorithm to have to pay if it outputs that value, if its prediction is h of x, and the actual label was y. So just cut off the superscripts, right, and no surprise, for linear regression the cost we've defined is one-half times the squared difference between what I predicted and the actual value that we have for y. Now this cost function worked fine for linear regression. But here, we're interested in logistic regression. If we could minimize this cost function that is plugged into J here, that will work okay.

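To simplify the Cost() equation, drop the superscripts that index the training examples. The Cost() function is then one half of the squared error between hθ(x) and the actual label y.

The Cost() function is the cost the algorithm pays when it outputs the prediction hθ(x) for a training example whose actual label is y. This cost function worked well for linear regression. If we could minimize the same squared-error cost function J(θ) for logistic regression, it would work without any problem.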
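Written out, the rearrangement described above looks as follows. The notation follows the lecture's description (the 1/2 moved inside the sum, the superscripts dropped in the per-example cost); the typesetting itself is a reconstruction, since the original slide is not reproduced here.

```latex
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}\big(h_\theta(x^{(i)}),\, y^{(i)}\big),
\qquad
\mathrm{Cost}\big(h_\theta(x),\, y\big) = \frac{1}{2}\big(h_\theta(x) - y\big)^2
```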

But it turns out that if we use this particular cost function, this would be a non-convex function of the parameters theta. Here's what I mean by non-convex. We have some cost function J of theta, and for logistic regression, this function h here has a nonlinearity, that is, one over one plus e to the negative theta transpose x. So this is a pretty complicated nonlinear function. And if you take the function, plug it in here, and then take this cost function and plug it in there, and then plot what J of theta looks like, you find that J of theta can look like a function that's like this, with many local optima. And the formal term for this is that this is a non-convex function. And you can kind of tell, if you were to run gradient descent on this sort of function, it is not guaranteed to converge to the global minimum.

Whereas in contrast, what we would like is to have a cost function J of theta that is convex, that is, a single bowl-shaped function that looks like this, so that if you run gradient descent, we would be guaranteed that it would converge to the global minimum. And the problem with using this squared cost function is that because of this very nonlinear function that appears in the middle here, J of theta ends up being a non-convex function if you were to define it as a squared cost function. So what we'd like to do is instead come up with a different cost function that is convex, so that we can apply a great algorithm like gradient descent and be guaranteed to find the global minimum.

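However, if we use this squared-error cost with the logistic regression hypothesis, J(θ) becomes a non-convex function of the parameters θ. The reason is that hθ(x) = 1/(1 + e^(-θ^T x)) is a highly nonlinear function, and plugging it into the squared error produces a complicated cost surface. The left-hand plot in the lecture shows such a non-convex J(θ): a function with many local optima. On a non-convex function, gradient descent is not guaranteed to converge to the global minimum.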

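In contrast, the right-hand plot shows a convex J(θ), a single bowl-shaped curve. On a convex function, gradient descent converges to the global minimum. The problem is that because the sigmoid function is so nonlinear, the squared-error cost function J(θ) for logistic regression is non-convex. We therefore need a different, convex cost function J(θ) for which gradient descent is guaranteed to find the global minimum.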
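A small numerical check can make the non-convexity concrete. The sketch below assumes a deliberately tiny, hypothetical setup (one parameter, one training example with x = 1 and y = 1) and tests the midpoint condition that every convex function must satisfy: the value at the midpoint of two points may not exceed the average of the values at those points. It shows only that convexity fails for the squared-error cost; it does not reproduce the many local optima drawn in the lecture.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def squared_cost(theta, x=1.0, y=1.0):
    """J(theta) = 1/2 * (h_theta(x) - y)^2 for one hypothetical example."""
    return 0.5 * (sigmoid(theta * x) - y) ** 2

# Two parameter values (chosen for illustration) and their midpoint.
a, b = -6.0, -2.0
mid = (a + b) / 2.0

chord = (squared_cost(a) + squared_cost(b)) / 2.0  # average of the endpoints
curve = squared_cost(mid)                          # value at the midpoint

# A convex function would satisfy curve <= chord; here it does not.
violates_convexity = curve > chord
print(f"J(mid) = {curve:.4f}, chord average = {chord:.4f}")
print("midpoint condition violated:", violates_convexity)
```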

Here's the cost function that we're going to use for logistic regression. We're going to say that the cost, or the penalty that the algorithm pays, if it outputs the value of h(x), so if this is some number like 0.7, it predicts the value h of x, and the actual class label turns out to be y. The cost is going to be -log(h(x)) if y = 1 and -log(1 - h(x)) if y = 0. This looks like a pretty complicated function, but let's plot this function to gain some intuition about what it's doing. Let's start off with the case of y = 1. If y = 1, then the cost function is -log(h(x)). And if we plot that, so let's say that the horizontal axis is h(x), so we know that a hypothesis is going to output a value between 0 and 1. Right, so h(x), that varies between 0 and 1. If you plot what this cost function looks like, you find that it looks like this.

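Here is the Cost() function for logistic regression. Cost(hθ(x), y) is the cost, or penalty, the algorithm pays when the hypothesis outputs hθ(x), a real number such as 0.7, and the actual label turns out to be y.

The cost is -log(hθ(x)) if y = 1 and -log(1 - hθ(x)) if y = 0. It looks complicated, but plotting it makes it much easier to understand. Start with y = 1, where the cost is -log(hθ(x)). The horizontal axis is hθ(x), which ranges between 0 and 1. The cost in this case is the blue curve.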
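The two-case definition translates directly into code. This is a minimal sketch: the function name `cost` is an assumption, the logarithm is the natural log, and the function expects 0 < h < 1 (it does not guard against h being exactly 0 or 1).

```python
import math

def cost(h, y):
    """Per-example logistic regression cost for prediction h and label y."""
    if y == 1:
        return -math.log(h)        # small when h is close to 1
    return -math.log(1.0 - h)      # small when h is close to 0

# The lecture's example prediction h(x) = 0.7, under both possible labels.
cost_if_positive = cost(0.7, 1)   # about 0.357
cost_if_negative = cost(0.7, 0)   # about 1.204
```

The same prediction of 0.7 is cheap when the label is 1 and more than three times as expensive when the label is 0.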

One way to see why the plot looks like this is because if you were to plot log z with z on the horizontal axis, then that looks like that. And it approaches minus infinity, right? So this is what the log function looks like. And this is 0, this is 1. Here, z is of course playing the role of h of x. And so -log z will look like this. Just flipping the sign, minus log z, and we're interested only in the range where z goes between zero and one, so get rid of that. And so we're just left with, you know, this part of the curve, and that's what this curve on the left looks like.

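To understand the shape of the plot, draw log(z) with z on the horizontal axis. log(z) approaches minus infinity as z approaches 0, is negative for z below 1, equals 0 at z = 1, and is positive beyond 1. Here z plays the role of h(x). The graph of -log(z) is the graph of log(z) flipped about the horizontal axis. Since h(x) only takes values between 0 and 1, we only need that range and discard the rest. What remains is the pink part of the -log(z) curve, which matches the curve on the left.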

Now, this cost function has a few interesting and desirable properties. First, notice that if y is equal to 1 and h(x) is equal to 1, in other words, if the hypothesis predicts exactly 1 and y is exactly equal to what it predicted, then the cost = 0, right? That corresponds to this point down here. The curve doesn't actually flatten out there; the curve is still going. We're only considering the case of y = 1 here. But if h(x) = 1 then the cost is down here, is equal to 0. And that's where we'd like it to be, because if we correctly predict the output y, then the cost is 0.

This Cost() function has a few interesting properties. First, if y = 1 and hθ(x) = 1, that is, the hypothesis predicts exactly 1 and the prediction matches the label y, then the cost is 0. This is the blue point at the bottom of the curve; note that the curve does not flatten out there. Within the y = 1 case, hθ(x) = 1 is the only point where the cost is 0.

But now notice also that as h(x) approaches 0, so as the output of a hypothesis approaches 0, the cost blows up and it goes to infinity. And what this does is capture the intuition that if a hypothesis outputs 0, that's like the hypothesis saying the chance of y equals 1 is equal to 0.

It's kind of like going to our medical patients and saying the probability that you have a malignant tumor, the probability that y = 1, is zero. So, it's like absolutely impossible that your tumor is malignant. But if it turns out that the patient's tumor actually is malignant, so if y is equal to one, even after we told them that the probability of it happening is zero, so it's absolutely impossible for it to be malignant. If we told them this with that level of certainty and we turn out to be wrong, then we penalize the learning algorithm by a very, very large cost. And that's captured by having this cost go to infinity if y equals 1 and h(x) approaches 0. This slide considers the case of y equals 1. Let's look at what the cost function looks like for y equals 0.

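But as h(x) approaches 0, that is, as the hypothesis output approaches 0, the cost grows toward infinity. This captures the intuition for what should happen when the hypothesis says the probability of y = 1 is 0.

For example, consider a patient with a tumor. Telling the patient that the probability of a malignant tumor (y = 1) is 0 amounts to saying it is absolutely impossible for the tumor to be malignant. But suppose the tumor actually is malignant, so y = 1 contrary to the prediction, after we have already told the patient the probability was 0%. When a prediction made with that level of certainty turns out to be wrong, the learning algorithm is penalized with a very large cost: if y = 1, the cost goes to infinity as h(x) approaches 0. Next we look at the Cost() function for y = 0.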
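To see how quickly the penalty grows for the y = 1 case, the sketch below evaluates -log(h) for a handful of predictions that move toward 0. The particular values in `hs` and the name `cost_y1` are arbitrary choices for illustration.

```python
import math

def cost_y1(h):
    """Cost when the true label is y = 1 and the prediction is h."""
    return -math.log(h)

# A fully confident, correct prediction costs nothing.
perfect = cost_y1(1.0)

# Predictions that say "malignant is increasingly unlikely" while y = 1.
hs = [0.5, 0.1, 0.01, 0.001, 1e-6]
costs = [cost_y1(h) for h in hs]

for h, c in zip(hs, costs):
    print(f"h(x) = {h:g}  ->  cost = {c:.4f}")
```

Each step toward h(x) = 0 adds to the cost without bound, which is the numerical counterpart of the curve shooting up at the left edge of the plot.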

If y is equal to 0, then the cost looks like this, it looks like this expression over here, and if you plot the function -log(1-z), what you get is the cost function actually looks like this. So it goes from 0 to 1, something like that, and so if you plot the cost function for the case of y equals 0, you find that it looks like this. And what this curve does is it now goes up and it goes to plus infinity as h of x goes to 1, because as I was saying, if y turns out to be equal to 0 but we predicted that y is equal to 1 with almost certainty, probability 1, then we end up paying a very large cost. And conversely, if h of x is equal to 0 and y equals 0, then the hypothesis nailed it. It predicted y is equal to 0, and it turns out y is equal to 0, so at this point, the cost function is going to be 0.

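If y = 0, the cost is -log(1 - hθ(x)), whose graph is that of -log(1 - z) on the range from 0 to 1. This curve goes to plus infinity as h(x) approaches 1: if the hypothesis predicts y = 1 with near certainty but y turns out to be 0, it pays a very large cost. Conversely, if h(x) = 0 and y = 0, the cost is 0. This is the red point. Within the y = 0 case, hθ(x) = 0 is the only point where the cost is 0.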
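The y = 0 case mirrors the y = 1 case, and a short sketch shows both the zero-cost point and the blow-up near h(x) = 1. As before, the name `cost_y0` and the sample values are assumptions made for illustration.

```python
import math

def cost_y0(h):
    """Cost when the true label is y = 0 and the prediction is h."""
    return -math.log(1.0 - h)

# From a perfect prediction (h = 0) toward a confidently wrong one (h near 1).
hs = [0.0, 0.5, 0.9, 0.99, 0.999]
costs = [cost_y0(h) for h in hs]

# Mirror relationship: predicting 0.9 when y = 0 costs the same as
# predicting 0.1 when y = 1, up to floating-point rounding.
mirrored = math.isclose(cost_y0(0.9), -math.log(0.1), rel_tol=1e-9)
```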

In this video, we have defined the cost function for a single training example. The topic of convexity analysis is beyond the scope of this course, but it is possible to show that with this particular choice of cost function, this will give a convex optimization problem. The overall cost function J of theta will be convex and local optima free. In the next video we're going to take these ideas of the cost function for a single training example and develop them further, and define the cost function for the entire training set. And we'll also figure out a simpler way to write it than we have been using so far, and based on that we'll work out gradient descent, and that will give us the logistic regression algorithm.

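For simplicity, this lecture defined the Cost() function for a single training example. Convexity analysis is beyond the scope of this course, but it can be shown that with this choice of cost function the optimization problem is convex. The overall logistic regression cost function J(θ) is convex and has no local optima.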
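The convexity claim is not proven in the lecture, and the sketch below does not prove it either. It is only a numerical spot check under stated assumptions: a hypothetical four-example training set, the cost averaged over those examples, and 1000 random pairs of parameter vectors tested against the midpoint condition for convexity.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(h, y):
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

# Hypothetical training set; each x already includes x0 = 1.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]]
Y = [0, 0, 1, 1]

def J(theta):
    """Average of the per-example costs over the training set."""
    total = 0.0
    for x, y in zip(X, Y):
        h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        total += cost(h, y)
    return total / len(X)

random.seed(0)
violations = 0
for _ in range(1000):
    a = [random.uniform(-3, 3) for _ in range(2)]
    b = [random.uniform(-3, 3) for _ in range(2)]
    mid = [(ai + bi) / 2 for ai, bi in zip(a, b)]
    # Convexity requires J(mid) <= average of J(a) and J(b).
    if J(mid) > (J(a) + J(b)) / 2 + 1e-9:
        violations += 1

print("midpoint violations found:", violations)
```

Finding no violations is consistent with J(θ) being convex, but a random search like this can only fail to find a counterexample; it cannot rule one out.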

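In the next lecture, we take the Cost() function for a single training example and develop it into the cost function for the entire training set. We will also find a simpler way to write it than we have used so far, apply gradient descent to it, and arrive at the logistic regression algorithm.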

Andrew Ng's Machine Learning video lecture

Summary

The logistic regression cost function J(θ) starts from the linear regression cost function J(θ).

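The Cost() function is the cost the algorithm pays when it outputs hθ(x) for a training example whose actual label is y. With the logistic regression hypothesis hθ(x) = 1/(1 + e^(-θ^T x)), the squared-error cost function J(θ) is non-convex rather than convex. It is a complicated nonlinear function, so gradient descent is not guaranteed to reach the global minimum. We therefore need a convex cost function for which the global minimum can be found.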

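The key property of the logistic regression cost is this: if y = 1 and h(x) = 1, that is, the hypothesis predicts correctly, the cost is 0, and likewise if y = 0 and h(x) = 0 the cost is 0. If y = 0 but the hypothesis predicts h(x) = 1, the cost grows to infinity, and the same happens when y = 1 and h(x) approaches 0.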