by 라인하트 Sep 26. 2020

Andrew Ng's Machine Learning Lecture (2-2): Cost Function

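Professor Andrew Ng, co-founder of the online course platform Coursera, is a leading figure in artificial intelligence. The introductory machine learning course he taught at Stanford University is available free of charge on Coursera (Coursera.org). It is an essential course for machine learning beginners, one that anyone studying AI and machine learning on their own will naturally come across.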

Linear Regression with One Variable

Model and Cost Function

Cost Function

In this video we'll define something called the cost function. This will let us figure out how to fit the best possible straight line to our data.

This lecture defines the cost function. The cost function lets us work out how to fit the best possible straight line to the data.

In linear regression, we have a training set like the one I showed here. Remember, in our notation m was the number of training examples, so maybe m equals 47. And the form of our hypothesis, which we use to make predictions, is this linear function.

This is the training set from the previous lecture. m is the number of training examples; here m = 47. The hypothesis we fit to this data set has the following form.

hθ(x) = θ0 + θ1x

This is the linear function we use to make predictions.

To introduce a little bit more terminology, these theta 0 and theta 1 are what I call the parameters of the model. And what we're going to do in this video is talk about how to go about choosing these two parameter values, theta 0 and theta 1.

A little more terminology: θ0 and θ1 in the hypothesis are the parameters of the model. This lecture explains how to choose values for these two parameters.

With different choices of the parameters theta 0 and theta 1, we get different hypotheses, different hypothesis functions. I know some of you will probably be already familiar with what I am going to do on the slide, but just for review.

Different values of the parameters θ0 and θ1 give entirely different hypothesis functions. Some of you will already be familiar with this, so treat it as a review.

Here are a few examples. If theta 0 is 1.5 and theta 1 is 0, then the hypothesis function will look like this, because your hypothesis function will be h of x equals 1.5 plus 0 times x, which is this constant-valued function that is flat at 1.5. If theta 0 = 0 and theta 1 = 0.5, then the hypothesis will look like this, and it should pass through the point (2, 1), so that you now have h(x), or really h of theta(x), but sometimes I'll just omit theta for brevity. So h(x) will be equal to just 0.5 times x, which looks like that. And finally, if theta 0 equals 1 and theta 1 equals 0.5, then we end up with a hypothesis that looks like this. Let's see, it should pass through the point (2, 2). Like so, and this is my new h of x, or my new h subscript theta of x. As you remember, I said that this is h subscript theta of x, but as a shorthand, sometimes I'll just write this as h of x.

Here are a few examples. If θ0 = 1.5 and θ1 = 0, the hypothesis is h(x) = 1.5 + 0x, a constant function whose value is 1.5 for every x. If θ0 = 0 and θ1 = 0.5, the hypothesis is h(x) = 0 + 0.5x = 0.5x, which passes through the point (2, 1). It can also be written hθ(x), but θ is sometimes omitted for brevity. Finally, if θ0 = 1 and θ1 = 0.5, then h(x) = 1 + 0.5x, which passes through the point (2, 2). Whether you write hθ(x) or h(x), the two mean the same thing.
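The three parameter choices above can be checked with a few lines of Python. This is only a minimal sketch: the function name `hypothesis` and the sample x values are assumptions made for illustration, not something from the lecture.

```python
def hypothesis(theta0, theta1, x):
    """Univariate linear hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


# The three (theta0, theta1) pairs discussed in the lecture.
examples = [(1.5, 0.0), (0.0, 0.5), (1.0, 0.5)]

for theta0, theta1 in examples:
    # Evaluate each hypothesis at a few illustrative x values.
    values = [hypothesis(theta0, theta1, x) for x in (0, 1, 2, 3)]
    print(f"theta0={theta0}, theta1={theta1}: {values}")
```

The first hypothesis returns 1.5 for every x, the second passes through (2, 1), and the third passes through (2, 2), matching the description above.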

In linear regression, we have a training set, like maybe the one I've plotted here. What we want to do is come up with values for the parameters theta 0 and theta 1 so that the straight line we get out of this corresponds to a straight line that somehow fits the data well, like maybe that line over there. So, how do we come up with values theta 0, theta 1 that correspond to a good fit to the data? The idea is we get to choose our parameters theta 0, theta 1 so that h of x, meaning the value we predict on input x, is at least close to the values y for the examples in our training set. So in our training set, we're given a number of examples where we know x, the size of the house, and we know the actual price it was sold for. So, let's try to choose values for the parameters so that, at least in the training set, given the x in the training set, we make reasonably accurate predictions for the y values.

In linear regression we start from a plotted training set. We want values of θ0 and θ1 that give a straight line fitting the training set well. How do we find values of θ0 and θ1 that fit the data well? The idea is to choose θ0 and θ1 so that the predicted value hθ(x) for each input x is close to the actual value y in the training set. The training set gives us pairs (x, y) of house size x and actual sale price y, and once θ0 and θ1 are chosen, the predicted price hθ(x) can be computed for any house size x.

Let's formalize this. So in linear regression, what we're going to do is solve a minimization problem. So I'll write minimize over theta 0, theta 1. And I want this to be small, right? I want the difference between h(x) and y to be small. And one thing I might do is try to minimize the squared difference between the output of the hypothesis and the actual price of a house. Okay. So let's fill in some details. You remember that I was using the notation (x^(i), y^(i)) to represent the ith training example. So what I want really is to sum over my training set, sum from i = 1 to m, of the squared difference between the prediction of my hypothesis when it is given as input the size of house number i, minus the actual price that house number i was sold for. And I want to minimize the sum over my training set, sum from i equals 1 through m, of this squared error, the squared difference between the predicted price of a house and the price that it was actually sold for.

And just to remind you of notation, m here was the size of my training set, right? So my m there is my number of training examples. That hash sign is the abbreviation for "number of" training examples, okay? And to make the math a little bit easier, I'm going to actually look at 1 over m times that, so we're trying to minimize the average error, and in fact we'll minimize 1 over 2m times that. Putting the constant one-half in front just makes some of the math a little easier; minimizing one-half of something should give you the same values of the parameters theta 0, theta 1 as minimizing that function.

And just to be sure this equation is clear, right? This expression in here, h subscript theta of x^(i), this is our usual, right? That is equal to theta 0 plus theta 1 x^(i). And this notation, minimize over theta 0, theta 1, means find me the values of theta 0 and theta 1 that cause this expression to be minimized, and this expression depends on theta 0 and theta 1, okay?

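Let's formalize this. Linear regression is solved as a minimization problem over the parameters θ0 and θ1: we look for the values of θ0 and θ1 that make an error expression as small as possible. "Minimize over θ0 and θ1" is written as follows.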

minimize over θ0, θ1

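What we want to be small is the difference between hθ(x) and y. Specifically, we minimize the squared difference between the output of the hypothesis and the actual house price, (hθ(x) − y)^2.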

Let's look at the details. (x^(i), y^(i)) denotes the i-th training example. We sum the squared differences over the training set, from i = 1 to m: Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))^2.

Here x^(i) is the size of the i-th house, hθ(x^(i)) is the price predicted from that size, and y^(i) is the actual price at which the house of size x^(i) was sold. m is the number of training examples.

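To turn the sum Σ into an average, we divide by the number of training examples m. The additional factor of 1/2 is there to simplify the expression when it is later differentiated. So the sum is multiplied by 1/(2m), and the task is to find the parameters θ0 and θ1 that minimize half the mean squared error: (1/(2m)) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))^2.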

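Is the expression clear now? Here hθ(x^(i)) equals θ0 + θ1x^(i), so the whole expression depends on θ0 and θ1. "Minimize over θ0, θ1" does not mean making the parameters themselves small; it means finding the values of θ0 and θ1 that make this expression as small as possible.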
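The expression being minimized can be written out in plain Python. This is a sketch under stated assumptions: the function names `hypothesis` and `cost` and the three-point data set are made up for illustration and are not the lecture's housing data.

```python
def hypothesis(theta0, theta1, x):
    """Prediction h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Half the mean squared error: 1/(2m) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        error = hypothesis(theta0, theta1, x) - y  # prediction minus actual value
        total += error ** 2
    return total / (2 * m)


# Hypothetical toy data lying exactly on the line y = 1 + 0.5x.
xs = [0.0, 2.0, 4.0]
ys = [1.0, 2.0, 3.0]

print(cost(1.0, 0.5, xs, ys))  # the line that generated the data
print(cost(0.0, 0.5, xs, ys))  # same slope, intercept too low by 1
```

With these toy values the first call gives a cost of 0 because every prediction matches, while the second gives 0.5 because each of the three predictions is off by 1, so the squared errors sum to 3 and are divided by 2m = 6.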

So just a recap. We're posing this problem as: find me the values of theta 0 and theta 1 so that the average, the 1 over 2m times the sum of squared errors between my predictions on the training set and the actual values of the houses on the training set, is minimized. So this is going to be my overall objective function for linear regression. And just to rewrite this out a little bit more cleanly, what I'm going to do is, by convention we usually define a cost function, which is going to be exactly this, that formula I have up here. And what I want to do is minimize over theta 0 and theta 1 my function J(theta 0, theta 1). Just write this out. This is my cost function. So, this cost function is also called the squared error function. It's sometimes called the squared error cost function. And why do we take the squares of the errors? It turns out that this squared error cost function is a reasonable choice and works well for most regression problems. There are other cost functions that will work pretty well. But the squared error cost function is probably the most commonly used one for regression problems. Later in this class we'll talk about alternative cost functions as well, but this choice that we just had should be a pretty reasonable thing to try for most linear regression problems. Okay. So that's the cost function.

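To recap: we look for the values of θ0 and θ1 that minimize half the mean of the squared errors, where each error is the difference between the price predicted from a house's size and its actual price, taken over every example in the training set. This is the overall objective function for linear regression. Written more cleanly, the cost function is J(θ0, θ1) = (1/(2m)) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))^2.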

The goal is to minimize J(θ0, θ1) over θ0 and θ1. This is the cost function. It is also called the squared error function, or the squared error cost function; the name reflects that the errors are squared. The squared error function is a reasonable choice and works well for most regression problems. Other cost functions exist, but the squared error function is probably the most commonly used one for regression. Alternative cost functions are covered later in the course. That is the cost function.
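To make "minimize J over θ0 and θ1" concrete, the sketch below simply tries every pair on a small grid of candidate values and keeps the pair with the lowest cost. Brute-force grid search is swapped in here purely for illustration and is not the method taught in the course; the names `cost` and `grid_search`, the candidate grid, and the toy data are all assumptions.

```python
def cost(theta0, theta1, xs, ys):
    """Squared error cost J(theta0, theta1) = 1/(2m) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def grid_search(xs, ys, candidates):
    """Return the (theta0, theta1) pair from the grid with the smallest cost."""
    best = None
    best_cost = float("inf")
    for theta0 in candidates:
        for theta1 in candidates:
            j = cost(theta0, theta1, xs, ys)
            if j < best_cost:
                best, best_cost = (theta0, theta1), j
    return best, best_cost


# Hypothetical toy data lying exactly on the line y = 1 + 0.5x.
xs = [0.0, 2.0, 4.0]
ys = [1.0, 2.0, 3.0]
candidates = [i * 0.25 for i in range(9)]  # 0.0, 0.25, ..., 2.0

best, best_cost = grid_search(xs, ys, candidates)
print(best, best_cost)
```

On this toy data the search recovers θ0 = 1 and θ1 = 0.5 with a cost of 0, because that pair is on the grid and reproduces every data point exactly. On real data the minimum would generally fall between grid points, which is one reason a systematic minimization procedure is needed.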

So far we've just seen a mathematical definition of this cost function, J of theta 0, theta 1. In case this function seems a little bit abstract, and you still don't have a good sense of what it's doing, in the next couple of videos I'm actually going to go a little bit deeper into what the cost function J is doing and try to give you better intuition about what it is computing and why we want to use it...

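So far we have seen the mathematical definition of the cost function. The function J(θ0, θ1) may still feel abstract, and it may not yet be clear exactly what it does. The next lectures look more closely at what the cost function J does, building intuition for what it computes and why we use it.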

Andrew Ng's machine learning video lecture

Summary

The cost function J(θ0, θ1) measures half the mean squared error; minimizing it gives the parameters θ0 and θ1. This section recaps how the squared error function J(θ0, θ1) is built up.

(1) Hypothesis function

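We assume that, when the training set is plotted, the function that fits the data best is a straight line, that is, a first-degree function. The hypothesis function for a line is hθ(x) = θ0 + θ1x. This is the linear function y(x) = ax + b from middle-school math, rewritten as a hypothesis function h(x): the intercept b becomes θ0 and the slope a becomes θ1.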

(2) Squared error

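We compute the error between the prediction h(x) for an input x and the measured value y for that x in the training set: hθ(x) − y. Depending on the data point, the error can be positive or negative. If the errors were simply added, positive and negative errors would cancel and the total would understate the real error, so each error is squared to remove the sign: (hθ(x) − y)^2.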
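The cancellation problem is easy to see with two made-up data points, one over-predicted by 1 and one under-predicted by 1. The values below are hypothetical and chosen only to make the arithmetic obvious.

```python
# Hypothetical predictions h(x) and actual values y for two examples.
predictions = [3.0, 1.0]
actuals = [2.0, 2.0]

# Signed errors: the first is +1, the second is -1.
errors = [p - a for p, a in zip(predictions, actuals)]

sum_of_errors = sum(errors)                          # signs cancel
sum_of_squared_errors = sum(e ** 2 for e in errors)  # signs removed by squaring

print(errors, sum_of_errors, sum_of_squared_errors)
```

The signed errors add up to 0 even though both predictions are wrong, while the squared errors add up to 2 and so still register the misfit.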

(3) Summing the squared errors

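The squared error is applied to every example in the training set. m is the total number of examples and i is the index of an example; i = 1 is the first example. If m = 47, there are 47 error terms in total. The sum of the squared errors over the whole data set is Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))^2.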
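The summation over all m examples can be sketched as a single Python expression. The function name `sum_squared_errors` and the data are assumptions for illustration. The sketch also repeats the same data twice to show how the raw sum depends on the number of examples.

```python
def sum_squared_errors(theta0, theta1, xs, ys):
    """Sum over i = 1..m of (h_theta(x_i) - y_i)^2, with h_theta(x) = theta0 + theta1 * x."""
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical data set with m = 3 examples.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

small = sum_squared_errors(0.0, 0.5, xs, ys)          # m = 3
large = sum_squared_errors(0.0, 0.5, xs * 2, ys * 2)  # same examples repeated, m = 6

print(small, large)
```

With h(x) = 0.5x the three errors are -0.5, -1 and -1.5, so the sum of squares is 3.5. Repeating the same examples doubles the sum to 7.0 even though the line fits no worse than before.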

(4) Mean squared error

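If the squared errors over the training set are only summed, the total grows as the training set grows. To measure error in a way that does not depend on the number of examples, we take the average by dividing the sum by the number of training examples m. For example, if m = 47, the numerator is the sum of 47 squared errors and the denominator is 47: (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))^2. The cost function uses the factor 1/(2m) rather than 1/m. The reason is to simplify the expression when it is later differentiated: the derivative of x^2 is 2x, so multiplying by 1/2 cancels the 2.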
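The cancellation of the factor 2 can be made explicit by differentiating the cost function with respect to each parameter. The derivation below is a sketch that assumes the definitions used throughout this post, hθ(x) = θ0 + θ1x and J(θ0, θ1) = (1/(2m)) Σ (hθ(x^(i)) − y^(i))^2; the lecture itself does not carry out this step here.

```latex
\begin{align*}
J(\theta_0, \theta_1)
  &= \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2,
  \qquad h_\theta(x) = \theta_0 + \theta_1 x \\
\frac{\partial J}{\partial \theta_0}
  &= \frac{1}{2m} \sum_{i=1}^{m} 2 \left( h_\theta(x^{(i)}) - y^{(i)} \right)
   = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \\
\frac{\partial J}{\partial \theta_1}
  &= \frac{1}{2m} \sum_{i=1}^{m} 2 \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}
   = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}
\end{align*}
```

In both derivatives the 2 produced by differentiating the square cancels against the 1/2 in front, leaving a plain 1/m. Scaling J by a positive constant does not change which θ0 and θ1 minimize it, so the extra 1/2 is purely a convenience.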