by 라인하트 Oct 11. 2020

Andrew Ng's Machine Learning (4-6): Normal Equation

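Professor Andrew Ng, co-founder of the online learning platform Coursera, is a leading figure in the AI field. The course he taught to machine learning beginners at Stanford University is available for free on Coursera (Coursera.org). It is a must-take course for newcomers to machine learning, and one that anyone studying AI and machine learning on their own will naturally come across.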

Linear Regression with Multiple Variables

Computing Parameters Analytically

Normal Equation

In this video, we'll talk about the normal equation, which for some linear regression problems will give us a much better way to solve for the optimal value of the parameters theta. Concretely, so far the algorithm that we've been using for linear regression is gradient descent, where in order to minimize the cost function J of theta, we would take this iterative algorithm that takes many steps, multiple iterations of gradient descent, to converge to the global minimum. In contrast, the normal equation would give us a method to solve for theta analytically, so that rather than needing to run this iterative algorithm, we can instead just solve for the optimal value for theta all in one go, so that in basically one step you get to the optimal value right there.

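This lecture covers the normal equation. For certain linear regression problems, the normal equation is a better way to find the optimal value of the parameter θ. So far, the algorithm we have used for linear regression is gradient descent, which minimizes the cost function J(θ). Gradient descent takes many steps and iterations to converge to the global minimum. In contrast, the normal equation computes θ analytically. There is no need to run an iterative algorithm; the optimal θ is obtained in one go. In other words, essentially a single step is enough to reach the optimum.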

It turns out the normal equation has some advantages and some disadvantages, but before we get to that and talk about when you should use it, let's get some intuition about what this method does. For this explanatory example, let's take a very simplified cost function J of theta that's just a function of a real number theta. So, for now, imagine that theta is just a scalar value, just a real value. It's just a number, rather than a vector. Imagine that we have a cost function J that's a quadratic function of this real-valued parameter theta, so J of theta looks like that. Well, how do you minimize a quadratic function? For those of you who know a little bit of calculus, you may know that the way to minimize a function is to take derivatives and set them equal to zero. So, you take the derivative of J with respect to the parameter theta. You get some formula, which I am not going to derive, you set that derivative equal to zero, and this allows you to solve for the value of theta that minimizes J of theta. That was the simpler case, when theta was just a real number.

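The normal equation has advantages and disadvantages, but before discussing them and when to use the method, let's build some intuition for what it does. Consider a simplified cost function J(θ).

Here the parameter θ is a real number, a scalar. It is just a number, not a vector. The cost function J(θ) is a quadratic function of the real-valued parameter θ, so its graph is a parabola. How do we minimize a quadratic function?

In calculus, the way to minimize a function is to differentiate it and find where the derivative equals zero. Differentiating J(θ) gives some formula, which is not derived here. Setting that derivative to zero gives an equation, and solving it yields the θ that minimizes J(θ). This is the simpler case, where θ is just a real number.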
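As a small worked illustration of the scalar case, suppose J is a generic quadratic. The coefficients a, b, c and the numeric example are placeholders chosen here, not values from the lecture.

```latex
\begin{align*}
J(\theta) &= a\theta^{2} + b\theta + c, \qquad a > 0 \\
\frac{dJ}{d\theta} &= 2a\theta + b = 0
  \quad\Longrightarrow\quad \theta = -\frac{b}{2a} \\[4pt]
\text{e.g. } J(\theta) &= \theta^{2} - 4\theta + 5:
  \quad 2\theta - 4 = 0 \;\Longrightarrow\; \theta = 2, \quad J(2) = 1
\end{align*}
```

The condition a > 0 matters: it makes the parabola open upward, so the point where the derivative vanishes is a minimum rather than a maximum.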

In the problem that we are interested in, theta is no longer just a real number but is instead this (n+1)-dimensional parameter vector, and the cost function J is a function of this vector, of theta 0 through theta n. And the cost function looks like this, some squared-error cost function on the right. How do we minimize this cost function J? Calculus actually tells us that one way to do so is to take the partial derivative of J with respect to every parameter theta j in turn, and then set all of these to 0. If you do that, and you solve for the values of theta 0, theta 1, up to theta n, then this would give you the values of theta that minimize the cost function J. If you actually work through the calculus and work through the solution for the parameters theta 0 through theta n, the derivation ends up being somewhat involved. What I am going to do in this video is not go through the derivation, which is kind of long and kind of involved. What I want to do is just tell you what you need to know in order to implement this process, so you can solve for the values of the thetas that correspond to where the partial derivatives are equal to zero or, equivalently, the values of theta that minimize the cost function J of theta. I realize that some of the comments I made make sense only to those of you who are familiar with calculus. If you're less familiar with calculus, don't worry about it. I'm just going to tell you what you need to know in order to implement this algorithm and get it to work.

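In the problem we actually care about, the parameter θ is not a real number but an (n+1)-dimensional parameter vector, and the cost function is J(θ0, θ1, ..., θn). How can we minimize J(θ0, θ1, ..., θn)?

The minimum can be found with partial derivatives. Differentiate J with respect to each parameter θj in turn and find the values at which all of the partial derivatives equal zero. Solving for θ0, θ1, ..., θn gives the optimal value of every parameter, which is where the cost function J(θ0, θ1, ..., θn) reaches its minimum.

This lecture does not walk through the partial derivatives, which are long and involved. It covers only what is needed to solve for the θ at which the partial derivatives are zero or, put differently, the θ at which the cost function J(θ) is minimized. This may be unfamiliar to anyone not used to calculus, but there is no need to worry. It is enough to know what is needed to implement the learning algorithm and get it to work.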
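For readers who want to see where the skipped derivation leads, here is a sketch. It assumes the squared-error cost with the 1/(2m) scaling used for linear regression earlier in the course, and it assumes X^T X is invertible. X denotes the matrix whose rows are the training examples and y the vector of targets, both of which are constructed later in this post.

```latex
\begin{align*}
J(\theta) &= \frac{1}{2m}\sum_{i=1}^{m}\bigl(\theta^{T}x^{(i)} - y^{(i)}\bigr)^{2} \\
\frac{\partial J}{\partial \theta_j}
  &= \frac{1}{m}\sum_{i=1}^{m}\bigl(\theta^{T}x^{(i)} - y^{(i)}\bigr)\,x^{(i)}_{j} = 0,
  \qquad j = 0, 1, \dots, n \\
\Longrightarrow\quad X^{T}(X\theta - y) &= 0 \\
\Longrightarrow\quad X^{T}X\,\theta &= X^{T}y \\
\Longrightarrow\quad \theta &= (X^{T}X)^{-1}X^{T}y
\end{align*}
```

Stacking the n+1 partial-derivative conditions into a single vector equation is what turns the problem into one linear system.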

For the example that I want to use as a running example, let's say that I have m = 4 training examples. In order to implement this normal equation method, what I'm going to do is the following. I'm going to take my data set, so here are my four training examples. In this case, let's assume that these four examples are all the data I have.

Here is a training set with m = 4 examples. The following steps set up what is needed to apply the normal equation.

What I am going to do is take my data set and add an extra column that corresponds to my extra feature, x0, that always takes on the value 1. Then I'm going to construct a matrix called X that basically contains all of the features from my training data. So concretely, here are all my features, and we're going to take all those numbers and put them into this matrix X, okay? So just copy the data over one column at a time, and then I am going to do something similar for the y's. I am going to take the values that I'm trying to predict and construct a vector, like so, and call that vector y. So X is going to be an m by (n+1)-dimensional matrix, and y is going to be an m-dimensional vector, where m is the number of training examples and n is the number of features; it is n+1 because of this extra feature x0 that I added.

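First, add a column for the feature x0 to the data set. The value of x0 is always 1. Next, build the matrix X, which contains all the features of the training set. Copy the data into X one column at a time. Then build the vector y by copying the house prices as they are. X is an m x (n+1) matrix and y is an m-dimensional vector, where m is the number of training examples and n is the number of features. The dimension is n+1 because x0 was added.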
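The construction above can be sketched in a few lines of Python. The numbers in `raw_data` are hypothetical values made up for this sketch (the slide's table is not reproduced in this post), and `build_design_matrix` is an illustrative helper name, not something from the course.

```python
# Hypothetical training set with m = 4 examples and n = 2 features.
# Each tuple is (size, bedrooms, price); the values are made up for illustration.
raw_data = [
    (100, 3, 300),
    (60, 2, 180),
    (80, 2, 250),
    (40, 1, 120),
]


def build_design_matrix(rows):
    """Return (X, y) where X has a leading column of ones (x0 = 1)."""
    X = [[1, *row[:-1]] for row in rows]  # prepend x0 = 1 to the n features
    y = [row[-1] for row in rows]         # the last entry of each row is the target
    return X, y


X, y = build_design_matrix(raw_data)
m, n_plus_1 = len(X), len(X[0])
print(m, n_plus_1)  # 4 3, i.e. X is m x (n+1)
```

Each row of `X` is one training example with the constant 1 in front, which is what lets θ0 act as the intercept.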

Finally if you take your matrix X and you take your vector Y, and if you just compute this, and set theta to be equal to X transpose X inverse times X transpose Y, this would give you the value of theta that minimizes your cost function. There was a lot that happened on the slides and I work through it using one specific example of one dataset. Let me just write this out in a slightly more general form and then let me just, and later on in this video let me explain this equation a little bit more.

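Finally, the parameter vector θ is computed from the matrix X and the vector y:

θ = (X^T X)^-1 X^T y

This θ is the value that minimizes the cost function J(θ). That was a lot to cover using one specific dataset, so the rest of this post restates the construction in a more general form and then explains the equation.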

It is not yet entirely clear how to do this. In the general case, let us say we have m training examples, (x1, y1) up to (xm, ym), and n features. So each training example x(i) may look like a vector like this, an (n+1)-dimensional feature vector. The way I'm going to construct the matrix X, which is also called the design matrix, is as follows. Each training example gives me a feature vector like this, an (n+1)-dimensional vector. What I'm going to do is take the first training example, so that's a vector, take its transpose so it ends up being this long flat thing, and make x1 transpose the first row of my design matrix. Then I am going to take my second training example, x2, take the transpose of that and put that as the second row of X, and so on, down until my last training example. Take the transpose of that, and that's the last row of my matrix X. And so that makes my matrix X an m by (n+1)-dimensional matrix.

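How to apply the formula for θ is not yet entirely clear. First, for the four training examples, the matrix X can be written in terms of the transposed feature vectors (x^(i))^T.

Now generalize. There are m training examples, from (x^(1), y^(1)) to (x^(m), y^(m)). With n features, each training example x^(i) is an (n+1)-dimensional feature vector. The matrix X is also called the design matrix.

The design matrix X is built from these (n+1)-dimensional feature vectors. Write the first training example as the feature vector x^(1), then transpose it so that it becomes a row vector, and place (x^(1))^T in the first row of X. Transpose the second training example x^(2) and place it in the second row of X. Repeat down to the last training example x^(m), whose transpose becomes the last row of X. The matrix X is therefore an m x (n+1) matrix.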
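Written out, the construction just described looks as follows. This is only a restatement of the text in matrix notation, using the post's convention that x0 = 1.

```latex
\[
x^{(i)} =
\begin{bmatrix} x^{(i)}_{0} \\ x^{(i)}_{1} \\ \vdots \\ x^{(i)}_{n} \end{bmatrix}
\in \mathbb{R}^{n+1},
\qquad
X =
\begin{bmatrix}
(x^{(1)})^{T} \\ (x^{(2)})^{T} \\ \vdots \\ (x^{(m)})^{T}
\end{bmatrix}
\in \mathbb{R}^{m \times (n+1)},
\qquad
y =
\begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}
\in \mathbb{R}^{m}
\]
```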

As a concrete example, let's say I have only one feature, really, only one feature other than x0, which is always equal to 1. So if my feature vectors x(i) are equal to this 1, which is x0, and then some real feature, like maybe the size of the house, then my design matrix X would be equal to this. For the first row, I'm going to basically take this and take its transpose. So I'm going to end up with 1, and then x1 of example 1. For the second row, we're going to end up with 1 and then x1 of example 2, and so on down to 1 and then x1 of example m. And thus, this will be an m by 2-dimensional matrix. So, that's how to construct the matrix X. And the vector y, where sometimes I might write an arrow on top to denote that it is a vector, but very often I'll just write this as y, either way. The vector y is obtained by taking all the labels, all the correct prices of houses in my training set, and just stacking them up into an m-dimensional vector, and that's y. Finally, having constructed the matrix X and the vector y, we then just compute theta as X transpose X inverse, times X transpose y.

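Here is an example with a single feature. Apart from x0, which always equals 1, the design matrix X contains only one feature, x1, the size of the house. Each feature vector is x^(i) = [1, x1^(i)], and the design matrix X is built as follows.

Transpose the first example x^(1) to get the first row [1, x1^(1)], transpose the second example x^(2) to get the second row [1, x1^(2)], and continue down to the last example x^(m), whose row is [1, x1^(m)]. The resulting design matrix X is an m x 2 matrix. That is how to construct the design matrix X.

The vector y is sometimes written with an arrow on top to mark it as a vector, but it is usually written simply as y. It contains all the labels: the actual house prices of every training example are stacked into an m-dimensional vector. That is how to construct the vector y.

Finally, θ is computed from the matrix X and the vector y with the following formula:

θ = (X^T X)^-1 X^T y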

I just want to make sure that this equation makes sense to you and that you know how to implement it. So, concretely, what is this X transpose X inverse? Well, X transpose X inverse is the inverse of the matrix X transpose X. Concretely, if you were to set A to be equal to X transpose times X, so X transpose is a matrix, X transpose times X gives you another matrix, and we call that matrix A. Then X transpose X inverse is just: you take this matrix A and you invert it, right? This gives, let's say, A inverse. And so that's how you compute this thing. You compute X transpose X and then you compute its inverse.

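Let's pause to make sure the equation makes sense.

Concretely, what is (X^T X)^-1? It is the inverse of the matrix X^T X. Define A = X^T X. X is a matrix, X^T is a matrix, and so their product X^T X is also a matrix, which means A is a matrix. (X^T X)^-1 is the inverse of A, and writing it as A^-1 may make it easier to read. In other words, first compute X^T X, then compute its inverse.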

We haven't yet talked about Octave. We'll do so in a later set of videos, but in the Octave programming language, and also in the MATLAB programming language, which is very similar, the command to compute this quantity, X transpose X inverse times X transpose y, is as follows. In Octave, X prime is the notation that you use to denote X transpose. And so, this expression that's boxed in red, that's computing X transpose times X. pinv is a function for computing the inverse of a matrix, so this computes X transpose X inverse, and then you multiply that by X transpose, and you multiply that by y. So you end up computing that formula, which I didn't prove, but it is possible to show mathematically, even though I'm not going to do so here, that this formula gives you the optimal value of theta, in the sense that if you set theta equal to this, that's the value of theta that minimizes the cost function J of theta for linear regression.

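Octave has not been covered yet; it comes up later in the course. In Octave, and in the very similar MATLAB, the following command multiplies the inverse of X'X by X'y:

pinv(X'*X)*X'*y

X' : Octave notation for the transpose of X

pinv() : a function that computes the (pseudo-)inverse of a matrix

Here, pinv(X'*X) computes the inverse of X'X. Running this command gives the optimal value of the parameter θ. The proof is not given here, but it can be shown mathematically that this θ is optimal in the sense that it minimizes the linear regression cost function J(θ).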
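For readers without Octave, the same computation can be sketched in plain Python. This is not a port of `pinv`: as a simplifying assumption it solves the linear system (X^T X) θ = X^T y by Gauss-Jordan elimination with exact fractions, so it only works when X^T X is invertible. The function name `normal_equation` and the toy data are illustrative choices, not part of the course.

```python
from fractions import Fraction


def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y exactly; assumes X^T X is invertible."""
    cols = list(zip(*X))  # columns of X, i.e. rows of X^T
    n = len(cols)
    # Augmented matrix [X^T X | X^T y] with exact rational entries.
    M = [[Fraction(sum(a * b for a, b in zip(ci, cj))) for cj in cols]
         + [Fraction(sum(a * b for a, b in zip(ci, y)))] for ci in cols]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)  # nonzero pivot row
        M[c], M[p] = M[p], M[c]
        M[c] = [v / M[c][c] for v in M[c]]                # scale pivot to 1
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[-1] for row in M]


# Toy data lying exactly on the line y = 1 + 2 * x1 (first column is x0 = 1).
X = [[1, 1], [1, 2], [1, 3], [1, 4]]
y = [3, 5, 7, 9]
theta = normal_equation(X, y)
print(theta)  # [Fraction(1, 1), Fraction(2, 1)]
```

Because the toy data fits a line perfectly, the recovered θ is exactly the intercept 1 and slope 2. With a singular X^T X the pivot search fails, which is the case `pinv` is able to handle in Octave.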

One last detail. In the earlier video, I talked about feature scaling and the idea of getting features to be on similar scales, with similar ranges of values to each other. If you are using this normal equation method, then feature scaling isn't actually necessary, and it is actually okay if, say, some feature x1 is between zero and one, some feature x2 ranges from zero to one thousand, and some feature x3 ranges from zero to ten to the minus five. If you are using the normal equation method, this is okay and there is no need to do feature scaling, although of course if you are using gradient descent, then feature scaling is still important.

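The previous lecture covered feature scaling in detail. Feature scaling adjusts the features so that their values fall in similar ranges. The normal equation does not need it. For example, feature x1 may range from 0 to 1, feature x2 from 0 to 1,000, and feature x3 from 0 to 10^-5 (0.00001). It does not matter that the ranges differ this much. The normal equation needs no feature scaling, whereas gradient descent still does.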
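One way to convince yourself of this is to solve the same problem with and without rescaling a feature. The sketch below uses made-up data and a hypothetical helper `normal_equation` that solves (X^T X) θ = X^T y with exact fractions, assuming X^T X is invertible. Rescaling x2 changes θ2 by the inverse factor, but the predictions stay identical.

```python
from fractions import Fraction


def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y exactly; assumes X^T X is invertible."""
    cols = list(zip(*X))
    n = len(cols)
    M = [[Fraction(sum(a * b for a, b in zip(ci, cj))) for cj in cols]
         + [Fraction(sum(a * b for a, b in zip(ci, y)))] for ci in cols]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[p] = M[p], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[-1] for row in M]


def predict(theta, row):
    """Hypothesis h(x) = theta^T x for one example."""
    return sum(t * x for t, x in zip(theta, row))


# Made-up data: columns are x0 = 1, x1 in [0, 1], x2 in [0, 1000].
X = [[1, 0, 100], [1, 1, 200], [1, 0, 900], [1, 1, 500]]
y = [1, 2, 3, 5]
theta = normal_equation(X, y)

# Rescale x2 into [0, 1] and solve again.
X_scaled = [[x0, x1, Fraction(x2, 1000)] for x0, x1, x2 in X]
theta_scaled = normal_equation(X_scaled, y)

same_predictions = all(
    predict(theta, r) == predict(theta_scaled, rs) for r, rs in zip(X, X_scaled)
)
print(same_predictions)
```

Exact fractions are used here so the comparison is not blurred by rounding. In floating point, very different feature ranges can still affect numerical accuracy, which is a separate concern from the convergence problem that feature scaling solves for gradient descent.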

Finally, when should you use gradient descent and when should you use the normal equation method? Here are some of their advantages and disadvantages. Let's say you have m training examples and n features. One disadvantage of gradient descent is that you need to choose the learning rate alpha. And often, this means running it a few times with different learning rates and then seeing what works best. And so that is sort of extra work and extra hassle. Another disadvantage of gradient descent is that it needs many more iterations. So, depending on the details, that could make it slower, although there's more to the story, as we'll see in a second. As for the normal equation, you don't need to choose any learning rate alpha. So that makes it really convenient, makes it simple to implement. You just run it and it usually just works. And you don't need to iterate, so you don't need to plot J of theta or check the convergence or take all those extra steps. So far, the balance seems to favor the normal equation.

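Finally, let's sort out when to use gradient descent and when to use the normal equation. Each has advantages and disadvantages. Suppose there are m training examples and n features. The first disadvantage of gradient descent is that the learning rate α has to be chosen, which means running gradient descent several times to find the α that works best. Another disadvantage is that it needs many iterations. Depending on the details this can make it slower, though there is more to the story, as discussed shortly.

The normal equation needs no learning rate α. It is very convenient and simple to implement. There is no iteration either, so there is no need to plot the cost function J(θ) or check that it converges to the minimum. So far, the balance seems to favor the normal equation.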
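To make the contrast concrete, here is a minimal batch gradient descent sketch. It assumes the squared-error cost with the 1/m-scaled gradient from earlier in the course; the learning rate 0.1, the 5,000 iterations and the toy data are arbitrary choices made for this sketch. The data lies exactly on y = 1 + 2 * x1, so the exact answer is θ = (1, 2).

```python
def gradient_descent(X, y, alpha, num_iters):
    """Batch gradient descent for linear regression with squared-error cost."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(num_iters):
        # Prediction error h(x) - y for every training example.
        errors = [sum(t * x for t, x in zip(theta, row)) - yi
                  for row, yi in zip(X, y)]
        # Partial derivative of J with respect to each theta_j.
        grad = [sum(e * row[j] for e, row in zip(errors, X)) / m
                for j in range(n)]
        # Simultaneous update of all parameters.
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


X = [[1, 1], [1, 2], [1, 3], [1, 4]]  # first column is x0 = 1
y = [3, 5, 7, 9]
theta = gradient_descent(X, y, alpha=0.1, num_iters=5000)
print([round(t, 6) for t in theta])
```

Both `alpha` and `num_iters` had to be picked by hand, and a learning rate that is too large makes the updates diverge on this data. That tuning is exactly the extra work the normal equation avoids.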

Here are some disadvantages of the normal equation, and some advantages of gradient descent. Gradient descent works pretty well, even when you have a very large number of features. So, even if you have millions of features, you can run gradient descent and it will be reasonably efficient. It will do something reasonable. In contrast, with the normal equation, in order to solve for the parameters theta, we need to compute this term, X transpose X inverse. This matrix X transpose X is an n by n matrix, if you have n features. Because, if you look at the dimensions of X transpose and the dimensions of X, and you figure out what the dimension of the product is, the matrix X transpose X is an n by n matrix, where n is the number of features, and for most computer implementations the cost of inverting a matrix grows roughly as the cube of the dimension of the matrix. So, computing this inverse costs roughly order n cubed time. Sometimes it's slightly faster than n cubed, but it's close enough for our purposes. So if n, the number of features, is very large, then computing this quantity can be slow and the normal equation method can actually be much slower.

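Now for the disadvantages of the normal equation and the advantages of gradient descent. Gradient descent is efficient when there are many features; it works reasonably well even with millions of them. The normal equation, however, has to compute the inverse of X^T X to solve for θ. With n features, X^T X is an n x n matrix, as you can check by multiplying the dimensions of X^T and X. The time needed to invert a matrix grows roughly as the cube of its dimension, so computing the inverse costs about O(n^3). It can sometimes be slightly faster than cubic, but the difference is small. When the number of features is very large, the inversion becomes slow and the normal equation becomes much slower.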

So if n is large, then I might usually use gradient descent, because we don't want to pay this order n cubed time. But if n is relatively small, then the normal equation might give you a better way to solve for the parameters. What do small and large mean? Well, if n is on the order of a hundred, then inverting a hundred-by-hundred matrix is no problem by modern computing standards. If n is a thousand, I would still use the normal equation method. Inverting a thousand-by-thousand matrix is actually really fast on a modern computer. If n is ten thousand, then I might start to wonder. Inverting a ten-thousand-by-ten-thousand matrix starts to get kind of slow, and I might then start to maybe lean in the direction of gradient descent, but maybe not quite. With n equals ten thousand, you can sort of invert a ten-thousand-by-ten-thousand matrix. But if it gets much bigger than that, then I would probably use gradient descent. So, if n equals ten to the sixth, with a million features, then inverting a million-by-million matrix is going to be very expensive, and I would definitely favor gradient descent if you have that many features. So exactly how large the set of features has to be before you switch to gradient descent, it's hard to give a strict number. But for me, it is usually around ten thousand that I might start to consider switching over to gradient descent or maybe some other algorithms that we'll talk about later in this class.

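If the number of features n is large, use gradient descent, which avoids the cubic cost. If n is small, use the normal equation. So what counts as large and what counts as small?

By the standards of modern computers, the time needed for the normal equation looks roughly like this.

When n is on the order of 100, inverting a 100 x 100 matrix is no problem.

When n is on the order of 1,000, inverting a 1,000 x 1,000 matrix is still fast.

When n is on the order of 10,000, inverting a 10,000 x 10,000 matrix is slow but feasible. This is where to start weighing the options.

When n is on the order of 100,000, inverting a 100,000 x 100,000 matrix definitely takes a long time.

For n larger than this, gradient descent is more efficient. With a million features, inverting a 1,000,000 x 1,000,000 matrix is very expensive, and gradient descent is strongly recommended. It is hard to give a strict number of features at which to switch to gradient descent. Opinions differ somewhat among practitioners, but around ten thousand features is where gradient descent or other algorithms start to be considered.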
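A rough back-of-the-envelope calculation shows why these thresholds fall where they do. The sketch below assumes the inversion cost grows exactly as n^3 and ignores constants and hardware, so it gives only relative magnitudes, not timings. The helper name `relative_inversion_cost` and the baseline of n = 100 are arbitrary choices for illustration.

```python
def relative_inversion_cost(n, baseline=100):
    """Cost of inverting an n x n matrix relative to a baseline size,
    under the simplifying assumption that cost grows as n ** 3."""
    return n ** 3 // baseline ** 3


for n in (100, 1_000, 10_000, 100_000, 1_000_000):
    print(f"n = {n:>9,}: about {relative_inversion_cost(n):,}x the cost of n = 100")
```

Every tenfold increase in n multiplies the cost by a thousand, so going from n = 100 to n = 1,000,000 multiplies it by 10^12. That growth is why the normal equation stops being practical somewhere past ten thousand features.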

To summarize, so long as the number of features is not too large, the normal equation gives us a great alternative method to solve for the parameter theta. Concretely, so long as the number of features is less than 1000, I would usually use the normal equation method rather than gradient descent.

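To summarize, when the number of features is not too large, the normal equation is a good way to solve for the parameter θ. Concretely, when the number of features is less than 1,000, use the normal equation.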

To preview some ideas that we'll talk about later in this course, as we get to more complex learning algorithms, for example when we talk about classification algorithms like logistic regression, we'll see that the normal equation method actually does not work for those more sophisticated learning algorithms, and we will have to resort to gradient descent for those algorithms. So, gradient descent is a very useful algorithm to know, both for linear regression with a large number of features and for some of the other algorithms that we'll see in this course, because for them the normal equation method just doesn't apply and doesn't work. But for this specific model of linear regression, the normal equation can give you an alternative that can be much faster than gradient descent. So, depending on the details of your algorithm, depending on the details of the problem and how many features you have, both of these algorithms are well worth knowing about.

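As a preview of what comes later, the normal equation does not work for more complex algorithms such as logistic regression, a classification algorithm, so gradient descent has to be used for them. Gradient descent is therefore a very useful algorithm to know, both for linear regression with a very large number of features and for the other algorithms to which the normal equation does not apply. For the specific model of linear regression, however, the normal equation can be much faster than gradient descent. In the end, which of the two to use depends on the algorithm, the problem and the number of features.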

Andrew Ng's Machine Learning video lecture

Summary

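Gradient descent takes many steps and iterations for the cost function J(θ) to converge to the global minimum. In contrast, the normal equation finds the parameter θ at the minimum of J(θ) in one go. The formulas are as follows.

When the hypothesis is

hθ(x) = θ0x0 + θ1x1 + θ2x2 + θ3x3 + θ4x4 + ... = θ^T x

the formula for θ is

θ = (X^T X)^-1 X^T y

In Octave, θ is computed with the following code (the transpose X^T is written X').

pinv(X'*X)*X'*y

Also, the normal equation does not require feature scaling.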

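The advantages and disadvantages of gradient descent are as follows.

1) The learning rate α must be chosen.

This means running gradient descent several times with different learning rates and picking the one that works best.

2) It needs many iterations.

It takes many steps to reach the optimal value.

3) It works well even when the number of features is large.

It remains reasonably efficient even with millions of features.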

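The advantages and disadvantages of the normal equation are as follows.

1) There is no learning rate α to choose.

2) There is no iteration, only a direct computation.

3) Computing the inverse matrix takes a long time when n is large.

The cost of inversion grows roughly as n^3.

For linear regression, which should you choose, gradient descent or the normal equation? Once the number of features exceeds about 10,000, it is worth thinking it over, because inverting a 10,000 x 10,000 matrix starts to get slow. Around ten thousand features is where gradient descent, or the algorithms covered later, come into consideration.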