by 라인하트 Oct 22. 2020

Andrew Ng's Machine Learning (7-3): Regularized Linear Regression

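Professor Andrew Ng, co-founder of the online learning platform Coursera, is a leading figure in the artificial intelligence field. The introductory machine learning course he taught at Stanford University is available for free on Coursera (Coursera.org). It is an essential course for machine learning beginners, and one that anyone studying AI and machine learning on their own naturally runs into.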

Regularization

Solving the Problem of Overfitting

Regularized Linear Regression

For linear regression, we have previously worked out two learning algorithms. One based on gradient descent and one based on the normal equation. In this video, we'll take those two algorithms and generalize them to the case of regularized linear regression.

The two learning algorithms for linear regression are gradient descent and the normal equation. This lecture extends both of them to regularized linear regression.

Here's the optimization objective that we came up with last time for regularized linear regression. This first part is our usual objective for linear regression. And we now have this additional regularization term, where lambda is our regularization parameter, and we'd like to find parameters theta that minimize this cost function, this regularized cost function, J of theta.

We start from the cost function J(θ) and the optimization objective for regularized linear regression.
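For reference, the regularized cost function can be written as below. This is a reconstruction in the course's usual notation (m training examples, n features, hypothesis h_θ), since the slide itself is not reproduced in this post.

```latex
J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
          + \lambda\sum_{j=1}^{n}\theta_j^2\right],
\qquad \min_{\theta} J(\theta)
```

Note that the penalty sum starts at j = 1, so θ0 is not included.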

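The first term of the regularized cost function J(θ) is the usual linear regression objective, and the second is the regularization term, which contains the regularization parameter λ. The goal is to find the parameters θ that minimize the regularized cost function J(θ).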
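A minimal Python sketch of this two-term cost may help. The function names `hypothesis` and `regularized_cost` and the toy data are made up for illustration and do not come from the lecture. The first feature of every example is assumed to be the constant 1.

```python
def hypothesis(theta, x):
    """Linear hypothesis h_theta(x) = theta^T x, where x[0] is the constant feature 1."""
    return sum(t * xi for t, xi in zip(theta, x))


def regularized_cost(theta, X, y, lam):
    """J(theta) = (1/2m) * [sum of squared errors + lam * sum of theta_j^2 for j >= 1]."""
    m = len(y)
    squared_errors = sum((hypothesis(theta, x) - yi) ** 2 for x, yi in zip(X, y))
    penalty = lam * sum(t ** 2 for t in theta[1:])  # theta[0] is not penalized
    return (squared_errors + penalty) / (2 * m)


# Toy data (made up): y = 1 + 2*x exactly, so theta = [1, 2] has zero squared error.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]
theta = [1.0, 2.0]

cost_without_penalty = regularized_cost(theta, X, y, lam=0.0)
cost_with_penalty = regularized_cost(theta, X, y, lam=3.0)
print(cost_without_penalty, cost_with_penalty)
```

With λ = 0 only the squared-error term remains. With λ > 0 the same θ costs more, because θ1 now contributes to the penalty.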

Previously, we were using gradient descent for the original cost function without the regularization term. And we had the following algorithm for regular linear regression: without regularization, we would repeatedly update the parameters theta j as follows, for j equals 0, 1, 2, up through n.

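This is gradient descent for the cost function without the regularization term.

It is the update rule for ordinary linear regression without regularization. The parameters θj are updated simultaneously for j = 0, 1, 2, ..., n.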
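Written out, the unregularized update looks like this. It is a reconstruction in the course's usual notation, with learning rate α, since the slide is not reproduced here.

```latex
\text{Repeat } \{ \quad
\theta_j := \theta_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}
            \left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}
\qquad (j = 0, 1, 2, \dots, n) \quad \}
```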

Let me take this and just write the case for theta 0 separately. So I'm just going to write the update for theta 0 separately from the update for the parameters theta 1, theta 2, theta 3, and so on up to theta n. I haven't changed anything yet, right. And the reason I want to do this is you may remember that for our regularized linear regression, we penalize the parameters theta 1, theta 2, and so on up to theta n. But we don't penalize theta 0. So, when we modify this algorithm for regularized linear regression, we're going to end up treating theta 0 slightly differently.

The update for θ0 is written separately from the updates for θj with j = 1, 2, 3, ..., n. The reason is that regularized linear regression penalizes θ1, θ2, θ3, ..., θn but not θ0, so the regularized version of gradient descent treats θ0 slightly differently.

Concretely, if we want to take this algorithm and modify it to use the regularized objective, all we need to do is take this term at the bottom and modify it as follows. We'll take this term and add lambda over m times theta j. And if you implement this, then you have gradient descent for trying to minimize the regularized cost function, J of theta. And concretely, I'm not gonna do the calculus to prove it, but if you look at this term, this term that I've written in square brackets, if you know calculus it's possible to prove that that term is the partial derivative with respect to theta j of J of theta, using the new definition of J of theta with the regularization term. And similarly, this term up on top, around which I'm drawing the cyan box, that's still the partial derivative with respect to theta 0 of J of theta.

If you look at the update for theta j, it's possible to show something very interesting. Concretely, theta j gets updated as theta j minus alpha times, and then you have this other term here that depends on theta j. So if you group all the terms together that depend on theta j, you can show that this update can be written equivalently as follows. All I did here is take theta j, which is theta j times 1, and this term, lambda over m times theta j. There's also an alpha here, so you end up with alpha lambda over m multiplied into theta j.

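To minimize the regularized objective, we modify gradient descent by adding a regularization term, (λ/m)θj, to the update rule for θj. The result is gradient descent on the regularized cost function J(θ). The added term is the partial derivative with respect to θj of the regularization term in J(θ). The lecture states this without working through the calculus.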
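The modified algorithm can be written as follows. Again, this is a reconstruction in the course's usual notation rather than a copy of the slide.

```latex
\begin{aligned}
\theta_0 &:= \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}
            \left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)} \\
\theta_j &:= \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}
            \left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}
            + \frac{\lambda}{m}\theta_j\right]
\qquad (j = 1, 2, \dots, n)
\end{aligned}
```

The bracketed expression is the partial derivative of the regularized J(θ) with respect to θj. The θ0 line is unchanged from the unregularized algorithm.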

The regularized update for θj reveals something interesting. Grouping the terms that depend on θj gives an equivalent form of the update.
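The grouping step is short enough to write out in full. Starting from the regularized update for θj with j ≥ 1, distribute α over the bracket and collect the θj terms:

```latex
\begin{aligned}
\theta_j &:= \theta_j - \alpha\frac{\lambda}{m}\theta_j
            - \alpha\,\frac{1}{m}\sum_{i=1}^{m}
              \left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} \\
         &= \theta_j\left(1 - \alpha\frac{\lambda}{m}\right)
            - \alpha\,\frac{1}{m}\sum_{i=1}^{m}
              \left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}
\end{aligned}
```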

And this term here, 1 minus alpha times lambda over m, is a pretty interesting term. It has a pretty interesting effect. Concretely this term, 1 minus alpha times lambda over m, is going to be a number that is usually a little bit less than one, because alpha times lambda over m is going to be positive, and usually if your learning rate is small and if m is large, this is usually pretty small. So this term here is gonna be a number that's usually a little bit less than 1, so think of it as a number like 0.99, let's say. And so the effect of our update to theta j is, we're going to say that theta j gets replaced by theta j times 0.99, right? So theta j times 0.99 has the effect of shrinking theta j a little bit towards zero. So this makes theta j a bit smaller. And more formally, this makes the squared norm of theta j a little bit smaller.

The first term of the regularized update for θj is the interesting one.

The factor (1 - αλ/m) is usually slightly less than 1: αλ/m is positive, and because the learning rate α is usually small and m is usually large, αλ/m is usually small. Think of the factor as a number like 0.99. The update then first replaces θj with θj * 0.99, which shrinks θj slightly toward zero.
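A quick numeric check makes the shrinking effect concrete. The values of α, λ, and m below are assumptions, chosen only so that the factor comes out near the 0.99 used in the lecture.

```python
alpha = 0.1    # learning rate (assumed value)
lam = 10.0     # regularization parameter lambda (assumed value)
m = 100        # number of training examples (assumed value)

shrink = 1 - alpha * lam / m   # the factor (1 - alpha*lambda/m)

# Apply only the shrink part of the update, ignoring the gradient term,
# to see its effect in isolation.
theta_j = 5.0
for _ in range(100):
    theta_j *= shrink

print(shrink)    # approximately 0.99
print(theta_j)   # smaller than the starting value 5.0, still positive
```

A single step barely changes θj, but the effect compounds over many iterations. In the real algorithm the gradient term pushes back, so θj settles where the two effects balance.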

And then after that, the second term here, that's actually exactly the same as the original gradient descent update that we had, before we added all this regularization stuff. So, hopefully this update makes sense. When we're using regularized linear regression, what we're doing on every iteration is multiplying theta j by a number that's a little bit less than one, so it's shrinking the parameter a little bit, and then we're performing a similar update as before. Of course that's just the intuition behind what this particular update is doing. Mathematically what it's doing is it's exactly gradient descent on the cost function J of theta that we defined on the previous slide that uses the regularization term.

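The second term of the regularized update for θj is identical to the original gradient descent update from before regularization was added. So on every iteration, regularized linear regression multiplies θj by a number slightly less than 1, shrinking the parameter a little, and then performs the same kind of update as before. That is the intuition. Mathematically, the update is exactly gradient descent on the regularized cost function J(θ).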
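Putting the pieces together, here is a small sketch of regularized gradient descent in plain Python. The function names `gradient_step` and `fit`, the toy data, and the hyperparameter values are all made up for illustration. It uses the "shrink, then usual update" form for θj and leaves θ0 unshrunk.

```python
def gradient_step(theta, X, y, alpha, lam):
    """One simultaneous update of all parameters for regularized linear regression."""
    m = len(y)
    # Prediction errors h_theta(x) - y, computed once with the current theta.
    errors = [sum(t * xi for t, xi in zip(theta, x)) - yi for x, yi in zip(X, y)]
    new_theta = []
    for j in range(len(theta)):
        grad = sum(e * x[j] for e, x in zip(errors, X)) / m
        if j == 0:
            new_theta.append(theta[j] - alpha * grad)           # theta_0: no shrinkage
        else:
            shrink = 1 - alpha * lam / m                         # slightly below 1
            new_theta.append(theta[j] * shrink - alpha * grad)   # shrink, then usual step
    return new_theta


def fit(X, y, alpha, lam, iterations):
    """Run gradient descent from theta = 0 for a fixed number of iterations."""
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        theta = gradient_step(theta, X, y, alpha, lam)
    return theta


# Toy data (made up): y = 1 + 2*x, with x[0] = 1 as the constant feature.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

theta_plain = fit(X, y, alpha=0.1, lam=0.0, iterations=5000)
theta_reg = fit(X, y, alpha=0.1, lam=4.0, iterations=5000)
print(theta_plain, theta_reg)
```

With λ = 0 the fit recovers the line the data was generated from. With λ = 4 the slope θ1 ends up noticeably smaller, which is the shrinkage the lecture describes.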

Gradient descent was just one of our two algorithms for fitting a linear regression model. The second algorithm was the one based on the normal equation, where what we did was we created the design matrix X where each row corresponded to a separate training example. And we created a vector y, so this is a vector, that's an m dimensional vector. And that contained the labels from my training set. So whereas X is an m by (n+1) dimensional matrix, y is an m dimensional vector. And in order to minimize the cost function J, we found that one way to do so is to set theta to be equal to this. Right, you have X transpose X, inverse, X transpose Y. I'm leaving room here to fill in stuff of course. And what this value for theta does is this minimizes the cost function J of theta, when we were not using regularization.

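The other way to fit a linear regression model is the normal equation. We build the design matrix X, in which each row is one training example, and a vector y. The vector y is m-dimensional and holds the labels of the training set, while X is an m × (n+1) matrix. The normal equation solves directly for the parameters θ that minimize the cost function J(θ).

The lecture leaves room inside the formula to add a term later. Without regularization, this value of θ minimizes the unregularized cost function J(θ).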
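For reference, the unregularized normal equation described in the transcript, together with the shapes of X and y, can be written as below. The layout of X, with a leading column of ones for x0, follows the course's convention.

```latex
X = \begin{bmatrix} (x^{(1)})^{T} \\ \vdots \\ (x^{(m)})^{T} \end{bmatrix}
    \in \mathbb{R}^{m \times (n+1)},
\qquad
y = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^{m},
\qquad
\theta = \left(X^{T}X\right)^{-1}X^{T}y
```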

Now that we are using regularization, if you were to derive what the minimum is, and just to give you a sense of how to derive the minimum, the way you derive it is you take partial derivatives with respect to each parameter. Set this to zero, and then do a bunch of math and you can then show that it's a formula like this that minimizes the cost function. And concretely, if you are using regularization, then this formula changes as follows. Inside this parenthesis, you end up with a matrix like this. 0, 1, 1, 1, and so on, 1, until the bottom. So this thing over here is a matrix whose upper left-most entry is 0. There are ones on the diagonals, and then zeros everywhere else in this matrix. Because I'm drawing this rather sloppily.

But as an example, if n = 2, then this matrix is going to be a three by three matrix. More generally, this matrix is an (n+1) by (n+1) dimensional matrix. So if n = 2, then that matrix becomes something that looks like this. It would be 0, and then 1s on the diagonal, and then 0s everywhere off the diagonal. And once again, I'm not going to show this derivation, which is frankly somewhat long and involved, but it is possible to prove that if you are using the new definition of J of theta, with the regularization objective, then this new formula for theta is the one that will give you the global minimum of J of theta.

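Now apply regularization to the normal equation. To derive the minimum, take the partial derivative of J(θ) with respect to each parameter, set each one to zero, and work through the algebra. This yields a closed-form formula for the θ that minimizes the regularized cost function. The formula is the unregularized one with λ times an extra matrix added inside the parentheses.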
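In symbols, the regularized normal equation described in the transcript is the following. It is reconstructed from the spoken description, since the slide is not reproduced here.

```latex
\theta = \left(X^{T}X + \lambda
\begin{bmatrix}
0 &   &        &   \\
  & 1 &        &   \\
  &   & \ddots &   \\
  &   &        & 1
\end{bmatrix}
\right)^{-1} X^{T}y
```

The matrix multiplied by λ is (n+1) × (n+1), and all entries not shown are 0.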

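The matrix inside the parentheses has 0, 1, 1, 1, ..., 1 down its diagonal: the upper-left entry is 0, the remaining diagonal entries are 1, and every other entry is 0. In general it is an (n+1) × (n+1) matrix, so for n = 2 it is 3 × 3.

For n = 2 the diagonal is 0, 1, 1 and all other entries are 0. The lecture skips the derivation, but it is possible to prove that, under the new definition of J(θ) with the regularization term, this formula for θ gives the global minimum of J(θ).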
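A short NumPy sketch of this closed-form solution follows. The function name `regularized_normal_equation` and the toy data are made up for illustration. It calls `np.linalg.solve` instead of forming the inverse explicitly, which is a design choice of this sketch and not something the lecture prescribes.

```python
import numpy as np


def regularized_normal_equation(X, y, lam):
    """Solve (X^T X + lam * M) theta = X^T y, where M is the identity with M[0, 0] = 0."""
    n_params = X.shape[1]   # n + 1, including the constant feature
    M = np.eye(n_params)
    M[0, 0] = 0.0           # do not penalize theta_0
    return np.linalg.solve(X.T @ X + lam * M, X.T @ y)


# Toy data (made up): y = 1 + 2*x, with a leading column of ones for x0.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta_plain = regularized_normal_equation(X, y, lam=0.0)
theta_reg = regularized_normal_equation(X, y, lam=4.0)
print(theta_plain, theta_reg)
```

Setting λ = 0 reduces this to the ordinary normal equation. A positive λ pulls θ1 toward zero, matching what gradient descent on the regularized cost converges to.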

So finally I want to just quickly describe the issue of non-invertibility. This is relatively advanced material, so you should consider this as optional. And feel free to skip it, or if you listen to it and possibly it doesn't really make sense, don't worry about it either. But earlier when I talked about the normal equation method, we also had an optional video on the non-invertibility issue. So this is another optional part to this, sort of an add-on to that earlier optional video on non-invertibility.

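Finally, a quick look at the non-invertibility problem. This is relatively advanced material and is optional: feel free to skip it, and don't worry if it doesn't fully make sense. The earlier lecture on non-invertibility in the normal equation was optional too.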

Now, consider a setting where m, the number of examples, is less than or equal to n, the number of features. If you have fewer examples than features, then this matrix, X transpose X, will be non-invertible, or singular. Or the other term for this is the matrix will be degenerate. And if you implement this in Octave anyway and you use the pinv function to take the pseudo inverse, it will kind of do the right thing, but it's not clear that it would give you a very good hypothesis, even though numerically the Octave pinv function will give you a result that kinda makes sense. But if you were doing this in a different language, and if you were taking just the regular inverse, which in Octave is denoted with the function inv, we're trying to take the regular inverse of X transpose X. Then in this setting, you find that X transpose X is singular, is non-invertible, and if you're doing this in a different programming language and using some linear algebra library to try to take the inverse of this matrix, it just might not work because that matrix is non-invertible or singular.

Fortunately, regularization also takes care of this for us. And concretely, so long as the regularization parameter lambda is strictly greater than 0, it is actually possible to prove that this matrix, X transpose X plus lambda times this funny matrix here, it is possible to prove that this matrix will not be singular and that this matrix will be invertible. So using regularization also takes care of any non-invertibility issues of the X transpose X matrix as well.

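Consider the case where the number of training examples m is less than or equal to the number of features n. Then X'X, the product of the transpose X' and X, is non-invertible, also called singular or degenerate. If you implement this in Octave with pinv(), the computation goes through, but it is not clear that the result is a very good hypothesis, even though pinv() returns a result that numerically makes sense. If you instead take the regular inverse, inv(X'X) in Octave or its equivalent in another programming language, you run into the fact that X'X is singular, and the linear algebra library may fail to compute the inverse.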

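Fortunately, regularization takes care of the non-invertibility problem. As long as the regularization parameter λ is strictly greater than 0, it is possible to prove that the matrix X'X + λ * (regularization matrix) is not singular, so it is invertible. Adding the λ * (regularization matrix) term therefore resolves the non-invertibility of X'X.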
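The effect can be observed numerically with NumPy. This is only an illustration on made-up data with m = 2 examples and n = 3 features. It checks matrix ranks with `np.linalg.matrix_rank` and does not prove the general claim.

```python
import numpy as np

# m = 2 examples, n = 3 features plus the constant column x0 = 1, so m <= n.
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [1.0, 5.0, 6.0, 8.0]])
y = np.array([1.0, 2.0])

A = X.T @ X                              # 4 x 4, but its rank is at most m = 2
rank_plain = np.linalg.matrix_rank(A)

lam = 1.0                                # any lambda > 0
M = np.eye(4)
M[0, 0] = 0.0                            # theta_0 is not penalized
A_reg = A + lam * M
rank_reg = np.linalg.matrix_rank(A_reg)

# A_reg has full rank, so the regularized normal equation has a unique solution.
theta = np.linalg.solve(A_reg, X.T @ y)
print(rank_plain, rank_reg, theta)
```

Here X'X is rank-deficient, while the regularized matrix has full rank and can be solved with an ordinary linear solver.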

So you now know how to implement regularized linear regression. Using this you'll be able to avoid overfitting even if you have lots of features in a relatively small training set. And this should let you get linear regression to work much better for many problems. In the next video we'll take this regularization idea and apply it to logistic regression. So that you'd be able to get logistic regression to avoid overfitting and perform much better as well.

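That covers regularized linear regression. It helps avoid the overfitting that arises when the training set is relatively small but the number of features is relatively large, and it lets linear regression work much better on many learning problems. The next lecture covers regularized logistic regression, where regularization likewise prevents overfitting and improves performance.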

Andrew Ng's machine learning video lecture

Summary

In practice we cannot choose in advance which features to regularize, so we penalize all of the parameters. A simple way to penalize every term is to add a regularization term to the cost function J(θ).

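By convention, the penalty starts at θ1 and θ0 is not penalized. θ0 is the intercept term, the coefficient of the constant feature x0 = 1. The regularization parameter λ (lambda) controls the trade-off between two goals: minimizing the original cost so that the model fits the training data, and keeping the parameters θ small.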

Regularized linear regression therefore separates the update for θ0 from the updates for θ1, θ2, θ3, ..., θn, because θ0 is not penalized.

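With the unregularized cost function J(θ) in the derivative term, differentiation gives the original update rule.

For regularized linear regression, the derivative term uses the regularized cost function J(θ) instead, which already includes the regularization term. The factor (1 - αλ/m) is slightly less than 1, because αλ/m is positive, α is usually small, and m is usually large, so think of it as a number like 0.99. Each gradient descent step on the regularized cost function therefore multiplies θj by about 0.99 before applying the usual gradient term, leaving θj slightly smaller than the unregularized update would.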