by 라인하트, Nov 26, 2020

Andrew Ng's Machine Learning (14-3): PCA and Dimensionality Reduction

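Professor Andrew Ng, co-founder of the online course platform Coursera, is a leading figure in the artificial intelligence field. The machine learning course he taught to beginners at Stanford University is available for free on Coursera (Coursera.org). It is an essential course for anyone starting out in machine learning, and one you will naturally come across when studying AI and machine learning on your own.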

Dimensionality Reduction

Principal Component Analysis

PCA Problem Formulation

For the problem of dimensionality reduction, by far the most popular and most commonly used algorithm is something called principal components analysis, or PCA. In this video, I'd like to start talking about the problem formulation for PCA. In other words, let's try to formulate precisely what we would like PCA to do.

The most popular algorithm for dimensionality reduction is principal component analysis (PCA). This lecture introduces PCA and formulates precisely what we want it to do.

Let's say we have a data set like this. So, this is a data set of examples x in R^2, and let's say I want to reduce the dimension of the data from two-dimensional to one-dimensional. In other words, I would like to find a line onto which to project the data. So what seems like a good line onto which to project the data? A line like this might be a pretty good choice. And the reason we think this might be a good choice is that if you look at where the projected versions of the points lie, so I take this point and project it down here, and this point gets projected here, to here, to here, to here, what we find is that the distance between each point and its projected version is pretty small. That is, these blue line segments are pretty short. So what PCA does formally is it tries to find a lower dimensional surface, really a line in this case, onto which to project the data so that the sum of squares of these little blue line segments is minimized. The length of those blue line segments is sometimes also called the projection error. And so what PCA does is it tries to find a surface onto which to project the data so as to minimize that. As an aside, before applying PCA, it's standard practice to first perform mean normalization and feature scaling so that the features x1 and x2 have zero mean and comparable ranges of values. I've already done this for this example, but I'll come back to this later and talk more about feature scaling and mean normalization in the context of PCA.

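Here is a data set whose examples x lie in R^2, and we want to reduce it from two dimensions to one. That means finding a good line onto which to project the data. The red line in the figure is a pretty good choice: when every point is projected onto it, the blue segments connecting each point to its projection are very short. Formally, PCA finds the lower-dimensional surface (here, a line) that minimizes the sum of the squared lengths of those blue segments. The length of a blue segment is called the projection error, and PCA picks the surface that minimizes it. Before applying PCA, first perform mean normalization and feature scaling so that features x1 and x2 have zero mean and comparable ranges. Feature scaling and normalization for PCA are covered in more detail later.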
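As a rough illustration of the projection error described above, here is a minimal Python sketch. The data values, the helper names `mean_normalize` and `projection_error`, and the choice of a 45-degree line are all assumptions made for this example; none of them come from the lecture. Only mean normalization is shown, since the two made-up features already have comparable ranges.

```python
import math

def mean_normalize(points):
    """Shift both features so each has zero mean (range scaling is omitted here)."""
    m = len(points)
    mean_x1 = sum(p[0] for p in points) / m
    mean_x2 = sum(p[1] for p in points) / m
    return [(p[0] - mean_x1, p[1] - mean_x2) for p in points]

def projection_error(points, angle_deg):
    """Sum of squared orthogonal distances to the line through the origin at angle_deg."""
    theta = math.radians(angle_deg)
    u = (math.cos(theta), math.sin(theta))   # unit vector along the line
    total = 0.0
    for x1, x2 in points:
        z = x1 * u[0] + x2 * u[1]            # position of the projection along the line
        p1, p2 = z * u[0], z * u[1]          # projected point in the original 2D space
        total += (x1 - p1) ** 2 + (x2 - p2) ** 2   # squared length of one "blue segment"
    return total

# Made-up 2D data, roughly spread along a diagonal (hypothetical values).
raw = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.1), (4.0, 3.8), (5.0, 5.0)]
data = mean_normalize(raw)
error_45 = projection_error(data, 45)
print(f"projection error onto the 45-degree line: {error_45:.4f}")
```

The quantity PCA minimizes is exactly this kind of sum, taken over all candidate lines rather than one hand-picked line.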

But coming back to this example, in contrast to the red line that I just drew, here's a different line onto which I could project my data, which is this magenta line. And, as we'll see, this magenta line is a much worse direction onto which to project my data, right? So if I were to project my data onto the magenta line, we'd get a set of points like that. And the projection errors, that is these blue line segments, will be huge. So these points have to move a huge distance in order to get projected onto the magenta line. And so that's why PCA, principal components analysis, will choose something like the red line rather than the magenta line down here.

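Back to the figure. In contrast to the red line just drawn, consider a different line onto which the data could be projected: the magenta line. It is a much worse choice, because the projection errors become huge. In other words, the points have to move a very long way to land on the magenta line. That is why PCA chooses something like the red line rather than the magenta line.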
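The red-versus-magenta comparison can be sketched numerically. In the snippet below, the points, the 45-degree "red" direction, and the 135-degree "magenta" direction are hypothetical stand-ins for the figure, and the brute-force sweep over whole-degree angles is only an illustration of "pick the line with the smallest error", not the way PCA is actually computed.

```python
import math

def projection_error(points, angle_deg):
    """Sum of squared orthogonal distances to the line through the origin at angle_deg."""
    theta = math.radians(angle_deg)
    u = (math.cos(theta), math.sin(theta))
    err = 0.0
    for x1, x2 in points:
        z = x1 * u[0] + x2 * u[1]
        # Pythagoras: squared distance to the line = squared length minus squared projection.
        err += (x1 * x1 + x2 * x2) - z * z
    return err

# Hypothetical, already mean-normalized points lying close to a diagonal.
points = [(-2.0, -1.8), (-1.0, -1.1), (0.0, 0.1), (1.0, 0.8), (2.0, 2.0)]

red = projection_error(points, 45)        # direction roughly along the data
magenta = projection_error(points, 135)   # direction roughly across the data
best_angle = min(range(180), key=lambda a: projection_error(points, a))

print(f"red: {red:.2f}, magenta: {magenta:.2f}, best whole-degree angle: {best_angle}")
```

With data like this, the error for the direction across the cloud is orders of magnitude larger than for the direction along it, which matches the intuition from the figure.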

Let's write out the PCA problem a little more formally. The goal of PCA, if we want to reduce data from two-dimensional to one-dimensional, is that we're going to try to find a vector u(1), which is going to be in R^n, so that would be R^2 in this case. I'm going to find the direction onto which to project the data so as to minimize the projection error. So, in this example I'm hoping that PCA will find this vector, which I want to call u(1), so that when I project the data onto the line that I define by extending out this vector, I end up with pretty small reconstruction errors for data that looks like this. And by the way, I should mention that whether PCA gives me u(1) or -u(1) doesn't matter. So if it gives me a positive vector in this direction, that's fine. If it gives me the opposite vector facing in the opposite direction, that would be minus u(1). Let's draw that in blue instead, right? Whether it gives a positive u(1) or a negative u(1) doesn't matter, because each of these vectors defines the same red line onto which I'm projecting my data. So this is a case of reducing data from two-dimensional to one-dimensional.

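Let's state the PCA problem more formally. To reduce data from two dimensions to one, the goal of PCA is to find a vector u^(1). In general u^(1) is a vector in R^n; in this figure it is a vector in R^2. It gives the direction that minimizes the projection error: when the data is projected onto the line obtained by extending u^(1), the reconstruction error is very small. It does not matter whether PCA returns u^(1) or -u^(1). The vector pointing the opposite way, drawn in blue, is -u^(1), and both vectors define the same red line onto which the data is projected. This is the case of reducing two-dimensional data to one dimension.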
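To see why the sign of u^(1) does not matter, here is a small sketch with made-up numbers. The helper `project_onto_line`, the unit vector (0.6, 0.8), and the point (2, 1) are illustrative assumptions.

```python
def project_onto_line(point, u):
    """Project a 2D point onto the line spanned by unit vector u.

    Returns (z, projected): z is the 1D coordinate along u, and projected is
    the corresponding point back in the original 2D space.
    """
    z = point[0] * u[0] + point[1] * u[1]
    return z, (z * u[0], z * u[1])

u1 = (0.6, 0.8)          # hypothetical unit vector (0.36 + 0.64 = 1)
neg_u1 = (-0.6, -0.8)    # the same line, opposite direction
x = (2.0, 1.0)

z_pos, p_pos = project_onto_line(x, u1)
z_neg, p_neg = project_onto_line(x, neg_u1)

print(p_pos, p_neg)   # the projected points coincide
print(z_pos, z_neg)   # only the sign of the 1D coordinate flips
```

The projected point, and therefore the projection error, is the same for both vectors. The only thing that changes is the sign of the one-dimensional coordinate.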

In the more general case we have n-dimensional data and we'll want to reduce it to k dimensions. In that case we want to find not just a single vector onto which to project the data, but k directions onto which to project the data, so as to minimize this projection error. So here's the example. If I have a 3D point cloud like this, then maybe what I want to do is find a pair of vectors. Let's draw these in red. I'm going to find a pair of vectors, extending from the origin. Here's u(1), and here's my second vector, u(2). And together, these two vectors define a plane, or a 2D surface, right? Like this, a 2D surface onto which I am going to project my data. For those of you who are familiar with linear algebra, for those of you who are really experts in linear algebra, the formal definition of this is that we are going to find a set of vectors u(1), u(2), maybe up to u(k). And what we're going to do is project the data onto the linear subspace spanned by this set of k vectors. But if you're not familiar with linear algebra, just think of it as finding k directions instead of just one direction onto which to project the data. So finding a k-dimensional surface is really finding a 2D plane in this case, shown in this figure, where we can define the position of the points in the plane using k directions. And that's why for PCA we want to find k vectors onto which to project the data.

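More generally, PCA reduces n-dimensional data to k dimensions. Instead of finding a single vector onto which to project the data, PCA finds k directions that minimize the projection error. For example, the figure on the right shows a 3D point cloud. We find a pair of vectors, drawn in red, both extending from the origin. These two vectors, u^(1) and u^(2), reduce the data from three dimensions to two: together they define a plane, a 2D surface, onto which the data is projected. For readers familiar with linear algebra, the formal definition is to find vectors u^(1), u^(2), ..., u^(k) and project the data onto the linear subspace spanned by this set of k vectors. For readers not familiar with linear algebra, think of it as finding k directions instead of one direction onto which to project the data. In this figure we are finding a k-dimensional surface with k = 2, that is, a 2D plane, and the position of each point in the plane is described using the k directions. This is why PCA finds k vectors onto which to project the data.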
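The 3D-to-2D case can be sketched as follows. This is a minimal example under one explicit assumption that the passage above does not state: the vectors u^(1) and u^(2) are taken to be orthonormal, so projecting onto the plane they span reduces to dot products. The vectors, the point, and the helper names are made up for illustration.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def project_onto_subspace(x, basis):
    """Project x onto the subspace spanned by an orthonormal basis.

    Returns (z, x_approx): z holds the k coordinates of x along the basis
    vectors, and x_approx is the projected point in the original space.
    """
    z = [dot(x, u) for u in basis]
    x_approx = [sum(zj * u[i] for zj, u in zip(z, basis)) for i in range(len(x))]
    return z, x_approx

# Hypothetical orthonormal pair defining a plane in 3D (assumption for this sketch).
u1 = (1.0, 0.0, 0.0)
u2 = (0.0, 0.6, 0.8)
x = (3.0, 1.0, 2.0)

z, x_approx = project_onto_subspace(x, [u1, u2])
residual = [xi - ai for xi, ai in zip(x, x_approx)]
error_sq = dot(residual, residual)   # squared projection error for this one point

print("2D coordinates z:", z)
print("projected point:", x_approx)
print("squared projection error:", round(error_sq, 6))
```

The residual is perpendicular to both u^(1) and u^(2), which is what makes this an orthogonal projection onto the plane.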

And so more formally in PCA, what we want to do is find this way to project the data so as to minimize the projection distance, which is the distance between the points and their projections. And so in this 3D example too, given a point we would take the point and project it onto this 2D surface. And so the projection error would be the distance between the point and where it gets projected down to on my 2D surface. And so what PCA does is it tries to find the line, or plane, or whatever, onto which to project the data, to try to minimize that squared projection error, that 90 degree or orthogonal projection error.

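Formally, PCA looks for the way of projecting the data that minimizes the projection error, that is, the distance between each point and its projection. The same holds in the 3D example: each point is projected onto the 2D surface, and the projection error is the distance between the point and where it lands on that surface. So PCA finds the line, plane, or other surface onto which to project the data so as to minimize the squared orthogonal (90-degree) projection error.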
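One common way to write this objective down is shown below. The notation is an assumption layered on top of the lecture's wording: m denotes the number of examples, x^(i) the i-th mean-normalized example, x_approx^(i) its projection, and the vectors u^(1), ..., u^(k) are assumed to be orthonormal.

```latex
\min_{u^{(1)},\dots,u^{(k)}} \; \frac{1}{m}\sum_{i=1}^{m}
  \left\lVert x^{(i)} - x^{(i)}_{\text{approx}} \right\rVert^{2},
\qquad
x^{(i)}_{\text{approx}} = \sum_{j=1}^{k} \left( u^{(j)\top} x^{(i)} \right) u^{(j)}
```

Each term of the sum is the squared length of one blue segment in the figures, so minimizing the average is the same as minimizing the sum of squared projection errors.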

Finally, one question I sometimes get asked is how does PCA relate to linear regression? Because when explaining PCA, I sometimes end up drawing diagrams like these, and that looks a little bit like linear regression. It turns out PCA is not linear regression, and despite some cosmetic similarity, these are actually totally different algorithms. If we were doing linear regression, as on the left, we would be trying to predict the value of some variable y given some input features x. And so in linear regression, what we're doing is fitting a straight line so as to minimize the squared error between the points and this straight line. And so what we're minimizing would be the squared magnitude of these blue lines. And notice that I'm drawing these blue lines vertically: these blue lines are the vertical distance between the point and the value predicted by the hypothesis. Whereas in contrast, in PCA, what it does is it tries to minimize the magnitude of these blue lines, which are drawn at an angle. These are really the shortest, orthogonal distances: the shortest distance between the point x and this red line. And this gives very different effects depending on the dataset. And more generally, when you're doing linear regression, there is this distinguished variable y that we're trying to predict. All that linear regression does is take all the values of x and try to use them to predict y. Whereas in PCA, there is no distinguished or special variable y that we're trying to predict. Instead, we have a list of features, x1, x2, and so on, up to xn, and all of these features are treated equally, so no one of them is special.

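Finally, students sometimes ask how PCA differs from linear regression, because diagrams drawn to explain PCA, like the one on the left, can look like linear regression. Despite the cosmetic similarity, PCA is a completely different algorithm. Linear regression predicts the value of a variable y given features x. It fits a straight line that minimizes the squared error between the points and the line, where the error is the vertical distance between each point and the value predicted by the hypothesis, so the blue segments are drawn vertically. PCA, by contrast, minimizes the blue segments drawn orthogonally (at 90 degrees) to the line, which are the shortest distances between each point x and the red line. Depending on the data set, the two give very different results. More generally, linear regression has a distinguished variable y that it tries to predict, and it uses all the values of x to predict y. PCA has no distinguished or special variable y to predict. Instead there is a list of features x1, x2, ..., xn, all treated equally, and none of them is special.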
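The difference between vertical and orthogonal errors can be checked numerically. The sketch below uses made-up, already mean-normalized points and two closed-form slopes for lines through the origin: the ordinary least-squares slope Sxy / Sxx, and the slope of the first principal direction in 2D (the orthogonal-regression formula). The function names and data are assumptions for this example, and the PCA slope formula as written requires Sxy to be nonzero.

```python
import math

def second_moments(points):
    """Return (Sxx, Syy, Sxy) for mean-normalized 2D points."""
    sxx = sum(x * x for x, _ in points)
    syy = sum(y * y for _, y in points)
    sxy = sum(x * y for x, y in points)
    return sxx, syy, sxy

def regression_slope(points):
    """Slope minimizing the sum of squared VERTICAL distances (predicting y from x)."""
    sxx, _, sxy = second_moments(points)
    return sxy / sxx

def pca_slope(points):
    """Slope minimizing the sum of squared ORTHOGONAL distances (requires Sxy != 0)."""
    sxx, syy, sxy = second_moments(points)
    d = syy - sxx
    return (d + math.sqrt(d * d + 4 * sxy * sxy)) / (2 * sxy)

def vertical_error(points, slope):
    return sum((y - slope * x) ** 2 for x, y in points)

def orthogonal_error(points, slope):
    # For the line y = slope * x, orthogonal distance = vertical distance / sqrt(1 + slope^2).
    return vertical_error(points, slope) / (1 + slope * slope)

# Hypothetical noisy data with zero mean in both coordinates.
points = [(-2.0, -1.0), (-1.0, 1.0), (0.0, 0.0), (1.0, -1.0), (2.0, 1.0)]
s_reg, s_pca = regression_slope(points), pca_slope(points)

print(f"regression slope: {s_reg:.4f}, PCA slope: {s_pca:.4f}")
print(f"vertical error:   {vertical_error(points, s_reg):.4f} vs {vertical_error(points, s_pca):.4f}")
print(f"orthogonal error: {orthogonal_error(points, s_reg):.4f} vs {orthogonal_error(points, s_pca):.4f}")
```

Each line wins on its own criterion: the regression line has the smaller vertical error, and the PCA line has the smaller orthogonal error. Note also that swapping the roles of x and y would change the regression line, whereas PCA treats both coordinates symmetrically.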

As one last example, if I have three-dimensional data and I want to reduce it from 3D to 2D, maybe I want to find two directions, u(1) and u(2), onto which to project my data. Then what I have is three features, x1, x2, x3, and all of these are treated alike. All of these are treated symmetrically and there's no special variable y that I'm trying to predict. And so PCA is not linear regression, and even though at some cosmetic level they might look related, these are actually very different algorithms.

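One last example. Suppose we have three-dimensional data and want to reduce it from 3D to 2D, so we need to find two directions, u^(1) and u^(2), onto which to project the data. The three features x1, x2, and x3 are all treated alike: they are handled symmetrically, and there is no special variable y. So PCA is not linear regression. The two only look alike on the surface and are completely different algorithms.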

So hopefully you now understand what PCA is doing. It's trying to find a lower dimensional surface onto which to project the data so as to minimize this squared projection error, that is, to minimize the squared distance between each point and the location where it gets projected. In the next video, we'll start to talk about how to actually find this lower dimensional surface onto which to project the data.

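Hopefully you now understand what PCA does. It finds a lower-dimensional surface onto which to project the data so as to minimize the squared projection error, that is, the squared distance between each point and the location where it is projected. The next lecture explains how to actually find this lower-dimensional surface.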

Andrew Ng's Machine Learning video lecture

Summary

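PCA is one of the most popular algorithms for dimensionality reduction. Formally, it finds the lower-dimensional surface (or line) for which the sum of squared distances between the points and the surface is smallest when the data is projected orthogonally onto it. PCA reduces n-dimensional data to k dimensions, where k <= n: it finds a lower-dimensional surface, projects the data onto it orthogonally, and does so in the way that minimizes the squared projection error.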

For example, to reduce two-dimensional data to one dimension, the goal of PCA is to find a vector u^(1). To reduce three-dimensional data to two dimensions, it is to find vectors u^(1) and u^(2).

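Despite the cosmetic similarity, PCA is a completely different algorithm from linear regression. Linear regression predicts the value of a variable y given features x, whereas PCA has no label y and treats all features equally. Linear regression minimizes the squared error between the points and the line, where the error is the vertical distance between each point and the value predicted by the hypothesis. PCA instead minimizes the blue segments drawn orthogonally (at 90 degrees) to the line, which are the shortest distances between each point x and the red line.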

Quiz

Suppose we run PCA on the data set below. Which value would be the most reasonable choice for the vector u^(1) onto which to project the data?

The answer is option 4.
