by 라인하트 Nov 26. 2020

Andrew Ng's Machine Learning (14-2): Data Visualization

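Professor Andrew Ng, co-founder of the online learning platform Coursera, is a leading figure in the AI field. The introductory machine learning course he taught at Stanford University is available for free on Coursera (Coursera.org). It is an essential course for machine learning beginners, and one you naturally run into when studying AI and machine learning on your own.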

Dimensionality Reduction

Motivation

Motivation II: Visualization

In the last video, we talked about dimensionality reduction for the purpose of compressing the data. In this video, I'd like to tell you about a second application of dimensionality reduction, which is visualizing the data. For a lot of machine learning applications, it really helps us develop effective learning algorithms if we can understand our data better, for example by finding some way of visualizing it. Dimensionality reduction often offers us another useful tool to do so.

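The previous lecture covered the first purpose of dimensionality reduction, data compression. This lecture covers the second purpose: data visualization. In many machine learning applications, understanding the data better lets you develop learning algorithms more effectively, and dimensionality reduction is a very useful tool for visualizing data.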

Let's start with an example. Let's say we've collected a large data set of many statistics and facts about different countries around the world. So maybe the first feature, x1, is the country's GDP (Gross Domestic Product), x2 is the per capita GDP, meaning the per-person GDP, x3 is the human development index, x4 is life expectancy, then x5, x6, and so on. We may have a huge data set like this, with maybe 50 features for every country and a huge set of countries.

So is there something we can do to try to understand our data better? Given this huge table of numbers, how do you visualize the data? If you have 50 features, it's very difficult to plot 50-dimensional data. What is a good way to examine this data? Using dimensionality reduction, instead of having each country represented by its 50-dimensional feature vector x^(i), say, instead of having 50 numbers to represent the features of a country like Canada, we can come up with a different feature representation: vectors z^(i) in R^2.

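Here is an example. Suppose we have a large data set of statistics about countries around the world. The first feature x1 is the country's GDP (gross domestic product), x2 is per capita GDP, x3 is the human development index, x4 is life expectancy, x5 is the poverty index, and x6 is mean household income. Each country has 50 features, and there are many countries.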

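What can we do to understand this data better? A table of numbers is much easier to understand once it is visualized, but 50-dimensional data with 50 features is very difficult to plot directly. Dimensionality reduction lets us replace each country's 50-dimensional feature vector x^(i) with a lower-dimensional one. For example, the 50 numbers describing Canada become a two-dimensional vector z^(i) in R^2.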
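The reduction from 50 numbers per country to 2 can be sketched in a few lines. This is only an illustration under stated assumptions: the data is random and synthetic, the function name `reduce_to_2d` is hypothetical, and the projection uses PCA computed through an SVD, which is one common choice and not something this lecture has specified yet.

```python
import numpy as np

def reduce_to_2d(X):
    """Project the rows of X (m examples x n features) onto 2 dimensions.

    Assumption: PCA via SVD is used as the reduction method.
    """
    # Standardize each feature so that large-scale features do not dominate.
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # avoid dividing by zero for constant features
    X_norm = (X - mu) / sigma

    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_norm, full_matrices=False)

    # Keep only the first two directions: each x^(i) becomes z^(i) in R^2.
    return X_norm @ Vt[:2].T

# Hypothetical data: 100 countries, 50 features each (random, for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
Z = reduce_to_2d(X)
print(Z.shape)
```

Each row of `Z` is a pair (z1, z2) for one country, so the two columns could be handed directly to a scatter plot in any plotting library.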

If that's the case, if we can have just a pair of numbers, z1 and z2, that somehow summarizes my 50 numbers, then what we can do is plot these countries in R^2 and use that to try to understand the space of features of different countries better. So here, what you can do is reduce the data from 50 dimensions to 2D, so you can plot it as a two-dimensional plot. When you do that, it turns out that the output of dimensionality reduction algorithms usually doesn't ascribe a physical meaning to these new features. It's often up to us to figure out roughly what these features mean. But if you plot those features, here is what you might find.

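If 50 numbers can be summarized by 2 numbers, the countries can be plotted on a two-dimensional plane. Reducing the data from 50 dimensions to 2 gives new features z1 and z2, and a 2D plot of them makes the feature space of the countries much easier to understand. The dimensionality reduction algorithm usually does not attach a physical meaning to the new features z1 and z2; working out roughly what they mean is up to you. Still, plotting the features gives a result like the following.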

So here, every country is represented by a point z^(i), which is in R^2, and each dot in this figure represents a country. Here's z1 and here's z2. You might find, for example, that the horizontal axis, the z1 axis, corresponds roughly to the overall country size, or the overall economic activity of a country: the overall GDP, the overall economic size of a country. Whereas the vertical axis in our data might correspond to the per-person GDP, or the per-person well-being, or the per-person economic activity. And you might find that, given these 50 features, these are really the two main dimensions of the variation.

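Every country is shown here as a point z^(i). Each z^(i) is a vector in R^2, and each dot represents one country. The horizontal axis is z1 and the vertical axis is z2. z1 roughly corresponds to country size or overall economic activity, such as total GDP. z2 roughly corresponds to per-person GDP, well-being, or economic activity. Across the 50 features, these turn out to be the two main dimensions of variation.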
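Since the algorithm does not label the new axes, one rough way to guess what z1 and z2 mean is to check which original features each one correlates with most strongly. The sketch below assumes a small synthetic data set in which two hidden factors (overall size and per-person well-being) drive six made-up features. The feature names, the noise level, and the use of PCA via SVD are all assumptions for illustration, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 200  # hypothetical number of countries

# Two hidden factors that drive the synthetic features (assumption for this demo).
size = rng.normal(size=m)
wellbeing = rng.normal(size=m)

# Hypothetical feature names: the first four follow size, the last two well-being.
names = ["gdp", "exports", "energy_use", "population",
         "gdp_per_capita", "life_expectancy"]
drivers = [size, size, size, size, wellbeing, wellbeing]
X = np.column_stack([d + 0.3 * rng.normal(size=m) for d in drivers])

# Standardize, then project onto the top two principal directions.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
_, _, Vt = np.linalg.svd(X_norm, full_matrices=False)
Z = X_norm @ Vt[:2].T

def strongest_feature(z):
    """Name of the original feature most correlated (in absolute value) with z."""
    corr = [abs(np.corrcoef(z, X[:, j])[0, 1]) for j in range(X.shape[1])]
    return names[int(np.argmax(corr))]

top_z1 = strongest_feature(Z[:, 0])
top_z2 = strongest_feature(Z[:, 1])
print("z1 tracks:", top_z1)
print("z2 tracks:", top_z2)
```

On real data the picture is rarely this clean, so a correlation check like this is a starting point for interpretation and not a definition of what the axes mean.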

And so, out here you may have a country like the U.S.A., which has a very large GDP and a relatively high per-person GDP as well. Whereas here you might have a country like Singapore, which actually has a very high per-person GDP as well, but because Singapore is a much smaller country, the overall economy size of Singapore is much smaller than the US. And over here, you would have countries where individuals are unfortunately somewhat less well off, with maybe shorter life expectancy, less health care, less economic maturity, and that are also smaller countries. Whereas a point like this will correspond to a country that has a substantial amount of economic activity, but where individuals tend to be somewhat less well off. So you might find that the axes z1 and z2 can help you most succinctly capture what are really the two main dimensions of the variation amongst different countries, such as the overall economic activity of the country, reflected by the size of the country's overall economy, as well as the per-person individual well-being, measured by per-person GDP, per-person healthcare, and things like that.

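A country like the USA sits at the far upper right, because it has a very large GDP and a high per-person GDP. Singapore sits at the top center: its per-person GDP is very high, but it is a much smaller country, so its overall economy is much smaller than that of the US. The lower left holds countries where individuals are less well off, with shorter life expectancy, less health care, and a smaller economy. Countries in the lower right, by contrast, have substantial overall economic activity, but their individuals tend to be somewhat less well off. The z1 and z2 axes therefore capture, as succinctly as two variables can, the main differences between countries: the overall size of the national economy, and per-person well-being as measured by per-person GDP, per-person health care, and so on.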

So that's how you can use dimensionality reduction, in order to reduce data from 50 dimensions or whatever, down to two dimensions, or maybe down to three dimensions, so that you can plot it and understand your data better.

In the next video, we'll start to develop a specific algorithm, called PCA, or Principal Component Analysis, which will allow us to do this and also do the earlier application I talked about of compressing the data.

So far we have seen how dimensionality reduction can reduce 50-dimensional data to two or three dimensions so that it can be plotted and understood better.

The next lecture introduces an algorithm called principal component analysis (PCA). PCA can be used for this kind of visualization as well as for the data compression covered earlier.

Andrew Ng's Machine Learning video lecture

Summary

In many machine learning applications, understanding the data better is a big help in developing effective learning algorithms. Dimensionality reduction is a very useful way to visualize data better.

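For example, when each country has 50 key indicators as features, it is hard to assess a country using all 50 features at once. Reducing the 50-dimensional indicators to two-dimensional data and drawing a 2D plot makes the data much easier for people to understand.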

In short, using dimensionality reduction to shrink 50-dimensional data to two or three dimensions and plotting it lets people understand the data better visually.

Quiz

Suppose you have a data set x^(i). To visualize it, you reduce it to k dimensions. Which of the following is true in a typical setting?

The correct answers are options 2 and 4.
