brunch

You can make anything
by writing

C.S.Lewis

by 라인하트 Nov 22. 2020

앤드류 응의 머신러닝(13-2): K-평균 알고리즘

온라인 강의 플랫폼 코세라의 창립자인 앤드류 응 (Andrew Ng) 교수는 인공지능 업계의 거장입니다. 그가 스탠퍼드 대학에서 머신 러닝 입문자에게 한 강의를 그대로 코세라 온라인 강의 (Coursera.org)에서 무료로 배울 수 있습니다. 이 강의는 머신러닝 입문자들의 필수코스입니다. 인공지능과 머신러닝을 혼자 공부하면서 자연스럽게 만나게 되는 강의입니다.

Unsupervised Learning

비지도 학습

Clustering (클러스터링)

K-Means Algoritm (K-평균 알고리즘)

In the clustering problem we are given an unlabeled data set and we would like to have an algorithm automatically group the data into coherent subsets or into coherent clusters for us. The K Means algorithm is by far the most popular, by far the most widely used clustering algorithm, and in this video I would like to tell you what the K Means Algorithm is and how it works.

클러스터링 알고리즘은 레이블이 지정되지 않은 데이터 셋을 자동으로 일정한 클러스터, 집합 또는 그룹으로 묶습니다. 가장 많이 사용하는 클러스터링 알고리즘 중에 하나는 K-평균 알고리즘입니다. 이번 강의는 K-평균(K-Means) 알고리즘 의 정의와 동작 방식을 설명합니다.

The K means clustering algorithm is best illustrated in pictures. Let's say I want to take an unlabeled data set like the one shown here, and I want to group the data into two clusters.

K-평균 클러스터링 알고리즘은 그림에 잘 나타나 있습니다. 여기서 레이블이 없는 데이터 셋을 두 개의 클러스터로 그룹화할 것입니다.

If I run the K Means clustering algorithm, here is what I'm going to do. The first step is to randomly initialize two points, called the cluster centroids. So, these two crosses here, these are called the Cluster Centroids and I have two of them because I want to group my data into two clusters. K Means is an iterative algorithm and it does two things. First is a cluster assignment step, and second is a move centroid step. So, let me tell you what those things mean.

K-평균 클러스터링 알고리즘의 동작 방식은 다음과 같습니다. 우선 클러스터의 중심이라고 예상되는 두 개의 점을 무작위로 초기화합니다. 십자가 모양은 클러스터 중심(Cluster Centroid)이고, 두 개의 클러스터로 그룹화하기 위해 두 점을 초기화합니다. K-평균은 두 개의 단계를 반복하는 알고리즘입니다. 첫 번째 단계는 클러스터 할당(Cluster Assignment)이고 두 번째 단계는 클러스터 중심이동(move centroid)입니다.

The first of the two steps in the loop of K means, is this cluster assignment step. What that means is that, it's going through each of the examples, each of these green dots shown here and depending on whether it's closer to the red cluster centroid or the blue cluster centroid, it is going to assign each of the data points to one of the two cluster centroids. Specifically, what I mean by that, is to go through your data set and color each of the points either red or blue, depending on whether it is closer to the red cluster centroid or the blue cluster centroid, and I've done that in this diagram here. So, that was the cluster assignment step.

K-평균 루프의 첫 번째는 클러스터 할당(Cluster assignment) 단계입니다. 클러스터 할당 단계는 녹색점이 빨간색 클러스터 중심과 파란색 클러스터 중심 중에 어느 중심에 더 가까운지를 계산합니다. 각 예제를 가장 가까운 클러스터 중심에 할당합니다. 예를 들면, 데이터 셋의 학습 예제가 어느 클러스터 중심이 가장 가까운 지에 따라 빨간색 또는 파란색을 칠합니다. 클러스터 할당 단계는 모든 학습 예제에 클러스터를 할당합니다.

The other part of K means, in the loop of K means, is the move centroid step, and what we are going to do is, we are going to take the two cluster centroids, that is, the red cross and the blue cross, and we are going to move them to the average of the points colored the same colour. So what we are going to do is look at all the red points and compute the average, really the mean of the location of all the red points, and we are going to move the red cluster centroid there. And the same things for the blue cluster centroid, look at all the blue dots and compute their mean, and then move the blue cluster centroid there.

K-평균 루프의 두 번째는 클러스트 중심 이동(Move centroid) 단계입니다. 클러스터 중심 빨간색 십자가와 파란색 십자가를 같은 색으로 칠해진 데이터 셋의 평균값으로 이동합니다. 즉, 모든 빨간색 점들의 평균을 계산하고 빨간색 십자가를 새로운 평균으로 이동합니다. 모든 파란색 점들의 평균을 계산합니다. 파란색 점들의 위치의 평균으로 파란색 십자가를 이동합니다.

So, let me do that now. We're going to move the cluster centroids as follows and I've now moved them to their new means. The red one moved like that and the blue one moved like that and the red one moved like that. And then we go back to another cluster assignment step, so we're again going to look at all of my unlabeled examples and depending on whether it's closer the red or the blue cluster centroid, I'm going to color them either red or blue. I'm going to assign each point to one of the two cluster centroids, so let me do that now. And so the colors of some of the points just changed.

클러스터 중심을 나타내는 두 십자가는 새로운 평균으로 이동합니다. 그리고 클러스터 할당 단계를 다시 시작합니다. 모든 학습 예제에 대해 다시 빨간색 또는 파란색 클러스터 중심 중에 어느 것에 더 가까운 지를 계산하고 색을 칠합니다.

And then I'm going to do another move centroid step. So I'm going to compute the average of all the blue points, compute the average of all the red points and move my cluster centroids like this.

이제 클러스터 중심 이동 단계를 실행합니다. 모든 파란색 점들의 평균을 계산하고 모든 빨간색 점들의 평균을 계산한 후 클러스터 중심을 이동합니다.

And so, let's do that again. Let me do one more cluster assignment step. So colour each point red or blue, based on what it's closer to

그리고 다시 반복합니다. 클러스터 할당 단계를 수행합니다. 각 점을 빨간색 또는 파란색으로 색칠합니다.

And then do another move centroid step and we're done. And in fact if you keep running additional iterations of K means from here the cluster centroids will not change any further and the colours of the points will not change any further. And so, this is the, at this point, K means has converged and it's done a pretty good job finding the two clusters in this data.

각 색깔들의 평균을 구한 다음 클러스터 중심을 이동합니다. 실제로 K-평균을 반복적으로 계속 실행하면 클러스터 중심이 더 이상 변하지 않고 학습 예제의 색상이 더 이상 변경되지 않는 단계에 다릅니다. 따라서, K-평균은 수렴되고 두 클러스터를 잘 그룹핑합니다.

Let's write out the K means algorithm more formally. The K means algorithm takes two inputs. One is a parameter K, which is the number of clusters you want to find in the data. I'll later say how we might go about trying to choose k, but for now let's just say that we've decided we want a certain number of clusters and we're going to tell the algorithm how many clusters we think there are in the data set. And then K means also takes as input this sort of unlabeled training set of just the Xs and because this is unsupervised learning, we don't have the labels Y anymore. And for unsupervised learning of the K means I'm going to use the convention that XI is an RN dimensional vector. And that's why my training examples are now N dimensional rather N plus one dimensional vectors. This is what the K means algorithm does.

K-평균 알고리즘을 수학적으로 정리합니다. K-평균 알고리즘은 두 개의 입력이 필요합니다. 하나는 데이터 셋에서 클러스터의 수를 나타내는 파라미터 K입니다. 나중에 K를 선택하는 방법을 설명할 것입니다. 지금은 데이터 셋에 얼마나 많은 수의 클러스터가 필요한 지를 알고리즘에 알려줍니다. 또 다른 하나는 레이블이 없는 x들의 학습 셋입니다. K-평균 알고리즘은 레이블 y가 없는 비지도 학습입니다. 그리고 x^(i)는 R^(n) 차원 벡터입니다. x0는 없습니다. 이것이 학습 예제가 R^(n+1) 차원 벡터가 아니라 R^(n) 차원 벡터인 이유입니다. 이것이 K-평균 알고리즘의 동작 방식입니다.

The first step is that it randomly initializes k cluster centroids which we will call mu 1, mu 2, up to mu k. And so in the earlier diagram, the cluster centroids corresponded to the location of the red cross and the location of the blue cross. So there we had two cluster centroids, so maybe the red cross was mu 1 and the blue cross was mu 2, and more generally we would have k cluster centroids rather than just 2. Then the inner loop of k means does the following, we're going to repeatedly do the following.

First for each of my training examples, I'm going to set this variable CI to be the index 1 through K of the cluster centroid closest to XI. So this was my cluster assignment step, where we took each of my examples and coloured it either red or blue, depending on which cluster centroid it was closest to. So CI is going to be a number from 1 to K that tells us, you know, is it closer to the red cross or is it closer to the blue cross, and another way of writing this is I'm going to, to compute Ci, I'm going to take my Ith example Xi and and I'm going to measure it's distance to each of my cluster centroids, this is mu and then lower-case k, right, so capital K is the total number centroids and I'm going to use lower case k here to index into the different centroids. But so, Ci is going to, I'm going to minimize over my values of k and find the value of K that minimizes this distance between Xi and the cluster centroid, and then, you know, the value of k that minimizes this, that's what gets set in Ci.

첫 번째 단계는 클러스터 중심을 무작위로 초기화하는 것입니다. 따라서, 앞의 다이어그램을 상기해 봅시다. 클러스터 중심은 빨간색 십자가와 파란색 십자가입니다. 빨간색 십자가는 μ1, 파란색 십자가는 μ2입니다. 일반적으로 클러스터 중심은 2가 아닌 K 개입니다. 그리고, K 평균의 내부 루프는 다음과 같고 반복적으로 수행할 것입니다.

먼저 각 학습 예제 x^(i)에 대한 변수 C^(i)는 어떤 클러스터 중심이 x^(i)에 가장 가까운 지를 나타내는 인덱스이고, 1부터 K까지의 값 중 하나입니다. 첫 번째 For 루프는 클러스터 할당 단계입니다. 여기서 각 예제에 대해 가장 가까운 클러스터 중심에 따라 빨간색 또는 파란색으로 지정합니다. 그래서 C^(i)는 1에서 K까지의 숫자입니다. C^(i)는 빨간색 십자가에 더 가까운 지 파란색 십자가에 더 가까운 지를 알려줍니다. C^(i)를 표현하는 또 다른 방법이 있습니다. C^(i)는 x^(i) 예제를 각 클러스터 중심까지의 거리를 측정합니다.

여기서 k는 소문자입니다. 대문자 K는 클러스터 중심의 총 수이고 소문자 k는 클러스터 중심을 인덱싱 합니다. C^(i)는 k값을 최소화하고 x^(i)와 클러스터 중심 사이의 거리를 최소화하는 k 값을 찾을 것입니다. 이것이 첫 번째 For 루프에서 C^(i)가 하는 일입니다.

So, here's another way of writing out what Ci is. If I write the norm between Xi minus Mu-k, then this is the distance between my ith training example Xi and the cluster centroid Mu subscript K, this is--this here, that's a lowercase K. So uppercase K is going to be used to denote the total number of cluster centroids, and this lowercase K's a number between one and capital K. I'm just using lower case K to index into my different cluster centroids. Next is lower case k. So that's the distance between the example and the cluster centroid and so what I'm going to do is find the value of K, of lower case k that minimizes this, and so the value of k that minimizes you know, that's what I'm going to set as Ci, and by convention here I've written the distance between Xi and the cluster centroid, by convention people actually tend to write this as the squared distance. So we think of Ci as picking the cluster centroid with the smallest squared distance to my training example Xi. But of course minimizing squared distance, and minimizing distance that should give you the same value of Ci, but we usually put in the square there, just as the convention that people use for K means. So that was the cluster assignment step.

C^(i)를 표현하는 또 다른 방법이 있습니다.

즉, 이것은 i 번째 학습 예제 x^(i)와 μk 사이의 거리입니다. 여기서 k는 소문자입니다. 대문자 K는 클러스터 중심의 총 수이고, 소문자 k는 1과 대문자 K 사이의 숫자입니다. 소문자 k는 클러스터 중심을 인덱싱 합니다. ||x^(i) - μk||은 학습 예제와 클러스터 중심 사이의 거리이고 ||x^(i) - μk|| 을 최소화하는 소문자 k의 값을 찾습니다.

||x^(i) - μk||를 최소화하는 k값은 C^(i) 값으로 설정합니다. ||x^(i) - μk||로 표현했지만, 관례적으로 사람들은 ||x^(i) - μk||^2으로 표현하기도 합니다. 그래서 C^(i)를 학습 예제 x^(i)와 클러스터 중심과의 가장 작은 거리의 제곱입니다.

이것이 클러스터 할당 단계입니다.

The other in the loop of K means does the move centroid step. And what that does is for each of my cluster centroids, so for lower case k equals 1 through K, it sets Mu-k equals to the average of the points assigned to cluster. So as a concrete example, let's say that one of my cluster centroids, let's say cluster centroid two, has training examples, you know, 1, 5, 6, and 10 assigned to it. And what this means is, really this means that C1 equals to C5 equals to C6 equals to and similarly well c10 equals, too, right? If we got that from the cluster assignment step, then that means examples 1,5,6 and 10 were assigned to the cluster centroid two. Then in this move centroid step, what I'm going to do is just compute the average of these four things. So X1 plus X5 plus X6 plus X10. And now I'm going to average them so here I have four points assigned to this cluster centroid, just take one quarter of that. And now Mu2 is going to be an n-dimensional vector. Because each of these example x1, x5, x6, x10 each of them were an n-dimensional vector, and I'm going to add up these things and, you know, divide by four because I have four points assigned to this cluster centroid, I end up with my move centroid step, for my cluster centroid mu-2. This has the effect of moving mu-2 to the average of the four points listed here.

One thing that I've asked is, well here we said, let's let mu-k be the average of the points assigned to the cluster. But what if there is a cluster centroid no points with zero points assigned to it. In that case the more common thing to do is to just eliminate that cluster centroid. And if you do that, you end up with K minus one clusters instead of k clusters. Sometimes if you really need k clusters, then the other thing you can do if you have a cluster centroid with no points assigned to it is you can just randomly reinitialize that cluster centroid, but it's more common to just eliminate a cluster if somewhere during K means it with no points assigned to that cluster centroid, and that can happen, altthough in practice it happens not that often.

So that's the K means Algorithm. Before wrapping up this video I just want to tell you about one other common application of K Means and that's to the problems with non well separated clusters.

K 평균 알고리즘의 루프에서 두 번째 역할이자 슬라이드에서 두 번째 For 루프는 클러스터 중심 이동 단계입니다. μk에서 소문자 k는 1에서 대문자 K까지 중 하나의 값입니다. 여기서 μk는 클러스터 k에 할당된 평균값입니다. 구체적인 예를 들면, 하나의 클러스터 중심 μ2에 할당된 학습 예제가 4개가 있다고 가정합니다.

따라서, 클러스터 할당 단계에서 x^(1), x^(5), x(6), x^(10)는 클러스터 중심 μ2에 할당합니다. 그런 다음 클러스터 중심 이동 단계에서 네 가지 학습 예제의 평균을 계산합니다.

μ2는 R^(n) 차원 벡터입니다. x^(1), x^(5), x(6), x^(10) 학습 예제는 R^(n) 차원 벡터였기 때문입니다. 따라서 클러스터 중심 μ2는 새로운 평균값으로 이동합니다.

여기서 μk는 클러스터에 할당된 새로운 평균값입니다. 특정 클러스터 중심에 할당된 데이터가 없어서 0의 값을 가진다면 어떻게 될까요? 일반적으로 클라스터 중심을 제거합니다. 따라서 k 개의 클러스터 대신에 k-1 개의 클러스터입니다. 만일 k개의 클러스터가 반드시 필요하다면, 무작위로 k개의 클러스터를 초기화합니다. 자주 있는 일은 아니지만 간혹 클러스터 중심에 할당된 데이터가 없을 수 있습니다.

이것이 K-평균 알고리즘입니다. 이번 강의를 마무리하기 전에 K-평균을 활용할 수 있는 일반적인 사례를 설명합니다. 지금부터는 잘 분리되지 않는 클러스터 문제입니다.

Here's what I mean. So far we've been picturing K Means and applying it to data sets like that shown here where we have three pretty well separated clusters, and we'd like an algorithm to find maybe the 3 clusters for us. But it turns out that very often K Means is also applied to data sets that look like this where there may not be several very well separated clusters.

Here is an example application, to t-shirt sizing. Let's say you are a t-shirt manufacturer you've done is you've gone to the population that you want to sell t-shirts to, and you've collected a number of examples of the height and weight of these people in your population and so, well I guess height and weight tend to be positively highlighted so maybe you end up with a data set like this, you know, with a sample or set of examples of different peoples heights and weight. Let's say you want to size your t shirts. Let's say I want to design and sell t shirts of three sizes, small, medium and large. So how big should I make my small one? How big should I my medium? And how big should I make my large t-shirts.

지금까지 K-평균 알고리즘은 여기에 표시된 데이터 셋과 같이 클리스터로 꽤 잘 분리된 데이터를 다루었습니다. 여기서 3 개의 클러스터를 찾는 알고리즘이 필요합니다. 그러나 K-평균은 잘 분리된 클러스터가 없을 수 있는 데이터 셋에 적용하는 경우가 많습니다.

오른쪽 그림은 티셔츠 크기를 조장하는 응용 프로그램 예제입니다. 티셔츠 제조 기업은 사람들의 키와 몸무게에 대한 데이터를 수집했습니다. 티셔츠 제작에 인구의 키와 몸무게의 데이터에서 의미 있는 정보를 추출할 것입니다. 오른쪽 그림의 데이터 셋에서 티셔츠의 크기를 결정하고자 합니다. 티셔츠를 Small, Medium, Large의 세 가지 크기로 디자인하여 판매할 것입니다. Small 사이즈의 티셔츠, Medium 사이즈의 티셔츠 그리고 Large 사이즈의 티셔츠를 어떤 크기로 만들어야 할까요?

One way to do that would to be to run my k means clustering logarithm on this data set that I have shown on the right and maybe what K Means will do is group all of these points into one cluster and group all of these points into a second cluster and group all of those points into a third cluster. So, even though the data, you know, before hand it didn't seem like we had 3 well separated clusters, K Means will kind of separate out the data into multiple pluses for you. And what you can do is then look at this first population of people and look at them and, you know, look at the height and weight, and try to design a small t-shirt so that it kind of fits this first population of people well and then design a medium t-shirt and design a large t-shirt. And this is in fact kind of an example of market segmentation where you're using K Means to separate your market into 3 different segments. So you can design a product separately that is a small, medium, and large t-shirts, that tries to suit the needs of each of your 3 separate sub-populations well.

한 가지 방법은 데이터 셋에 K-평균 클러스터링 알고리즘을 실행하는 것입니다. K-평균 알고리즘은 데이터를 크게 세 개의 클러스터로 그룹화합니다. 데이터가 잘 분리된 것처럼 보이지는 않지만 3개의 클러스터로 분리합니다. 그리고, 첫 번째 인구 집단의 키와 몸무게를 보고 Small 사이즈 티셔츠를 디자인합니다. 두 번째 인구 집단의 키와 몸무게를 보고 Medium 사이즈 티셔츠를 디자인합니다. 마지막으로 세 번째 인구 집단의 키와 몸무게를 보고 Large 사이즈 티셔츠를 디자인합니다. 이것은 실제로 K-평균 알고리즘이 시장을 3 개의 다른 세그먼트로 분리하는 시장 세분화의 사례입니다. 따라서, 세 가지 클러스터에 맞는 Small, Medium, Large 티셔츠를 디자인할 수 있습니다.

So that's the K Means algorithm. And by now you should know how to implement the K Means Algorithm and kind of get it to work for some problems. But in the next few videos what I want to do is really get more deeply into the nuts and bolts of K means and to talk a bit about how to actually get this to work really well.

이것이 K-평균 알고리즘입니다. K-평균 알고리즘을 구현하는 방법과 어떻게 작동하는 지를 알아야 합니다. 하지만, 다음 몇 개의 강의에서 K-평균 알고리즘을 더욱 깊고 자세하게 이해하고 실제로 잘 동작하게 만드는 방법에 대해 공부할 것입니다.

앤드류 응의 머신러닝 동영상 강의

정리하며

클러스터링 알고리즘은 레이블이 없는 데이터 셋을 자동으로 일정한 클러스터, 집합 또는 그룹으로 묶습니다. 대표적인 알고리즘은 K-평균 알고리즘입니다.

K-평균 알고리즘은 클러스터의 중심이라고 예상되는 위치를 무작위로 초기화합니다. 그리고 두 가지 작업을 반복 수행합니다. 하나는 클러스터링 할당 단계로 학습 예제가 어떤 클러스터 중심에 가까운 지를 결정합니다. 두 번째는 클러스터 중심 이동 단계로 같은 클러스터의 학습 예제들의 평균값을 계산하고 그곳으로 클러스터 중심이 이동합니다. 이 두 작업을 반복적으로 수행하면 클러스터 중심이 더 이상 변하지 않고 학습 예제의 클러스터가 더 이상 변하지 않는 단계에 이릅니다. 이것을 K-평균 알고리즘이 수렴한다고 합니다.

클러스터 할당 단계에서는 다음과 공식을 사용합니다.

대문자 K는 클러스터 또는 클러스터 중심의 총 수이고 소문자 k는 클러스터 중심을 인덱싱 합니다. C^(i)는 x^(i)와 클러스터 중심 사이의 거리가 가장 짧은 클러스터 중심의 k 값으로 클러스터 중심 μk입니다. 예를 들어, x^(1)가 클러스 클러스터 중심 μ1, μ2, μ3, μ4 중에서 μ2와의 거리가 제일 작다면, C^(1) = 2입니다.

또한, K-평균 알고리즘은 분류가 명확한 데이터 셋 뿐만 아니라 분류가 명확하지 않은 데이터에도 적용할 수 있습니다. 예를 들면, 티셔프의 사이즈가 S, M, L과 같이 세 종류로 나누고자 할 때 어떤 크기에 맞출지를 K-평균 알고리즘은 계산할 수 있습니다. 사람들의 키와 몸무게 데이터를 기준으로 크기를 결정합니다.