brunch

You can make anything
by writing

C.S.Lewis

by 라인하트 Nov 24. 2020

앤드류 응의 머신러닝(13-5): K-평균 클러스터 수

온라인 강의 플랫폼 코세라의 창립자인 앤드류 응 (Andrew Ng) 교수는 인공지능 업계의 거장입니다. 그가 스탠퍼드 대학에서 머신 러닝 입문자에게 한 강의를 그대로 코세라 온라인 강의 (Coursera.org)에서 무료로 배울 수 있습니다. 이 강의는 머신러닝 입문자들의 필수코스입니다. 인공지능과 머신러닝을 혼자 공부하면서 자연스럽게 만나게 되는 강의입니다.

Unsupervised Learning

비지도 학습

Clustering (클러스터링)

Choosing the Number of Clusters (클러스터의 수를 선택하기)

In this video I'd like to talk about one last detail of K-means clustering which is how to choose the number of clusters, or how to choose the value of the parameter capsule K. To be honest, there actually isn't a great way of answering this or doing this automatically and by far the most common way of choosing the number of clusters, is still choosing it manually by looking at visualizations or by looking at the output of the clustering algorithm or something else.

이번 강의는 K-평균 알고리즘에서 클러스터의 수를 결정하는 방법 또는 파라미터 K의 값을 결정하는 방법을 자세히 다룹니다. 솔직히 가장 좋은 해답이나 자동으로 해결하는 방법은 없습니다. 일반적으로 직접 시각화된 데이터를 살펴보거나 클러스터링 알고리즘의 결과를 확인하는 것입니다.

But I do get asked this question quite a lot of how do you choose the number of clusters, and so I just want to tell you know what are peoples' current thinking on it although, the most common thing is actually to choose the number of clusters by hand. A large part of why it might not always be easy to choose the number of clusters is that it is often generally ambiguous how many clusters there are in the data.

하지만, 저는 클러스터의 수를 선택하는 방법에 대한 질문을 많이 받습니다. 가장 일반적인 방법은 직접 클러스터의 수를 결정하는 것일지라도 사람들이 생각하는 방식을 설명합니다. 클러스터의 수를 결정하는 것이 항상 쉽지 않은 이유는 데이터에 있는 클러스터의 수가 모호하기 때문입니다.

Looking at this data set some of you may see four clusters.

여기 데이트 셋이 있습니다.

And that would suggest using K equals 4. Or some of you may see two clusters.

4 개의 클러스터로 나뉠 수 있을 것 같아서 K = 4를 제안할 수 있습니다.

And that will suggest K equals 2 and now this may see three clusters. And so, looking at the data set like this, the true number of clusters, it actually seems genuinely ambiguous to me, and I don't think there is one right answer. And this is part of unsupervised learning. We are aren't given labels, and so there isn't always a clear cut answer. And this is one of the things that makes it more difficult to say, have an automatic algorithm for choosing how many clusters to have.

또는 2개의 클러스터로 나뉠 수 있을 것 같아서 K = 2를 제안할 수 있습니다. 또는 3개의 클러스터로 나뉠 수 있을 것 같아서 K = 3을 제안합니다. 그래서, 데이터 셋에 대한 실제 클러스터의 개수를 생각해보면 정말 모호하고 정답은 없는 것 같습니다. 이것은 비지도 학습의 일부입니다. 레이블이 없는 학습 예제를 활용하기 때문에 항상 명확한 답이 있는 것은 아닙니다. 그리고 이것이 얼마나 많은 클러스터를 사용할지를 결정하는 자동 알고리즘을 갖는 것이 어려운 이유 중 하나입니다.

When people talk about ways of choosing the number of clusters, one method that people sometimes talk about is something called the Elbow Method. Let me just tell you a little bit about that, and then mention some of its advantages but also shortcomings. So the Elbow Method, what we're going to do is vary K, which is the total number of clusters. So, we're going to run K-means with one cluster, that means really, everything gets grouped into a single cluster and compute the cost function or compute the distortion J and plot that here. And then we're going to run K means with two clusters, maybe with multiple random initial agents, maybe not. But then, you know, with two clusters we should get, hopefully, a smaller distortion, and so plot that there. And then run K-means with three clusters, hopefully, you get even smaller distortion and plot that there. I'm gonna run K-means with four, five and so on. And so we end up with a curve showing how the distortion, you know, goes down as we increase the number of clusters. And so we get a curve that maybe looks like this.

클러스터의 수를 선택할 때 사용할 수 있는 방법 중에 엘보 메쏘드(팔꿈치 방법)가 있습니다. 엘보 메쏘의 장점과 단점을 정리합니다. 엘보 메쏘드는 클러스터의 총 수인 K를 변화시키는 것입니다. 하나의 클러스터로 K-평균 알고리즘을 실행하여 비용 함수를 계산하거나 왜곡 J를 계산하고 도식화합니다. 그리고 두 개의 클러스터로 K-평균 알고리즘을 실행합니다. 여러 번의 랜덤 초기화를 하거나 하지 않을 수도 있습니다. 두 개의 클러스터를 사용하여 K-평균 알고리즘이 더 작은 왜곡을 반환하였다면 여기에 도식화합니다. 그런 다음 세 개의 클러스터로 K-평균 알고리즘을 실행하면 더 작은 왜곡을 반환하고 도식화합니다. 4개, 5개 등의 클러스터로 순차적으로 K-평균 알고리즘을 실행합니다. 결국 클러스터의 수를 늘릴 때 왜곡이 어떻게 감소하는 지를 보여주는 곡선을 그릴 수 있습니다. 그래서 이와 같은 곡선을 얻을 수 있습니다.

And if you look at this curve, what the Elbow Method does it says "Well, let's look at this plot. Looks like there's a clear elbow there". Right, this is, would be by analogy to the human arm where, you know, if you imagine that you reach out your arm, then, this is your shoulder joint, this is your elbow joint and I guess, your hand is at the end over here. And so this is the Elbow Method. Then you find this sort of pattern where the distortion goes down rapidly from 1 to 2, and 2 to 3, and then you reach an elbow at 3, and then the distortion goes down very slowly after that. And then it looks like, you know what, maybe using three clusters is the right number of clusters, because that's the elbow of this curve, right? That it goes down, distortion goes down rapidly until K equals 3, really goes down very slowly after that. So let's pick K equals 3. If you apply the Elbow Method, and if you get a plot that actually looks like this, then, that's pretty good, and this would be a reasonable way of choosing the number of clusters.

이 곡선을 보면 "이 그래프에 명확한 엘보가 있는 것 같습니다."라고 말할 수 있습니다. 그래프의 모양이 사람의 팔과 유사하다고 상상할 수 있니다. 팔을 뻗었다고 상상한다면 맨 왼쪽 상단이 어깨 관절이고, K=3의 그래프가 엘보(팔꿈치)이고, K=8의 그래프가 손입니다. 이것이 엘보 메쏘드입니다. K가 1에서 2로, 2에서 3으로 빠르게 하강하고 K=3에서 왜곡이 매우 천천히 하강합니다. 마치 3개의 클러스터를 사용하는 것이 올바른 것 같습니다. 이 부분이 곡선의 팔꿈치이기 때문입니다. 왜곡은 K = 3까지 빠르게 줄어들고 그 후에는 완만하게 하강합니다. 그러므로 K=3을 선택합니다. 엘보 메쏘드를 적용하고 실제로 이런 그래프를 얻는다면 매우 좋은 결과입니다. 이것이 클러스터의 수를 결정하는 합리적인 방법입니다.

It turns out the Elbow Method isn't used that often, and one reason is that, if you actually use this on a clustering problem, it turns out that fairly often, you know, you end up with a curve that looks much more ambiguous, maybe something like this. And if you look at this, I don't know, maybe there's no clear elbow, but it looks like distortion continuously goes down, maybe 3 is a good number, maybe 4 is a good number, maybe 5 is also not bad. And so, if you actually do this in a practice, you know, if your plot looks like the one on the left and that's great. It gives you a clear answer, but just as often, you end up with a plot that looks like the one on the right and is not clear where the ready location of the elbow is. It makes it harder to choose a number of clusters using this method. So maybe the quick summary of the Elbow Method is that is worth the shot but I wouldn't necessarily, you know, have a very high expectation of it working for any particular problem.

그러나 실제로 엘보 메쏘드는 자주 사용하는 방법은 아닙니다. 한 가지 이유는 클러스터링 문제에 실제로 적용해 보면 상당히 모호한 곡선으로 끝납니다. 오른쪽 그래프를 보시면 분명한 팔꿈치가 없을 수도 있지만 계속해서 하강하는 것 같습니다. 아마도 K = 3은 좋은 숫자, K = 4는 좋은 숫자, K = 5도 나쁘지 않습니다. 그래서 실제로 사용할 때 왼쪽 그래프와 같다면 훌륭합니다. 왜냐하면 명확한 답을 제공하기 때문입니다. 만일 오른쪽 그래프와 같은 결과를 얻는다면, 팔꿈치의 위치가 명확하지 않습니다. 이런 경우 클러스터의 수를 선택하기 더 어려워집니다. 따라서, 엘보 메쏘드는 시도할 가치가 있지만 특정 문제에서 제대로 동작하지 않을 수 있습니다.

Finally, here's one other way of how, thinking about how you choose the value of K, very often people are running K-means in order you get clusters for some later purpose, or for some sort of downstream purpose. Maybe you want to use K-means in order to do market segmentation, like in the T-shirt sizing example that we talked about. Maybe you want K-means to organize a computer cluster better, or maybe a learning cluster for some different purpose, and so, if that later, downstream purpose, such as market segmentation. If that gives you an evaluation metric, then often, a better way to determine the number of clusters, is to see how well different numbers of clusters serve that later downstream purpose.

마지막으로 K 값을 선택하는 또 다른 방법이 있습니다. 나중이나 일종의 후속 작업을 위해 K-평균 알고리즘을 활용하는 경우가 많습니다. 시장 세분화를 하거나 티셔츠의 크기를 조정하거나 컴퓨터 클러스터를 조정하고 또는 기타 다른 목적으로 K-평균 알고리즘을 활용합니다. 평가 지표가 제공될 때 클러스터의 수를 결정하는 더 좋은 방법은 다른 수의 클러스터가 후속 작업에 얼마나 부합하는 지를 확인하는 것입니다.

Let me step through a specific example. Let me go through the T-shirt size example again, and I'm trying to decide, do I want three T-shirt sizes? So, I choose K equals 3, then I might have small, medium and large T-shirts. Or maybe, I want to choose K equals 5, and then I might have, you know, extra small, small, medium, large and extra large T-shirt sizes. So, you can have like 3 T-shirt sizes or four or five T-shirt sizes. We could also have four T-shirt sizes, but I'm just showing three and five here, just to simplify this slide for now. So, if I run K-means with K equals 3, maybe I end up with, that's my small and that's my medium and that's my large. Whereas, if I run K-means with 5 clusters, maybe I end up with, those are my extra small T-shirts, these are my small, these are my medium, these are my large and these are my extra large.

구체적인 예를 들어 봅시다. 다시 티셔츠 사이즈를 결정하는 예제로 돌아갑시다. 세 가지 크기의 티셔츠를 만들 것입니다. K = 3을 선택하고, S, M, L 클러스터를 찾습니다. 또는 K = 5를 선택하고, XS, S, M, L, XL 클러스터를 찾습니다. 따라서 3개 크기 분류 또는 4 또는 5 개의 티셔츠 크기 분류를 선택할 수 있습니다. 4개의 티셔츠 크기 분류를 선택할 수 있지만 여기서는 단순화하기 위해 K = 3과 K = 5만을 고려합니다. 그래서, K = 3일 때 K-평균 알고리즘을 실행한다면 이렇게 구분될 것입니다. 반면에 K = 5일 때 K-평균 알고리즘을 실행한다면 이렇게 구분될 것입니다.

And the nice thing about this example is that, this then maybe gives us another way to choose whether we want 3 or 4 or 5 clusters, and in particular, what you can do is, you know, think about this from the perspective of the T-shirt business and ask: "Well if I have five segments, then how well will my T-shirts fit my customers and so, how many T-shirts can I sell? How happy will my customers be?" What really makes sense, from the perspective of the T-shirt business, in terms of whether, I want to have Goer T-shirt sizes so that my T-shirts fit my customers better. Or do I want to have fewer T-shirt sizes so that I make fewer sizes of T-shirts. And I can sell them to the customers more cheaply. And so, the t-shirt selling business, that might give you a way to decide, between three clusters versus five clusters. So, that gives you an example of how a later downstream purpose like the problem of deciding what T-shirts to manufacture, how that can give you an evaluation metric for choosing the number of clusters.

이 예제의 좋은 점은 이것이 우리가 3 개 또는 4개 또는 5개의 클러스터를 원하는지 선택할 수 있는 또 다른 방법을 제공합니다. 여러분들이 티셔츠 사업을 한다면 다음과 같이 질문할 것입니다. "내가 5 개의 세그먼트를 가지고 있다면 내 티셔츠가 내 고객에게 얼마나 잘 맞을까요? 그래서 얼마나 많은 티셔츠를 팔 수 있을까요?" 고객은 얼마나 행복할까요? 티셔츠 사업 관점에서 볼 때 티셔츠가 고객들에게 더 잘 맞는 티셔츠 크기를 갖고 싶은 지 여부는 의미가 있습니다. 아니면 티셔츠 사이즈를 줄여서 더 싸게 고객에게 팔 수도 있습니다. 그래서 티셔츠 판매 사업은 3개의 클러스터와 5 개 클러스터 사이에서 결정하는 방법을 제공할 것입니다.

따라서 어떤 티셔츠를 제작할지를 결정하는 문제와 같은 후속 작업의 목적이 클러스터의 수를 선택하는 평가지표를 제공할 수 있습니다.

For those of you that are doing the program exercises, if you look at this week's program exercise associative K-means, that's an example there of using K-means for image compression. And so if you were trying to choose how many clusters to use for that problem, you could also, again use the evaluation metric of image compression to choose the number of clusters, K? So, how good do you want the image to look versus, how much do you want to compress the file size of the image, and, you know, if you do the programming exercise, what I've just said will make more sense at that time. So, just summarize, for the most part, the number of customers K is still chosen by hand by human input or human insight.

프로그래밍을 연습하고 있는 분들은 이번 주 프로그램 연습에서 K-평균 알고리즘 문제를 해결할 것입니다. 문제는 이미지 압축에 사용하는 예제입니다. 해당 문제에 사용할 클러스터의 수를 선택할 때 이미지 압축 평가 지표를 사용하여 클러스터의 수 K를 결정합니다. 후속 작업 목표는 이미지가 얼마나 잘 보이는 지, 이미지 파일의 크기를 얼마나 압축하고 싶은 지입니다. 프로그래밍 실습을 하면 방금 말한 내용이 더 의미가 있을 것입니다. 요약하자면 여전히 사람들은 대부분의 K의 값을 의견이나 통찰력으로 결정합니다.

One way to try to do so is to use the Elbow Method, but I wouldn't always expect that to work well, but I think the better way to think about how to choose the number of clusters is to ask, for what purpose are you running K-means? And then to think, what is the number of clusters K that serves that, you know, whatever later purpose that you actually run the K-means for.

이를 시도하는 한 가지 방법은 엘보 메쏘드를 사용하는 것입니다. 이것은 항상 잘 작동할 것이라고 기대하지 않습니다. 그러나 클러스터의 수를 선택하는 좋은 방법 중에 하나는 K-평균 알고리즘을 활용하는 후속 작업의 목적이 무엇인지 찾는 것입니다. 후속 작업의 목적에 부합하는 K-평균 알고리즘의 클러스터 K의 수는 얼마입니까?

앤드류 응의 머신러닝 동영상 강의

정리하며

K-평균 알고리즘은 레이블이 없는 데이터 셋을 자동으로 일정한 클러스터, 집합 또는 그룹으로 묶는 클러스터링 알고리즘 중의 하나입니다. K-평균 알고리즘의 동작 방식은 다음과 같습니다.

1) 클러스터의 수 K를 결정

일반적으로 클러스터 중심의 총 수 K는 훈련 예제의 총 수 m 보다 작게 설정합니다. 엘보 메쏘드는 클러스터의 총 수인 K를 변화에 따른 비용 함숫값의 변화를 확인합니다. K가 1에서 2로, 2에서 3으로 빠르게 하강하고 K=3에서 왜곡이 매우 천천히 하강합니다. 곡선이 마치 사람의 팔처럼 움직이며, 갑자기 완만해지는 시작하는 부분이 팔꿈치처럼 보입니다. 엘보 메쏘드는 시도할 가치가 있는 좋은 방법이지만 경우에 따라 제대로 동작하지 않습니다.

평가 지표가 제공될 때 클러스터의 수를 결정하는 더 좋은 방법은 다른 수의 클러스터가 후속 작업에 얼마나 부합하는 지를 확인하는 것입니다. 예를 들면, 티셔츠의 사이즈를 3개로 할지 5개로 할지를 결정하는 것은 쉽지 않습니다. 후속 작업에서 사람들에게 더 잘 맞는 옷을 만들 계획이라면 5개의 티셔츠 사이즈가 효과적이고, 값싼 티셔츠를 만들 계획이라면 공정을 줄이는 3개의 티셔츠 사이즈가 효과적입니다.

2) 클러스터 중심 위치를 무작위로 초기화

일반적으로 무작위로 K개의 학습 예제를 선택하여 초기화합니다. 하지만, 운이 나쁠 경우 K-평균 알고리즘은 전역 최적 값이 아닌 로컬 최적 값에 도달할 수 있습니다. 이를 해결하는 방법은 랜덤 초기화를 여러 번 시도하는 것입니다. 초기화를 여러 번 수행할수록 좋은 최소값을 찾을 수 있습니다.

K-평균 알고리즘이 K의 값이 2에서 10 사이의 값이라면 잦은 랜덤 초기화는 더 나은 클러스터링을 찾을 것입니다. 그러나, K의 값이 10 보다 훨씬 큰 수인 수 백개의 클러스터를 찾는다면 여러 번의 랜덤 초기화는 큰 차이를 만들기 어렵습니다

3) 클러스터링 할당 단계

모든 학습 예제가 가장 가까운 클러스터를 결정합니다.

하나의 학습 예제와 다수의 클러스터 중심까지의 거리를 측정한 후에 가장 가까운 클러스터를 선택합니다. 대문자 K는 클러스터 또는 클러스터 중심의 총 수이고 소문자 k는 클러스터 중심에 대한 인덱스이며 1부터 k 사이의 숫자입니다. C^(i)는 x^(i)와 클러스터 중심 사이의 거리가 가장 짧은 클러스터 중심을 말합니다. 예를 들어, x^(1)가 클러스 클러스터 중심 μ1, μ2, μ3, μ4 중에서 μ2와의 거리가 제일 작다면, C^(1) = 2입니다.

4) 클러스터 중심 이동 단계

클러스터 별로 학습 예제들의 평균값을 계산하고 새로운 클러스터 중심으로 이동하는 단계입니다. 각 평균을 계산하는 공식은 다음과 같습니다.

K-평균 알고리즘도 비용 함수가 있습니다. 최적화 목표를 통해 K-평균 알고리즘이 올바르게 실행되는지를 확인하고 디버깅에 도움이 됩니다. 또한, 로컬 최적화를 피하고 더 나은 최적 값을 찾을 수 있습니다. K-평균 알고리즘은 세 개의 변수를 추적합니다.

C^(i)는 학습 예제 x^(i)에 할당된 클러스터 번호 또는 인덱스

μk는 클러스터 중심에 대한 인덱스

μc^(i)는 학습 예제 x^(i)가 할당된 클러스터의 클러스터 중심

예를 들면 x^(i)에 클러스터 5를 할당했다고 가정합니다. C^(i) = 5이므로 μc^(i) = μ5입니다. 결국 x^(i)가 클러스터 5에 할당된다는 의미입니다. 비용 함수 J를 최소화하기 위한 C와 μ를 찾습니다. 비용 함수는 왜곡된 비용 함수 또는 K-평균 알고리즘의 왜곡이라고 부릅니다.

5) 수렴할 때까지 반복 수행

클러스터링 할당 단계와 클러스터링 중심 이동 단계를 반복 수행합니다. 클러스터 중심이 변하지 않고 학습 예제의 클러스터도 변하지 않을 때까지 반복하고 이것을 수렴한다고 합니다.