brunch

You can make anything
by writing

C.S.Lewis

by 라인하트 Nov 23. 2020

앤드류 응의 머신러닝(13-4): K-평균 랜덤 초기화

온라인 강의 플랫폼 코세라의 창립자인 앤드류 응 (Andrew Ng) 교수는 인공지능 업계의 거장입니다. 그가 스탠퍼드 대학에서 머신 러닝 입문자에게 한 강의를 그대로 코세라 온라인 강의 (Coursera.org)에서 무료로 배울 수 있습니다. 이 강의는 머신러닝 입문자들의 필수코스입니다. 인공지능과 머신러닝을 혼자 공부하면서 자연스럽게 만나게 되는 강의입니다.

Unsupervised Learning

비지도 학습

Clustering (클러스터링)

Random Initialization (랜덤 초기화)

In this video, I'd like to talk about how to initialize K-means and more importantly, this will lead into a discussion of how to make K-means avoid local optima as well.

이번 강의는 K-평균을 초기화하는 방법에 대해 설명합니다. 이것이 K-평균 알고리즘이 로컬 최적화를 피하는 방법에 대한 논의로 이어질 것입니다.

Here's the K-means clustering algorithm that we talked about earlier. One step that we never really talked much about was this step of how you randomly initialize the cluster centroids. There are few different ways that one can imagine using to randomly initialize the cluster centroids. But, it turns out that there is one method that is much more recommended than most of the other options one might think about. So, let me tell you about that option since it's what often seems to work best.

여기 전에 언급했던 K-평균 클러스터링 알고리즘입니다. 지금까지 다루지 않은 것은 클러스터링 중심을 무작위로 초기화하는 단계입니다. 클러스터 중심을 무작위로 초기화할 때 사용할 수 있는 방법은 몇 가지가 있습니다. 생각할 수 있는 여러 방법보다 더 추천하는 방법이 하나 있습니다. 우선 가장 잘 작동하는 것처럼 보이는 옵션을 먼저 설명하겠습니다.

Here's how I usually initialize my cluster centroids. When running K-means, you should have the number of cluster centroids, K, set to be less than the number of training examples M. It would be really weird to run K-means with a number of cluster centroids that's, you know, equal or greater than the number of examples you have, right?

여기 일반적으로 클러스터 중심을 초기화하는 방법이 있습니다. K-평균 알고리즘을 실행할 때 클러스터 중심의 총 수 K는 훈련 예제의 총 수 m 보다 작게 설정해야 합니다. K-평균 알고리즘이 클러스터 중심의 총 수가 학습 예제의 수와 같거나 더 크다면 매우 이상합니다.

So the way I usually initialize K-means is, I would randomly pick k training examples. So, and, what I do is then set Mu1 of MuK equal to these k examples. Let me show you a concrete example. Lets say that k is equal to 2 and so on this example on the right let's say I want to find two clusters. So, what I'm going to do in order to initialize my cluster centroids is, I'm going to randomly pick a couple examples. And let's say, I pick this one and I pick that one. And the way I'm going to initialize my cluster centroids is, I'm just going to initialize my cluster centroids to be right on top of those examples. So that's my first cluster centroid and that's my second cluster centroid, and that's one random initialization of K-means.

일반적으로 K-평균 알고리즘을 초기화하는 방법은 무작위로 K개의 학습 예제를 선택하는 것입니다. 클러스터 중심 μ1부터 μk의 값을 학습 예제와 동일하게 설정합니다. 구체적인 예를 들면, k = 2와 같고 오른쪽 데이터 셋에서 두 개의 클러스터를 찾는다고 가정합니다. 클러스터 중심을 초기화하기 위해 무작위로 두 학습 예제를 선택합니다. 빨간색 십자가와 파란색 십자가로 표시합니다. 클러스터 중심을 초기화하는 작업은 클러스터 중심이 두 개의 학습 예제와 동일한 값이 되게 합니다. 빨간색 십자가는 첫 번째 클러스터 중심이고, 파란색 십자가는 두 번째 클러스터 중심입니다. 이것이 K-평균 알고리즘이 랜덤 초기화하는 방법 중 하나입니다. 임의로 선택한 두 클러스터 중심이 좋은 위치에 있습니다.

The one I drew looks like a particularly good one. And sometimes I might get less lucky and maybe I'll end up picking that as my first random initial example, and that as my second one. And here I'm picking two examples because k equals 2. Some we have randomly picked two training examples and if I chose those two then I'll end up with, may be this as my first cluster centroid and that as my second initial location of the cluster centroid. So, that's how you can randomly initialize the cluster centroids. And so at initialization, your first cluster centroid Mu1 will be equal to x(i) for some randomly value of i and Mu2 will be equal to x(j) for some different randomly chosen value of j and so on, if you have more clusters and more cluster centroid.

우측 상단의 다이어그램은 운이 좋은 상황입니다. 때때로 운이 좋지 않을 수 있습니다. 첫 번째 무작위 초기 예제를 선택하고 두 번째 예제도 무작위로 선택합니다. 여기도 k = 2이기 때문에 두 개의 예제를 선택합니다. 빨간색 십자가는 첫 번째 예제이고, 파란색 십자가는 두 번째 예제입니다. 이것이 클러스터 중심을 무작위로 초기화하는 방법입니다. 따라서, 첫 번째 클러스터 중심은 μ1은 x^(i)와 값이 같고, 두 번째 클러스터 중심은 μ2는 특정 x^(j)와 값이 같습니다.

And sort of the side common. I should say that in the earlier video where I first illustrated K-means with the animation. In that set of slides. Only for the purpose of illustration. I actually used a different method of initialization for my cluster centroids. But the method described on this slide, this is really the recommended way. And the way that you should probably use, when you implement K-means. So, as they suggested perhaps by these two illustrations on the right. You might really guess that K-means can end up converging to different solutions depending on exactly how the clusters were initialized, and so, depending on the random initialization. K-means can end up at different solutions. And, in particular, K-means can actually end up at local optima.

K-평균 알고리즘은 랜덤 초기화를 할 때 약간의 문제가 있습니다. K-평균 알고리즘을 애니메이션으로 처음 설명했던 강의에서 이야기했어야 했습니다. 그림의 목적은 실제로 클러스터 중심을 초기화하는 다른 방법을 사용했습니다. 그러나 지금까지 설명하고 위의 그림에서 사용했던 방법은 K-평균 알고리즘을 구현할 때 실제로 권장하는 방법입니다. 실제로 클러스터가 정확히 어떻게 초기화되는지에 따라 K-평균 알고리즘이 다른 해결책으로 수렴될 수도 있습니다. 즉, 랜덤 초기화로 인해 K-평균 알고리즘은 다른 해결책을 제시할 수 있습니다. 즉, K-평균 알고리즘은 로컬 최적 값에 도달할 수 있습니다.

If you're given the data sale like this. Well, it looks like, you know, there are three clusters, and so, if you run K-means and if it ends up at a good local optima this might be really the global optima, you might end up with that cluster ring. But if you had a particularly unlucky, random initialization, K-means can also get stuck at different local optima.

여기 데이터 셋이 있습니다. 3 개의 클러스터로 그룹화할 수 있습니다. 실제로 K-평균 알고리즘을 실행한다면 괜찮은 로컬 최적 값에 도달할 수 있고 우측 상단의 그림처럼 3개의 클러스터로 잘 분리할 것입니다. 그러나 랜덤 초기화가 운이 좋지 않을 때 K-평균 알고리즘은 로컬 최적 값에서 멈출 수 있습니다.

So, in this example on the left it looks like this blue cluster has captured a lot of points of the left and then the they were on the green clusters each is captioned on the relatively small number of points. And so, this corresponds to a bad local optima because it has basically taken these two clusters and used them into 1 and furthermore, has split the second cluster into two separate sub-clusters like so, and it has also taken the second cluster and split it into two separate sub-clusters like so, and so, both of these examples on the lower right correspond to different local optima of K-means and in fact, in this example here, the cluster, the red cluster has captured only a single optima example.

왼쪽 하단의 그림은 파란색 클러스터가 왼쪽의 많은 데이터 그룹화하고, 녹색 클러스터와 빨간색 클러스터는 상대적으로 적은 수의 데이터를 그룹화한 것처럼 보입니다. 즉, 두 개의 클러스터를 마치 하나의 클러스터처럼 분할하고 두 번째 클러스터를 다시 두 개의 클러스터로 분할했기 때문에 나쁜 로컬 최적화입니다. 오른쪽 하단의 그림도 마찬가지로 나쁜 로컬 최적화에 해당합니다. 오른쪽 하단의 빨간색 클러스터는 단 하나의 학습 예제에 최적화한 사례입니다.

And the term local optima, by the way, refers to local optima of this distortion function J, and what these solutions on the lower left, what these local optima correspond to is really solutions where K-means has gotten stuck to the local optima and it's not doing a very good job minimizing this distortion function J. So, if you're worried about K-means getting stuck in local optima, if you want to increase the odds of K-means finding the best possible clustering, like that shown on top here, what we can do, is try multiple, random initializations. So, instead of just initializing K-means once and hopping that that works, what we can do is, initialize K-means lots of times and run K-means lots of times, and use that to try to make sure we get as good a solution, as good a local or global optima as possible.

그런데 로컬 최적 값은 왜곡 함수 J(C^(1), C^(2),... , C^(m), μ1, μ2,... , μk)의 로컬 최적 값을 의미합니다. 왼쪽 아래에 있는 해결책은 K-평균 알고리즘이 로컬 최적 값에 고착된 것입니다. 이 왜곡 함수 J를 최소화하는 것은 좋지 않습니다. 따라서, K-평균 알고리즘이 로컬 최적 값에 갇히지 않게 하고 최적의 클러스터링을 찾을 확률을 높이려면 다음과 같이 해야 합니다. 우선 여러 번의 랜덤 초기화를 시도하는 것입니다. K-평균 알고리즘을 한번 초기화하고 실행하는 것이 아니라 K-평균 알고리즘을 많이 초기화하고 여러 번 실행하여 가능한 좋은 로컬 또는 전역 최적 값을 가진 해결책을 찾는 것입니다.

Concretely, here's how you could go about doing that. Let's say, I decide to run K-meanss a hundred times so I'll execute this loop a hundred times and it's fairly typical a number of times when came to will be something from 50 up to may be 1000. So, let's say you decide to say K-means one hundred times. So what that means is that we would randomnly initialize K-means. And for each of these one hundred random intializations we would run K-means and that would give us a set of clusteringings, and a set of cluster centroids, and then we would then compute the distortion J, that is compute this cost function on the set of cluster assignments and cluster centroids that we got.

구체적으로 여기 자세한 방법이 있습니다. 예를 들어, K-평균 알고리즘을 100번 실행하기로 결정했다면 For 루프를 100번 실행합니다. 일반적으로 50번에서 1,000번까지 반복합니다. K-평균 알고리즘을 무작위로 100번 초기화합니다. 클러스터링 셋 C^(i), C^(2),..., C^(m)과 클러스터 중심 셋 μ1, μ2,... , μk를 제공한 후 왜곡 함수 J(C^(1), C^(2),... , C^(m), μ1, μ2,... , μk)를 계산합니다. 클러스터 할당과 클러스터 중심 이동을 하는 비용 함수를 계산하는 것입니다.

Finally, having done this whole procedure a hundred times. You will have a hundred different ways of clustering the data and then finally what you do is all of these hundred ways you have found of clustering the data, just pick one, that gives us the lowest cost. That gives us the lowest distortion. And it turns out that if you are running K-means with a fairly small number of clusters , so you know if the number of clusters is anywhere from two up to maybe 10 - then doing multiple random initializations can often, can sometimes make sure that you find a better local optima. Make sure you find the better clustering data. But if K is very large, so, if K is much greater than 10, certainly if K were, you know, if you were trying to find hundreds of clusters, then, having multiple random initializations is less likely to make a huge difference and there is a much higher chance that your first random initialization will give you a pretty decent solution already and doing, doing multiple random initializations will probably give you a slightly better solution but, but maybe not that much. But it's really in the regime of where you have a relatively small number of clusters, especially if you have, maybe 2 or 3 or 4 clusters that random initialization could make a huge difference in terms of making sure you do a good job minimizing the distortion function and giving you a good clustering.

마지막으로 모든 절차를 백번 실행합니다. 데이터를 클러스터링 하는 수백 가지 방법이 있고 마지막으로 모든 방법을 찾았습니다. 그중에 하나만 선택하면 비용이 가장 저렴합니다. 그것이 가장 낮은 왜곡 값입니다. K-평균 알고리즘이 적은 수의 클러스터를 찾는다면, K는 2에서 10 사이의 값이라면, 잦은 반복은 더 나은 로컬 최적 값이나 더 나은 클러스터링 데이터를 찾을 수 있습니다. 그러나, K의 값이 매우 크다면, K가 10 보다 훨씬 큰 수 백개의 클러스터를 찾는다면 여러 번의 랜덤 초기화는 큰 차이를 만들기 어렵습니다. 첫 번째 랜덤 초기화가 이미 괜찮은 해결책을 제공할 가능성이 훨씬 더 높습니다. 여러 번의 랜덤 초기화를 하더라도 약간 더 나은 해결책을 얻을 뿐일 것입니다. 하지만, 실제로 상대적으로 적은 수의 클러스터가 있는 영역이 있습니다. 특히, 2개 , 3개 또는 4개의 클러스터를 찾는 경우 반복적인 랜덤 초기화는 왜곡을 최소화할 수 있습니다. 좋은 클러스터 링을 제공할 수 있습니다.

So, that's K-means with random initialization. If you're trying to learn a clustering with a relatively small number of clusters, 2, 3, 4, 5, maybe, 6, 7, using multiple random initializations can sometimes, help you find much better clustering of the data. But, even if you are learning a large number of clusters, the initialization, the random initialization method that I describe here. That should give K-means a reasonable starting point to start from for finding a good set of clusters.

지금까지 랜덤 초기화에 대한 K-평균 알고리즘이었습니다. 비교적 적은 수의 클러스터를 찾는 경우에 아마도 2,3,4,5,6,7 개라면 여러 번 랜덤 초기화를 사용하여 훨씬 더 나은 클러스터링을 찾는 데 도움이 될 것입니다. 그러나 많은 수의 클러스터를 찾더라도 여기서 설명한 랜덤 초기화 방법을 사용합니다. 그러면 K-평균 알고리즘은 클러스터링을 찾기 위해 시작할 수 있는 합리적인 시작점을 찾을 수 있습니다.

앤드류 응의 머신러닝 동영상 강의

정리하며

K-평균 알고리즘은 레이블이 없는 데이터 셋을 자동으로 일정한 클러스터, 집합 또는 그룹으로 묶는 클러스터링 알고리즘 중의 하나입니다. K-평균 알고리즘의 운영 방식은 다음과 같습니다.

1) 클러스터의 수를 결정

일반적으로 클러스터 중심의 총 수 K는 훈련 예제의 총 수 m 보다 작게 설정합니다.

2) 클러스터 중심 위치를 무작위로 초기화

일반적으로 무작위로 K개의 학습 예제를 선택하여 초기화합니다. 하지만, 운이 나쁠 경우 K-평균 알고리즘은 전역 최적 값이 아닌 로컬 최적 값에 도달할 수 있습니다. 이를 해결하는 방법은 랜덤 초기화를 여러 번 시도하는 것입니다. 초기화를 여러 번 수행할수록 좋은 최소값을 찾을 수 있습니다.

K-평균 알고리즘이 K의 값이 2에서 10 사이의 값이라면 잦은 랜덤 초기화는 더 나은 클러스터링을 찾을 것입니다. 그러나, K의 값이 10 보다 훨씬 큰 수인 수 백개의 클러스터를 찾는다면 여러 번의 랜덤 초기화는 큰 차이를 만들기 어렵습니다

3) 클러스터링 할당 단계

모든 학습 예제가 가장 가까운 클러스터를 결정합니다.

하나의 학습 예제와 다수의 클러스터 중심까지의 거리를 측정한 후에 가장 가까운 클러스터를 선택합니다. 대문자 K는 클러스터 또는 클러스터 중심의 총 수이고 소문자 k는 클러스터 중심에 대한 인덱스이며 1부터 k 사이의 숫자입니다. C^(i)는 x^(i)와 클러스터 중심 사이의 거리가 가장 짧은 클러스터 중심을 말합니다. 예를 들어, x^(1)가 클러스 클러스터 중심 μ1, μ2, μ3, μ4 중에서 μ2와의 거리가 제일 작다면, C^(1) = 2입니다.

4) 클러스터 중심 이동 단계

클러스터 별로 학습 예제들의 평균값을 계산하고 새로운 클러스터 중심으로 이동하는 단계입니다.

합니다. 각 평균을 계산하는 공식은 다음과 같습니다.

K-평균 알고리즘도 비용 함수가 있습니다. 최적화 목표를 통해 K-평균 알고리즘이 올바르게 실행되는지를 확인하고 디버깅에 도움이 됩니다. 또한, 로컬 최적화를 피하고 더 나은 최적 값을 찾을 수 있습니다. K-평균 알고리즘은 세 개의 변수를 추적합니다.

C^(i)는 학습 예제 x^(i)에 할당된 클러스터 번호 또는 인덱스

μk는 클러스터 중심에 대한 인덱스

μc^(i)는 학습 예제 x^(i)가 할당된 클러스터의 클러스터 중심

예를 들면 x^(i)에 클러스터 5를 할당했다고 가정합니다. C^(i) = 5이므로 μc^(i) = μ5입니다. 결국 x^(i)가 클러스터 5에 할당된다는 의미입니다. 비용 함수 J를 최소화하기 위한 C와 μ를 찾습니다. 비용 함수는 왜곡된 비용 함수 또는 K-평균 알고리즘의 왜곡이라고 부릅니다.

5) 수렴할 때까지 반복 수행

클러스터링 할당 단계와 클러스터링 중심 이동 단계를 반복 수행합니다. 클러스터 중심이 변하지 않고 학습 예제의 클러스터도 변하지 않을 때까지 반복하고 이것을 수렴한다고 합니다.