by 라인하트 Feb 08. 2021

Machine Learning Octave Exercise (7-2): K-Means Clustering

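Professor Andrew Ng, co-founder of the online learning platform Coursera, is a leading figure in the AI field. The machine learning course he taught to beginners at Stanford University is available for free on Coursera (Coursera.org). It is an essential course for newcomers to machine learning, and one you naturally come across when studying AI and machine learning on your own.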

Programming Exercise 7:

K-means Clustering and Principal Component Analysis

1. K-means Clustering

1.2 K-means on example dataset

After you have completed the two functions (findClosestCentroids and computeCentroids), the next step in ex7.m will run the K-means algorithm on a toy 2D dataset to help you understand how K-means works. Your functions are called from inside the runkMeans.m script. We encourage you to take a look at the function to understand how it works. Notice that the code calls the two functions you implemented in a loop.

When you run the next step, the K-means code will produce a visualization that steps you through the progress of the algorithm at each iteration. Press enter multiple times to see how each step of the K-means algorithm changes the centroids and cluster assignments. At the end, your figure should look like the one displayed in Figure 1.

In short: with findClosestCentroids.m and computeCentroids.m complete, ex7.m runs the K-means algorithm on a toy 2D dataset so you can see how K-means works. ex7.m calls runkMeans.m, which in turn calls those two functions in a loop.

In the next step, the K-means code visualizes the algorithm's progress at each iteration. Press Enter repeatedly to see how each step changes the centroids and the cluster assignments. The final figure should look like Figure 1.
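The loop described above can be sketched on a tiny dataset. This is a minimal illustration, not the course's runkMeans.m: the data values, the starting centroids, and the inline distance code are all assumptions made for the example, standing in for findClosestCentroids and computeCentroids.

```octave
% Made-up toy data: two obvious groups in 2D (not from ex7data2.mat).
X = [1 1; 1.2 0.8; 0.9 1.1; 5 5; 5.1 4.9; 4.8 5.2];
centroids = [0 0; 6 6];          % hypothetical starting centroids
K = size(centroids, 1);
max_iters = 5;

for iter = 1:max_iters
  % Assignment step (the role of findClosestCentroids):
  % squared distance from every example to every centroid.
  dists = zeros(size(X, 1), K);
  for k = 1:K
    diffs = X - repmat(centroids(k, :), size(X, 1), 1);
    dists(:, k) = sum(diffs .^ 2, 2);
  end
  [~, idx] = min(dists, [], 2);  % idx(i) = index of the nearest centroid

  % Update step (the role of computeCentroids):
  % move each centroid to the mean of the examples assigned to it.
  for k = 1:K
    centroids(k, :) = mean(X(idx == k, :), 1);
  end
end

disp(idx');
disp(centroids);
```

With this data every cluster keeps at least one example, so the mean is always defined; a cluster that loses all its examples would need separate handling.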

<Part 3 : K-Means Clustering>

(1) Loading the data

clear ; % remove all variables from the Octave workspace

close all; % close all figure windows

clc % clear the terminal screen

load('ex7data2.mat');

(2) Setting the variables for the K-means algorithm

K = 3;

max_iters = 10;

initial_centroids = [ 3 3; 6 2; 8 5];

(3) Analyzing the runkMeans.m file

function [centroids, idx] = runkMeans(X, initial_centroids, max_iters, plot_progress)

%RUNKMEANS runs the K-means algorithm on the data matrix X

% [centroids, idx] = RUNKMEANS(X, initial_centroids, max_iters, ...

% plot_progress)

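% initial_centroids : initial values of the cluster centroids

% max_iters : number of K-means iterations to run

% plot_progress : whether to plot the centroids and pause at each iteration

% runkMeans returns centroids and idx

% centroids is a K x n matrix and idx is an m x 1 vector

% Set the default value for plot_progress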

if ~exist('plot_progress', 'var') || isempty(plot_progress)

plot_progress = false;

end

% Prepare a figure for plotting the dataset

if plot_progress

figure;

hold on;

end

% Initialize values

[m n] = size(X);

K = size(initial_centroids, 1); % K is the number of initial centroids

centroids = initial_centroids; % copy the 3 initial centroids into centroids

previous_centroids = centroids; % keep a copy so the movement can be drawn

idx = zeros(m, 1);

idx = zeros(m, 1);

% Run K-means

for i=1:max_iters

fprintf('K-Means iteration %d/%d...\n', i, max_iters);

if exist('OCTAVE_VERSION')

fflush(stdout);

end

% Assign each example in X to its closest centroid and record the index

idx = findClosestCentroids(X, centroids);

% Plot the progress (when plot_progress is true)

if plot_progress

plotProgresskMeans(X, centroids, previous_centroids, idx, K, i);

previous_centroids = centroids;

fprintf('Press enter to continue.\n');

pause;

end

% Compute the new centroids

centroids = computeCentroids(X, idx, K);

end

% Nothing more to draw in the figure window

if plot_progress

hold off;

end

Let's analyze the first if statement and work out what happens for each value of plot_progress.

if ~exist('plot_progress', 'var') || isempty(plot_progress)

plot_progress = false;

end

We can trace the values at the prompt. When ex7.m calls runkMeans.m, it sets plot_progress to true.

In that case ~exist('plot_progress', 'var') evaluates to 0.

>> plot_progress = true

plot_progress = 1

>> plot_progress

plot_progress = 1

>> exist('plot_progress')

ans = 1

>> exist('plot_progress','var')

ans = 1

>> ~exist('plot_progress','var')

ans = 0

isempty(plot_progress) also evaluates to 0.

>> plot_progress = true

plot_progress = 1

>> isempty(plot_progress)

ans = 0

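Logical OR is written || or |. The result is 0 only when both operands are 0. The || form short-circuits: when the left operand is true, the right operand is not evaluated, which is why isempty(plot_progress) is never called on a variable that does not exist.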

>> 1||1

ans = 1

>> 1||0

ans = 1

>> 0||1

ans = 1

>> 0||0

ans = 0

So the statement below leaves plot_progress unchanged when the caller passed a value, and sets it to false when the argument is missing or empty.

if ~exist('plot_progress', 'var') || isempty(plot_progress)

plot_progress = false;

end
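The same default-value idiom can be tried in isolation. The sketch below is a hypothetical helper written only for this illustration (resolve_flag is not part of the exercise files), and it assumes Octave, where a function may be defined inside a script.

```octave
1;  % script marker so Octave does not treat this file as a function file

% Hypothetical helper: returns the flag, defaulting to false.
function out = resolve_flag(plot_progress)
  % The left operand is true when the argument was omitted; || then skips
  % isempty(), so the missing variable is never touched.
  if ~exist('plot_progress', 'var') || isempty(plot_progress)
    plot_progress = false;
  end
  out = plot_progress;
end

disp(resolve_flag());      % argument omitted
disp(resolve_flag([]));    % argument passed but empty
disp(resolve_flag(true));  % argument passed explicitly
```

Passing [] is a common way to say "use the default" for an argument in the middle of an argument list, which is why the check covers the empty case as well as the missing one.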

When plot_progress is true (1), you have to press a key after every iteration.

>>[centroids, idx] = runkMeans(X, initial_centroids, max_iters, true);

K-Means iteration 1/10...

Press enter to continue.

K-Means iteration 2/10...

Press enter to continue.

K-Means iteration 3/10...

Press enter to continue.

K-Means iteration 4/10...

Press enter to continue.

K-Means iteration 5/10...

Press enter to continue.

K-Means iteration 6/10...

Press enter to continue.

K-Means iteration 7/10...

Press enter to continue.

K-Means iteration 8/10...

Press enter to continue.

K-Means iteration 9/10...

Press enter to continue.

K-Means iteration 10/10...

Press enter to continue.

When plot_progress is false, all 10 iterations run without pausing.

>> [centroids, idx] = runkMeans(X, initial_centroids, max_iters);

K-Means iteration 1/10...

K-Means iteration 2/10...

K-Means iteration 3/10...

K-Means iteration 4/10...

K-Means iteration 5/10...

K-Means iteration 6/10...

K-Means iteration 7/10...

K-Means iteration 8/10...

K-Means iteration 9/10...

K-Means iteration 10/10...

The next statement opens a figure window when plot_progress is true; hold on keeps the figure so that later plot calls add to it instead of replacing it.

if plot_progress

figure;

hold on;

end

When ex7.m calls runkMeans.m, max_iters is set to 10.

for i=1:max_iters

fprintf('K-Means iteration %d/%d...\n', i, max_iters);

if exist('OCTAVE_VERSION')

fflush(stdout);

end

The fprintf call prints the current iteration number over the maximum number of iterations.

fprintf('K-Means iteration %d/%d...\n', i, max_iters);

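OCTAVE_VERSION is a built-in that returns the version of the Octave program. runkMeans.m passes the name as a quoted string, exist('OCTAVE_VERSION'), which is nonzero in Octave and 0 in MATLAB, so fflush(stdout) runs only under Octave. The unquoted call shown last below instead asks whether something named '3.8.0' exists, which is why it returns 0.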

>> OCTAVE_VERSION

ans = 3.8.0

>> exist(OCTAVE_VERSION)

ans = 0

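The fflush() function flushes a stream's buffer. Output is collected in a buffer because writing it in batches is more efficient; flushing forces whatever is pending to be written out right away.

Octave documents fflush as flushing output to a file identifier, so it is meant for output streams such as stdout rather than for stdin.

fflush(stdout) does not discard anything: it writes the buffered output to the screen immediately, so each iteration message appears before the pause.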
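The guard and the flush can be tried together in a few lines. This is a sketch that assumes it is run under Octave; the 'builtin' argument to exist and the variable name is_octave are choices made for this example, not something taken from runkMeans.m.

```octave
% OCTAVE_VERSION exists only in Octave, so this flag is false in MATLAB.
is_octave = exist('OCTAVE_VERSION', 'builtin') ~= 0;

max_iters = 3;
for i = 1:max_iters
  fprintf('K-Means iteration %d/%d...\n', i, max_iters);
  if is_octave
    fflush(stdout);   % write the pending line to the terminal now
  end
end
```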

(4) Analyzing the plotProgresskMeans.m file

function plotProgresskMeans(X, centroids, previous, idx, K, i)

%PLOTPROGRESSKMEANS displays the progress of K-means on a 2D plane

% PLOTPROGRESSKMEANS(X, centroids, previous, idx, K, i) plots the data

% points colored by the centroid they are assigned to

% and draws a line between each previous and current centroid location

% Plot the data matrix X

plotDataPoints(X, idx, K);

% Plot the centroids as black x markers

plot(centroids(:,1), centroids(:,2), 'x', 'MarkerEdgeColor','k', ...

'MarkerSize', 5, 'LineWidth', 5);

% Connect each centroid to its previous location with a line

for j=1:size(centroids,1)

drawLine(centroids(j, :), previous(j, :));

end

% Show the title

title(sprintf('Iteration number %d', i))

end

To show how the centroids move at each iteration, the function plots the data matrix X by cluster and then draws the path each centroid has travelled.

>> centroids

centroids =

1.9540 5.0256

3.0437 1.0154

6.0337 3.0005

With K = 3 centroids, the centroids variable is a 3 x 2 matrix. The plotDataPoints(X, idx, K) function, which colors the data in X by cluster, is analyzed later.

The three current centroids are drawn as black x markers. plot() has been covered many times already, so its analysis is skipped here.

plot(centroids(:,1), centroids(:,2), 'x', 'MarkerEdgeColor','k', ...

'MarkerSize', 5, 'LineWidth', 5);

drawLine.m then connects each current centroid to its previous position (previous_centroids in runkMeans.m) with a line.

for j=1:size(centroids,1)

drawLine(centroids(j, :), previous(j, :));

end

(5) Analyzing the drawLine.m file

function drawLine(p1, p2, varargin)

%DRAWLINE draws a line connecting p1 and p2

% DRAWLINE(p1, p2) draws a line from p1 to p2 in the current figure

plot([p1(1) p2(1)], [p1(2) p2(2)], varargin{:});

end

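Each row of the current centroids and the matching row of previous is passed in, and a line is drawn between the two points. Any extra arguments in varargin are forwarded to plot as style options. Both plot and line can be used to draw lines.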
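To see what drawLine hands to plot, the sketch below uses an anonymous function with the same plot call as a stand-in for drawLine.m. The "current" centroid values are made up for illustration, and the hidden figure is an assumption so the example can run without opening a window.

```octave
% Stand-in for drawLine.m: same plot call, wrapped in an anonymous function.
draw_line = @(p1, p2, varargin) plot([p1(1) p2(1)], [p1(2) p2(2)], varargin{:});

previous = [3 3; 6 2; 8 5];              % initial centroids used in ex7.m
current  = [2.4 3.2; 5.8 2.6; 7.1 3.6];  % made-up positions after one update

figure('visible', 'off');   % assumption: draw off-screen
hold on;
h = zeros(size(current, 1), 1);
for j = 1:size(current, 1)
  % Extra arguments pass through varargin to plot as style options.
  h(j) = draw_line(current(j, :), previous(j, :), 'k-', 'LineWidth', 2);
end
hold off;

% Each line stores x-coordinates [p1(1) p2(1)] and y-coordinates [p1(2) p2(2)].
disp(get(h(1), 'XData'));
disp(get(h(1), 'YData'));
```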

(6) Summary of line-drawing functions

Octave/MATLAB has several functions for drawing lines. plot(x, y) draws a two-dimensional line graph.

>> x = [ 1 2 3 4 5];

>> y = [5 6 10 12 5];

>> plot(x,y)

The line() function also draws a line. For a two-dimensional graph, use line(x, y).

>> x = [ 1 2 3 4 5];

>> y = [5 6 10 12 5];

>> line(x,y);

For a three-dimensional graph, use plot3(x, y, z).

>> x = [ 1 2 3 4 5];

>> y = [5 6 10 12 5];

>> z = [4 8 9 10 12];

>> plot3 (x , y , z)

line(x, y, z) likewise draws a line in three dimensions.

>> x = [ 1 2 3 4 5];

>> y = [5 6 10 12 5];

>> z = [4 8 9 10 12];

>> line (x , y , z)

(7) Calling runkMeans from ex7.m

Run the following in Octave. Because plot_progress is true, you have to press the Enter key after every iteration.

[centroids, idx] = runkMeans(X, initial_centroids, max_iters, true);

Every training example is assigned to a cluster based on the three initial centroids, and the assigned examples are drawn in red, green, and blue.

K-Means iteration 1/10...

Press enter to continue.

The mean of each cluster's data becomes the second set of centroids, and a line shows how far each one moved from the first. The training examples are then reassigned based on the second set of centroids.

K-Means iteration 2/10...

Press enter to continue.

The mean of each cluster's data becomes the third set of centroids, and a line shows how far each one moved from the second. The training examples are then reassigned based on the third set of centroids.

K-Means iteration 3/10...

Press enter to continue.

The mean of each cluster's data becomes the fourth set of centroids, and a line shows how far each one moved from the third. The training examples are then reassigned based on the fourth set of centroids.

K-Means iteration 4/10...

Press enter to continue.

K-Means iteration 5/10...

Press enter to continue.

K-Means iteration 6/10...

Press enter to continue.

K-Means iteration 7/10...

Press enter to continue.

K-Means iteration 8/10...

Press enter to continue.

The centroids have now converged, so they no longer move. The remaining iterations, through the 10th, produce the same result.
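"No longer move" can be checked numerically by comparing two successive centroid matrices. The sketch below is not part of runkMeans.m: the tolerance tol and the variable names shift and converged are assumptions, and the centroid values are the ones printed earlier in this post.

```octave
centroids = [1.9540 5.0256; 3.0437 1.0154; 6.0337 3.0005];  % values shown above
previous_centroids = centroids;    % assumption: the last update changed nothing
initial_centroids = [3 3; 6 2; 8 5];
tol = 1e-6;                        % assumed threshold for "did not move"

% Euclidean distance each centroid moved since the previous iteration.
shift = sqrt(sum((centroids - previous_centroids) .^ 2, 2));
converged = all(shift < tol);

% For contrast: distance travelled from the starting positions.
total_shift = sqrt(sum((centroids - initial_centroids) .^ 2, 2));

printf('converged: %d\n', converged);
disp(total_shift');
```

A check like this could be used to stop the loop early instead of always running max_iters iterations.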

1.3 Random initialization

The initial assignments of centroids for the example dataset in ex7.m were designed so that you will see the same figure as in Figure 1. In practice, a good strategy for initializing the centroids is to select random examples from the training set.

In this part of the exercise, you should complete the function kMeansInitCentroids.m with the following code:

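In short: ex7.m hard-codes the initial cluster centroids for the example dataset. In practice, a good initialization strategy is to pick random examples from the training set.

In this part of the exercise you write the code for kMeansInitCentroids.m, which looks like this:

% Initialize the centroids to randomly chosen examples

% Randomly reorder the indices of the training examples

randidx = randperm(size(X, 1));

% Take the first K examples as centroids

centroids = X(randidx(1:K), :);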

The code above first randomly permutes the indices of the examples (using randperm). Then, it selects the first K examples based on the random permutation of the indices. This allows the examples to be selected at random without the risk of selecting the same example twice.

You do not need to make any submissions for this part of the exercise.

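That is, the code shuffles the indices of the training examples with randperm() and takes the first K examples of that random permutation. This picks examples at random with no risk of choosing the same one twice.

Nothing needs to be submitted for this part.

<Commentary>

(1) Analyzing the kMeansInitCentroids.m file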
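The "no duplicates" property can be checked directly. The sketch below uses made-up data in place of ex7data2.mat; the variable names num_unique and all_from_X are introduced only for this illustration.

```octave
% Made-up data: row i is [i, 10*i], so every row is distinct.
X = (1:10)' * [1 10];
K = 3;

% Shuffle the row indices, then keep the first K of them.
randidx = randperm(size(X, 1));
centroids = X(randidx(1:K), :);

% A permutation never repeats an index, so the K rows are all different.
num_unique = size(unique(centroids, 'rows'), 1);

% Every chosen centroid is an actual row of X.
all_from_X = all(ismember(centroids, X, 'rows'));

printf('%d unique centroids, all taken from X: %d\n', num_unique, all_from_X);
```

Drawing K indices independently, for example with randi(size(X, 1), 1, K), would not give this guarantee, because the same index can come up more than once.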

function centroids = kMeansInitCentroids(X, K)

%KMEANSINITCENTROIDS picks K initial centroids at random from the data matrix X

% centroids = KMEANSINITCENTROIDS(X, K) returns K initial centroids to be

% chosen at random from the data matrix X

% K training examples from X are returned as centroids

% Initialize the centroids to be returned

centroids = zeros(K, size(X, 2));

% ====================== YOUR CODE HERE ======================

% Instructions: initialize the centroids at random from the data matrix X

% =============================================================

end

The randperm() function is used to put the rows of the data matrix X in random order.

>> randperm(10) % random permutation of the integers 1 to 10

ans =

1 3 10 2 7 4 9 8 5 6

>> randperm(10, 5) % 5 distinct integers drawn at random from 1 to 10

ans =

2 3 9 5 6

>> size(X,1)

ans = 300

>> randperm(size(X,1))

ans =

Columns 1 through 11:

218 90 8 178 144 29 48 197 168 211 7

>> randidx = randperm(size(X, 1));

This puts the numbers 1 to 300 in random order. Using the first K of them as row indices into X returns the selected examples.

>> randidx = randperm(size(X, 1));

>> centroids = X(randidx(1:K),:)

centroids =

6.2921 2.7757

3.0897 1.0881

2.4030 5.0815

(2) Solution

function centroids = kMeansInitCentroids(X, K)

%KMEANSINITCENTROIDS picks K initial centroids at random from the data matrix X

% centroids = KMEANSINITCENTROIDS(X, K) returns K initial centroids to be

% chosen at random from the data matrix X

% K training examples from X are returned as centroids

% Initialize the centroids to be returned

centroids = zeros(K, size(X, 2));

% ====================== YOUR CODE HERE ======================

% Instructions: initialize the centroids at random from the data matrix X

randidx = randperm(size(X, 1));

centroids = X(randidx(1:K),:);

% =============================================================

end
