by 라인하트 Nov 21. 2020

Andrew Ng's Machine Learning (13-1): Unsupervised Learning, Clustering

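Professor Andrew Ng, co-founder of the online learning platform Coursera, is a leading figure in the artificial intelligence field. The introductory machine learning course he taught at Stanford University is available for free on Coursera (Coursera.org). It is an essential course for machine learning beginners, and one that anyone studying AI and machine learning on their own will naturally come across.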

Unsupervised Learning

Clustering

Unsupervised Learning: Introduction

In this video, I'd like to start to talk about clustering. This will be exciting, because this is our first unsupervised learning algorithm, where we learn from unlabeled data instead of from labeled data. So, what is unsupervised learning? I briefly talked about unsupervised learning at the beginning of the class, but it's useful to contrast it with supervised learning.

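This lecture covers clustering. Clustering is an unsupervised learning algorithm that learns from unlabeled data rather than labeled data. What is unsupervised learning? It was covered briefly at the start of the course. First, let's look at how it differs from supervised learning.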

So, here's a typical supervised learning problem where we're given a labeled training set and the goal is to find the decision boundary that separates the positive label examples and the negative label examples. So, the supervised learning problem in this case is given a set of labels to fit a hypothesis to it.

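Here is a typical supervised learning problem. We have a labeled training set, and the goal is to find the decision boundary that separates the positive examples from the negative examples. In other words, a supervised learning problem gives us a labeled training set, and we fit a hypothesis to it.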

In contrast, in the unsupervised learning problem we're given data that does not have any labels associated with it. So, we're given data that looks like this. Here's a set of points and no labels, and so our training set is written just x1, x2, and so on up to xm, and we don't get any labels y. And that's why the points plotted on the figure don't have any labels with them. So, in unsupervised learning, what we do is we give this sort of unlabeled training set to an algorithm and we just ask the algorithm to find some structure in the data for us.

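In contrast, an unsupervised learning problem deals with data that has no labels. Here is a set of points without labels. The training set consists of x1, x2, ..., xm, but there are no labels y, which is why the points in the figure carry no labels. In unsupervised learning, we hand this kind of unlabeled training set to an algorithm and ask it to find structure in the data.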
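To make the contrast concrete, here is a minimal Python sketch of the two kinds of training set. The coordinates and variable names are hypothetical toy values, not data from the lecture. The only point is that the supervised set pairs each x with a label y, while the unsupervised set holds the x values alone.

```python
# Hypothetical toy data: each x is a 2-D feature vector.

# Supervised learning: every example comes with a label y (0 or 1).
labeled_training_set = [
    ((1.0, 1.2), 0),
    ((0.8, 0.9), 0),
    ((4.1, 3.9), 1),
    ((3.8, 4.2), 1),
]

# Unsupervised learning: the same kind of points, but with no labels y.
unlabeled_training_set = [x for x, _y in labeled_training_set]

# m is the number of training examples, as in x1, x2, ..., xm.
m = len(unlabeled_training_set)
```

An unsupervised algorithm only ever sees something like unlabeled_training_set and has to discover any grouping on its own.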

Given this data set, one type of structure we might have an algorithm find is that it looks like this data set has points grouped into two separate clusters, and so an algorithm that finds clusters like the ones I've just circled is called a clustering algorithm. And this would be our first type of unsupervised learning, although there will be other types of unsupervised learning algorithms that we'll talk about later that find other types of structure or other types of patterns in the data other than clusters. We'll talk about this after we've talked about clustering.

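In this data set, the points can be grouped into two clusters, like the two circles just drawn. An algorithm that groups data into clusters this way is called a clustering algorithm. Clustering is the first type of unsupervised learning covered in the course. Other types of unsupervised learning algorithms, which find other kinds of structure or patterns in the data, are covered after clustering.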
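The lecture has not introduced a specific clustering algorithm at this point, so the following is only an illustrative sketch of what "finding two clusters" can look like in code. It assumes a simple k-means-style loop, which is one common approach and not necessarily the one the lecture has in mind. The function name cluster_two_groups, the starting centers, and the toy points are all hypothetical.

```python
import math


def cluster_two_groups(points, iterations=10):
    """Group 2-D points into two clusters with a simple k-means-style loop."""
    # Assumption: start from the first point and the point farthest from it.
    centers = [points[0], max(points, key=lambda p: math.dist(p, points[0]))]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to the index of its nearest center.
        assignments = [
            min(range(2), key=lambda k: math.dist(p, centers[k])) for p in points
        ]
        # Move each center to the mean of the points assigned to it.
        for k in range(2):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                centers[k] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return assignments, centers


# Hypothetical unlabeled points forming two visibly separate groups.
points = [(1.0, 1.2), (0.8, 0.9), (1.1, 0.7), (4.1, 3.9), (3.8, 4.2), (4.3, 4.0)]
assignments, centers = cluster_two_groups(points)
```

No labels are passed in. The function returns a cluster index for each point, and with this toy data the first three points end up in one group and the last three in the other.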

So, what is clustering good for? Early in this class I already mentioned a few applications. One is market segmentation, where you may have a database of customers and want to group them into different market segments so you can sell to them separately or serve your different market segments better. Another is social network analysis. There are actually groups that have done things like looking at a group of people's social networks. So, things like Facebook, Google+, or maybe information about who are the people that you email the most frequently and who are the people that they email the most frequently, and using that to find coherent groups of people. So, this would be another clustering application, where you want to find who are the coherent groups of friends in the social network. Here's something that one of my friends actually worked on, which is to use clustering to organize computer clusters or to organize data centers better. Because if you know which computers in the data center or in the cluster tend to work together, you can use that to reorganize your resources, how you lay out the network, and how you design your data center communications. And lastly, something that another friend actually worked on is using clustering algorithms to understand galaxy formation and using that to understand astronomical data. So, that's clustering, which is our first example of an unsupervised learning algorithm.

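What is clustering good for? A few applications were already mentioned early in the course. One is market segmentation: grouping the customers in a customer database into market segments, so that you can sell to each segment separately or serve each segment better. Another is social network analysis on services such as Facebook and Google+. For example, using information about the people you email most often and the people they email most often, you can find coherent groups of people, such as coherent groups of friends in a social network. Clustering is also used in data centers. One of Andrew Ng's friends used clustering to organize computer clusters so that the data center runs more efficiently. If you know which computers in the data center tend to work together, you can reorganize resources, place them on the same network, or redesign the network. Finally, another of his friends used clustering algorithms to understand galaxy formation and astronomical data. These are examples of clustering, the first unsupervised learning algorithm in the course.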

Andrew Ng's Machine Learning video lecture

Summary

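The biggest difference between supervised and unsupervised learning lies in the training set. Supervised learning provides a label y for every example, and a hypothesis is fit to that labeled data. Unsupervised learning provides data with no labels y.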

A representative unsupervised learning algorithm is clustering. A clustering algorithm finds structure in an unlabeled training set by grouping the data into clusters.

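There are several applications of unsupervised learning: market segmentation, which groups the customers in a customer database into market segments; social network analysis on services such as Facebook and Google+; redesigning a data center's network layout by analyzing which computers or servers tend to work together; and analyzing astronomical data.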