by 라인하트 Sep 23. 2020

Andrew Ng's Machine Learning (1-3): Supervised Learning

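Professor Andrew Ng, co-founder of the online learning platform Coursera, is a leading figure in artificial intelligence. The machine learning course he taught to beginners at Stanford University is available for free on Coursera (Coursera.org). It is an essential course for machine learning beginners, and one you naturally come across when studying AI and machine learning on your own. This post briefly summarizes the lecture.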

Welcome

Introduction

Supervised Learning

In this video, I'm going to define what is probably the most common type of Machine Learning problem, which is Supervised Learning. I'll define Supervised Learning more formally later, but it's probably best to explain or start with an example of what it is, and we'll do the formal definition later.

This lecture covers Supervised Learning, probably the most common type of machine learning problem. A formal definition comes later; we start with an example.

Let's say you want to predict housing prices. A while back a student collected data sets from the City of Portland, Oregon, and let's say you plot the data set and it looks like this. Here the horizontal axis shows the size of different houses in square feet, and the vertical axis shows the price of different houses in thousands of dollars. So, given this data, let's say you have a friend who owns a house that is, say, 750 square feet, and they are hoping to sell the house and want to know how much they can get for it. So, how can the learning algorithm help you?

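Suppose you want to predict housing prices. A while back, a student collected a data set from the city of Portland, Oregon, and plotted it as shown in the figure. The horizontal axis shows the size of each house in square feet (1 foot is 30.48 cm), and the vertical axis shows the price in thousands of dollars. You have a friend who wants to know how much a 750-square-foot house can sell for. How can a learning algorithm help?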

One thing a learning algorithm might do is put a straight line through the data, that is, fit a straight line to the data. Based on that, it looks like their house can be sold for maybe about $150,000.

A learning algorithm can draw a straight line through the data. Reading the friend's house size off that line, the house looks likely to sell for about $150,000.

But maybe this isn't the only learning algorithm you can use, and there might be a better one. For example, instead of fitting a straight line to the data, we might decide that it's better to fit a quadratic function, or a second-order polynomial to this data. If you do that and make a prediction here, then it looks like, well, maybe they can sell the house for closer to $200,000.

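However, a learning algorithm may be able to fit a better curve to the data, for example a quadratic function, that is, a second-order polynomial. Such a curve may follow the data more closely than a straight line. Reading the friend's house size off the curve, the house could probably sell for close to $200,000.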
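The two fits described above can be sketched in a few lines of Python. This is a minimal sketch, not the lecture's own code: the house sizes and prices are made-up illustrative values (not the Portland data set), sizes are expressed in thousands of square feet to keep the arithmetic well scaled, and `solve_linear_system`, `fit_polynomial`, and `predict` are hypothetical helper names. It fits a degree-1 and a degree-2 polynomial by ordinary least squares and evaluates both at 750 square feet.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_polynomial(xs, ys, degree):
    """Least-squares fit via the normal equations.

    Returns coefficients c so that y is approximated by
    c[0] + c[1] * x + ... + c[degree] * x ** degree.
    """
    n = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    return solve_linear_system(a, b)


def predict(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


# Hypothetical data: size in thousands of square feet, price in thousands of dollars.
sizes = [0.5, 0.8, 1.0, 1.2, 1.5, 1.8, 2.0, 2.5]
prices = [120.0, 180.0, 210.0, 240.0, 280.0, 310.0, 325.0, 350.0]

line = fit_polynomial(sizes, prices, degree=1)   # straight line
curve = fit_polynomial(sizes, prices, degree=2)  # quadratic

friend_size = 0.75  # 750 square feet
print("straight-line prediction:", round(predict(line, friend_size), 1))
print("quadratic prediction:", round(predict(curve, friend_size), 1))
```

The two models generally give different predictions for the same house, which is why the choice between them matters.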

One of the things we'll talk about later is how to choose, and how to decide: do you want to fit a straight line to the data, or do you want to fit a quadratic function to the data? It's not fair to simply pick whichever one gives your friend the better price for the house. But each of these would be a fine example of a learning algorithm.

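One of the things we will discuss later is how to choose between these. Do you want to fit a straight line to the data, or a quadratic curve? It would not be fair to pick simply whichever one gives your friend the higher price. Either way, both are good examples of a learning algorithm.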

So, this is an example of a Supervised Learning algorithm. The term Supervised Learning refers to the fact that we gave the algorithm a data set in which the so-called "right answers" were given. That is, we gave it a data set of houses in which, for every example in this data set, we told it what the right price is, that is, the actual price that the house sold for. The task of the algorithm was to just produce more of these right answers, such as for this new house that your friend may be trying to sell.

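This is an example of a supervised learning algorithm. The term Supervised Learning means that we give the algorithm a data set in which the right answers are labeled. Here, we gave the algorithm a data set containing, for each house size, the actual price the house sold for. The algorithm's job is to produce more of these right answers, such as the price of the house your friend wants to sell.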

To define a bit more terminology, this is also called a regression problem. By regression problem, I mean we're trying to predict a continuous-valued output, namely the price. Technically, I guess prices can be rounded off to the nearest cent, so maybe prices are actually discrete values. But usually we think of the price of a house as a real number, as a scalar value, as a continuous-valued number, and the term regression refers to the fact that we're trying to predict this sort of continuous-valued attribute.

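Another new term is the regression problem. A regression problem is one where we try to predict a continuous-valued output, such as a price. Strictly speaking, prices can be rounded to the nearest cent, so they are discrete values. Usually, though, we think of a house price as a real number, a scalar. So a regression problem predicts an output with a continuous value.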

Here's another Supervised Learning example. Some friends and I were actually working on this earlier. Let's say you want to look at medical records and try to predict whether a breast tumor is malignant or benign. If someone discovers a breast tumor, a lump in their breast, a malignant tumor is a tumor that is harmful and dangerous, and a benign tumor is a tumor that is harmless. So obviously, people care a lot about this.

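Here is another supervised learning example: a learning algorithm that identifies cancer. It looks at medical records and predicts whether a breast tumor is malignant or benign. A malignant tumor is harmful and dangerous, while a benign tumor is harmless. If someone discovers a lump in their breast, it must be determined whether it is malignant or benign.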

Let's look at a collected data set. Suppose that in your data set you have on the horizontal axis the size of the tumor, and on the vertical axis I'm going to plot one or zero, yes or no, whether or not these examples of tumors we've seen before are malignant, which is one, or not malignant, that is benign, which is zero. So, let's say your data set looks like this, where we saw a tumor of this size that turned out to be benign, one of this size, one of this size, and so on. Sadly, we also saw a few malignant tumors, one of that size, one of that size, one of that size, and so on. So in this example, I have five examples of benign tumors shown down here, and five examples of malignant tumors shown with a vertical axis value of one.

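Here is a data set. The horizontal axis is the tumor size, and the vertical axis is 1 or 0, that is, yes or no. A tumor in the data is 1 if it is malignant and 0 if it is benign. There are tumors of various sizes; some turned out to be benign and, sadly, some malignant. In summary, the five tumors shown at the bottom are benign, and there are also five malignant tumors with a vertical-axis value of 1.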

Let's say a friend tragically has a breast tumor, and let's say her breast tumor size is somewhere around this value. The Machine Learning question is, can you estimate the probability, the chance, that the tumor is malignant versus benign? To introduce a bit more terminology, this is an example of a classification problem. The term classification refers to the fact that here we're trying to predict a discrete-valued output, zero or one, malignant or benign. It turns out that in classification problems, sometimes you can have more than two possible values for the output. As a concrete example, maybe there are three types of breast cancers. So, you may try to predict a discrete-valued output of zero, one, two, or three, where zero may mean a benign tumor, so no cancer, one may mean type one cancer, whatever type one means, two may mean a second type of cancer, and three may mean a third type of cancer. But this would also be a classification problem, because this is a discrete-valued set of outputs corresponding to no cancer, cancer type one, cancer type two, or cancer type three.

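Suppose a friend sadly has a breast tumor, and its size is at the position marked by the pink arrow. We can ask a machine learning algorithm whether the tumor is malignant or benign, or what the probability is that it is malignant. Here is one more term: this is an example of a classification problem. A classification problem predicts a discrete-valued output, such as 0 or 1, malignant or benign. Some classification problems have more than two possible output values. For example, there might be three types of breast cancer. You would then predict a discrete value of 0, 1, 2, or 3, where 0 means a benign tumor (no cancer), 1 means type 1 cancer, 2 means type 2 cancer, and 3 means type 3 cancer. This is still a classification problem, because the output takes discrete rather than continuous values.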
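The question of estimating a probability of malignancy from tumor size alone can be sketched as follows. The lecture does not name an algorithm at this point; the sketch assumes logistic regression trained by batch gradient descent, which is one common choice and is introduced here purely for illustration. The tumor sizes and labels are made-up values, and `sigmoid`, `train_logistic`, and `probability_malignant` are hypothetical names.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit P(y = 1 | x) = sigmoid(w * x + b) by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical tumor sizes (arbitrary units) and labels: 0 = benign, 1 = malignant.
tumor_sizes = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

w, b = train_logistic(tumor_sizes, labels)


def probability_malignant(size):
    """Estimated probability that a tumor of this size is malignant."""
    return sigmoid(w * size + b)


for size in (1.5, 3.2, 5.0):
    p = probability_malignant(size)
    print(size, round(p, 3), "malignant" if p >= 0.5 else "benign")
```

The output is a discrete label (0 or 1) obtained by thresholding the estimated probability at 0.5, which is what makes this a classification problem rather than a regression problem.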

In classification problems, there is another way to plot this data. Let me show you what I mean. I'm going to use a slightly different set of symbols to plot this data. So, if tumor size is going to be the attribute that I'm going to use to predict malignancy or benignness, I can also draw my data like this. I'm going to use different symbols to denote my benign and malignant, or my negative and positive examples. So, instead of drawing crosses, I'm now going to draw O's for the benign tumors, like so, and I'm going to keep using X's to denote my malignant tumors. I hope this figure makes sense. All I did was I took my data set on top, and I just mapped it down to this real line like so, and started to use different symbols, circles and crosses to denote malignant versus benign examples. Now, in this example, we use only one feature or one attribute, namely the tumor size in order to predict whether a tumor is malignant or benign.

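In classification problems there is another way to plot the data, using different symbols. When tumor size is the attribute used to predict whether a tumor is malignant or benign, the data can be drawn on a single line. Symbols mark whether each example is benign or malignant, that is, a negative or a positive example. A cross (X) denotes a malignant tumor and a circle (O) denotes a benign one. I hope the single-line figure is easy to follow: the data that was split into two rows at the top is mapped onto one line at the bottom, with benign tumors drawn as O and malignant tumors as X. Only one feature, or attribute, is used here: tumor size alone predicts whether a tumor is malignant or benign.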

In other machine learning problems, we may have more than one feature or more than one attribute. Here's an example. Let's say that instead of just knowing the tumor size, we know both the age of the patients and the tumor size. In that case, maybe your data set would look like this, where I may have a set of patients with those ages and tumor sizes, and they look like this, and a different set of patients that look a little different, whose tumors turn out to be malignant, as denoted by the crosses. So, let's say you have a friend who tragically has a tumor, and maybe their tumor size and age fall around there. Given a data set like this, what the learning algorithm might do is fit a straight line to the data to try to separate out the malignant tumors from the benign ones, and so the learning algorithm may decide to put a straight line like that to separate out the two classes of tumors. With this, hopefully your learning algorithm will say that your friend's tumor falls on this benign side and is therefore more likely to be benign than malignant. In this example, we had two features, namely the age of the patient and the size of the tumor.

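Here is a machine learning problem whose data set has more than one feature, or attribute. For example, suppose we know both the tumor size and the patient's age. With two features, the data set can be drawn on a two-dimensional plane, with one point for each patient's age and tumor size. A cross (X) denotes a malignant tumor and a circle (O) a benign one. Suppose a friend sadly has a tumor, and their tumor size and age fall at the pink dot. Given this data set, the learning algorithm draws the black straight line to separate the two classes of tumors. Fortunately, the friend's tumor appears to be benign: because it falls on the benign side of the line, the algorithm indicates that it is more likely benign than malignant. This example has exactly two features, the patient's age and the tumor size.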
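How a straight line turns two features into a prediction can be sketched as follows. This is only an illustration of the decision rule, under stated assumptions: the line's weights are hand-picked rather than learned, the patient records are made-up values, and `decision_value` and `classify` are hypothetical names.

```python
def decision_value(age, tumor_size, w_age, w_size, bias):
    """Signed score: positive on the malignant side of the line, negative on the benign side."""
    return w_age * age + w_size * tumor_size + bias


def classify(age, tumor_size, w_age=0.05, w_size=1.0, bias=-5.5):
    """Predict a class from the side of the line on which the point falls.

    The default weights are hand-picked for this illustration, not learned.
    """
    score = decision_value(age, tumor_size, w_age, w_size, bias)
    return "malignant" if score > 0 else "benign"


# Hypothetical patients: (age in years, tumor size in arbitrary units, known label).
patients = [
    (25, 1.5, "benign"),
    (35, 2.0, "benign"),
    (45, 2.5, "benign"),
    (50, 4.0, "malignant"),
    (60, 3.5, "malignant"),
    (70, 4.5, "malignant"),
]

for age, size, label in patients:
    print(age, size, "known:", label, "predicted:", classify(age, size))

# A new, unlabeled patient (the friend in the example).
print("friend:", classify(40, 2.0))
```

A learning algorithm would choose the weights and bias from the labeled data instead of taking them as given; the rule for classifying a new point stays the same.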

In other Machine Learning problems, we will often have more features. My friends that worked on this problem actually used other features like these, which are clump thickness of the breast tumor, uniformity of cell size of the tumor, uniformity of cell shape of the tumor, and so on. It turns out one of the most interesting learning algorithms that we'll see in this course is a learning algorithm that can deal with not just two, or three, or five features, but an infinite number of features.

On this slide, I've listed a total of five different features. Two on the axis and three more up here. But it turns out that for some learning problems what you really want is not to use like three or five features, but instead you want to use an infinite number of features, an infinite number of attributes, so that your learning algorithm has lots of attributes, or features, or cues with which to make those predictions. So, how do you deal with an infinite number of features? How do you even store an infinite number of things in the computer when your computer is going to run out of memory? It turns out that when we talk about an algorithm called the Support Vector Machine, there will be a neat mathematical trick that will allow a computer to deal with an infinite number of features. Imagine that I didn't just write down two features here and three features on the right, but imagine that I wrote down an infinitely long list. I just kept writing more and more features, like an infinitely long list of features. It turns out we will come up with an algorithm that can deal with that.

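Machine learning problems can have many more features. The people who actually worked on this breast cancer problem used several: clump thickness, uniformity of cell size, and uniformity of cell shape, among others. One of the most interesting learning algorithms covered in this course can deal not just with two, three, or five features, but with an infinite number of them. This slide lists five features in total: two on the axes and three more on the right. For some learning problems, what you want is not three or five features but infinitely many, so that the learning algorithm has plenty of attributes or cues for making predictions. So how does a learning algorithm deal with infinitely many features? Storing an infinite number of features would exhaust the computer's memory. The Support Vector Machine algorithm, covered later, uses a neat mathematical trick that lets a computer deal with an infinite number of features. Imagine that the slide listed not two features here and three on the right but an infinitely long list of features. The course will present an algorithm that can deal with that.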
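The lecture does not say what the "neat mathematical trick" is at this point. One common way to make the idea concrete, offered here as an assumption rather than as the course's own presentation, is a kernel function: it returns the inner product of two feature vectors without ever building those vectors. The sketch below checks this for a degree-2 polynomial kernel, where the explicit feature list is still finite, and then defines a Gaussian (RBF) kernel, whose implicit feature list is infinitely long. All function names are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def quadratic_features(x):
    """Explicit feature list: every pairwise product x[i] * x[j]."""
    return [xi * xj for xi in x for xj in x]


def quadratic_kernel(x, z):
    """Same inner product as the explicit features, computed from the raw inputs."""
    return dot(x, z) ** 2


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian kernel; its implicit feature list is infinitely long."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


x = [1.0, 2.0, 3.0]
z = [0.5, -1.0, 2.0]

explicit = dot(quadratic_features(x), quadratic_features(z))  # builds 9 features per input
implicit = quadratic_kernel(x, z)                             # builds none
print(explicit, implicit)
print(rbf_kernel(x, z))
```

The explicit and implicit values agree, but the kernel never stores the longer feature list. With the Gaussian kernel the same shortcut applies even though the corresponding list could not be stored at all.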

So, just to recap, in this course we'll talk about Supervised Learning, and the idea is that in Supervised Learning, for every example in our data set, we are told the correct answer that we would have liked the algorithm to predict on that example, such as the price of the house, or whether a tumor is malignant or benign. We also talked about the regression problem, and by regression we mean that our goal is to predict a continuous-valued output. We talked about the classification problem, where the goal is to predict a discrete-valued output.

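To recap, this lecture covered supervised learning. In supervised learning, every example in the data set comes with the correct answer, and the algorithm learns from this labeled data to make predictions, such as the price of a house or whether a tumor is malignant or benign. We also covered the regression problem, where the goal is to predict a continuous-valued output, and the classification problem, where the goal is to predict a discrete-valued output.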

Just a quick wrap-up question. Suppose you're running a company and you want to develop learning algorithms to address each of two problems. In the first problem, you have a large inventory of identical items. So, imagine that you have thousands of copies of some identical items to sell, and you want to predict how many of these items you will sell over the next three months. In the second problem, problem two, you have lots of users, and you want to write software to examine each one of your customers' accounts and, for each account, decide whether or not the account has been hacked or compromised. So, for each of these problems, should it be treated as a classification problem or as a regression problem? When the video pauses, please use your mouse to select whichever of these four options on the left you think is the correct answer.

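Here is a wrap-up question. Suppose you run a company and are developing learning algorithms for two problems. The first is to predict how many units of a large inventory of identical items you will sell over the next three months. The second is to write software that examines each customer account and decides whether it has been hacked or compromised. Should each problem be treated as a classification problem or a regression problem? When the video pauses, use your mouse to select the correct answer from the four options.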

So hopefully, you got that. This is the answer. For problem one, I would treat this as a regression problem, because if I have thousands of items, I would probably just treat the number of items I sell as a real value, as a continuous value. For the second problem, I would treat that as a classification problem, because I might set the value I want to predict to zero to denote that the account has not been hacked, and to one to denote an account that has been hacked into. So, just like the breast cancer example, where zero is benign and one is malignant, I might set this to be zero or one depending on whether the account has been hacked, and have an algorithm try to predict each one of these two discrete values. Because there's a small number of discrete values, I would therefore treat it as a classification problem. So, that's it for Supervised Learning. In the next video, I'll talk about Unsupervised Learning, which is the other major category of learning algorithm.

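Hopefully you got it right. The first problem is treated as a regression problem, because with thousands of items the number sold can be treated as a real, continuous value. The second is treated as a classification problem, because an account that has not been hacked can be labeled 0 and a hacked account 1. This is just like the tumor problem, where 0 is benign and 1 is malignant. We set the value to 0 or 1 depending on whether the account was hacked and have the algorithm predict one of these discrete values. Because there is a small number of discrete values, it is best treated as a classification problem. That's it for supervised learning. The next lecture covers unsupervised learning, the other major category of learning algorithm.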

Andrew Ng's Machine Learning video lecture

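Summary

Supervised learning means that a learning algorithm learns from a data set in which the right answers are labeled. The algorithm models the relationship between inputs and outputs and uses it to make predictions. Supervised learning broadly divides into regression problems, which predict a continuous value, and classification problems, which predict a discrete value.

1) Regression Problem: predicting a continuous value

A learning algorithm predicts how price varies with the size of a house. The training data set, which maps each house size to the actual price the house sold for, is fed to the algorithm. Once it has learned from the training examples, the algorithm makes predictions. The algorithm learns according to a hypothesis: it may fit the straight line of a linear function, or the curve of a quadratic or other polynomial function. Choosing between them is up to the data scientist. A regression problem predicts a continuous-valued output; typical examples are a price or a quantity.

2) Classification Problem: predicting a discrete value

A learning algorithm predicts from the size of a tumor whether it is malignant. A training data set of tumor sizes labeled as malignant or benign is fed to the learning algorithm. Once it has learned from the training data set, the algorithm makes predictions, judging a tumor to be benign or malignant according to its size.

A classification problem can have several features rather than just one. An algorithm that decides whether a tumor is malignant may predict more accurately if it also uses features such as the clump thickness of the breast tumor, the uniformity of cell size, and the uniformity of cell shape. In a classification problem the number of features or attributes can even grow without bound, and a naive algorithm handling such data would simply run out of computer memory. The Support Vector Machine algorithm, however, can handle classification problems with infinitely many features through a neat mathematical trick.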