by 라인하트 Oct 23. 2020

Andrew Ng's Machine Learning (8-1): Non-linear Hypotheses

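Professor Andrew Ng, a co-founder of the online learning platform Coursera, is a leading figure in the artificial intelligence field. The machine learning course he taught to beginners at Stanford University is available for free, unchanged, on Coursera (Coursera.org). It is an essential course for machine learning beginners, and one you naturally come across when studying artificial intelligence and machine learning on your own.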

Neural Networks: Representation

Motivations

Non-linear Hypotheses

In this and in the next set of videos, I'd like to tell you about a learning algorithm called a Neural Network. We're going to first talk about the representation and then in the next set of videos talk about learning algorithms for it. Neural networks are actually a pretty old idea, but had fallen out of favor for a while. But today, it is the state-of-the-art technique for many different machine learning problems. So why do we need yet another learning algorithm? We already have linear regression and we have logistic regression, so why do we need, you know, neural networks? In order to motivate the discussion of neural networks, let me start by showing you a few examples of machine learning problems where we need to learn complex non-linear hypotheses.

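This lecture begins the explanation of a learning algorithm called a neural network. It starts with how a neural network is represented and then moves on to the learning algorithms for it. Neural networks are an old idea that fell out of favor for a while, but today they are the state-of-the-art technique for many machine learning problems. Machine learning already has linear regression and logistic regression, so why are neural networks needed? To motivate the discussion, here are a few examples of machine learning problems that require complex non-linear hypotheses.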

Consider a supervised learning classification problem where you have a training set like this. If you want to apply logistic regression to this problem, one thing you could do is apply logistic regression with a lot of nonlinear features like that. So here, g as usual is the sigmoid function, and we can include lots of polynomial terms like these. And, if you include enough polynomial terms then, you know, maybe you can get a hypothesis that separates the positive and negative examples. This particular method works well when you have only, say, two features - x1 and x2 - because you can then include all those polynomial terms of x1 and x2. But many interesting machine learning problems have a lot more features than just two.

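Here is a supervised learning classification problem. To fit this training set with logistic regression, one option is to use many non-linear features: the sigmoid function g() is applied to a sum that includes many polynomial terms. A high-order polynomial with enough terms may yield a hypothesis that separates the positive and negative examples. This approach works well when the data has only two features, because all the polynomial terms of x1 and x2 can be included. Most interesting machine learning problems, however, have far more than two features.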
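As a minimal sketch of what "logistic regression with polynomial terms" can look like for two features, the Python below applies the sigmoid g to a weighted sum of hand-picked polynomial terms. The particular terms, the function names, and the parameter values in theta are assumptions chosen for illustration (they are not taken from the lecture slide); theta is set so that the decision boundary is the unit circle.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def polynomial_features(x1, x2):
    """Illustrative choice of polynomial terms built from two features."""
    return [1.0, x1, x2, x1 * x2, x1 ** 2, x2 ** 2, x1 ** 2 * x2, x1 * x2 ** 2]

def hypothesis(theta, x1, x2):
    """h(x) = g(theta . phi(x)): estimated probability that y = 1."""
    phi = polynomial_features(x1, x2)
    z = sum(t * f for t, f in zip(theta, phi))
    return sigmoid(z)

# Hypothetical parameters: z = -1 + x1^2 + x2^2, so the boundary is the unit circle
theta = [-1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0]

print(hypothesis(theta, 0.0, 0.0))  # inside the circle: below 0.5
print(hypothesis(theta, 2.0, 2.0))  # outside the circle: above 0.5
```

With only x1 and x2 the list of terms stays short, which is why this approach is manageable for two features.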

We've been talking for a while about housing prediction, and suppose you have a housing classification problem rather than a regression problem, like maybe if you have different features of a house, and you want to predict what are the odds that your house will be sold within the next six months, so that will be a classification problem. And as we saw we can come up with quite a lot of features, maybe a hundred different features of different houses. For a problem like this, if you were to include all the quadratic terms, that is, all the second-order polynomial terms, there would be a lot of them. There would be terms like x1 squared, x1x2, x1x3, you know, x1x4 up to x1x100 and then you have x2 squared, x2x3 and so on. And if you include just the second order terms, that is, the terms that are a product of, you know, two of these terms, x1 times x1 and so on, then, for the case of n equals 100, you end up with about five thousand features. And, asymptotically, the number of quadratic features grows roughly as order n squared, where n is the number of the original features, like x1 through x100 that we had. And it's actually closer to n squared over two.

So including all the quadratic features doesn't seem like it's maybe a good idea, because that is a lot of features and you might end up overfitting the training set, and it can also be computationally expensive, you know, to be working with that many features. One thing you could do is include only a subset of these, so if you include only the features x1 squared, x2 squared, x3 squared, up to maybe x100 squared, then the number of features is much smaller. Here you have only 100 such quadratic features, but this is not enough features and certainly won't let you fit the data set like that on the upper left. In fact, if you include only these quadratic features together with the original x1, and so on, up to x100 features, then you can actually fit very interesting hypotheses. So, you can fit things like, you know, axis-aligned ellipses like these, but you certainly cannot fit a more complex data set like that shown here.

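Here is a housing data set, treated this time as a classification problem rather than the regression problem discussed so far. Instead of predicting the price, the goal is to predict the probability that a house will be sold within the next six months. Suppose there are 100 features per house. A polynomial hypothesis over these 100 features is built as follows.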

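Start with the first-order terms and the second-order terms of the 100 features. There are 100 first-order terms. The second-order terms look like 100 x 100 = 10,000, but once the duplicates mirrored across the diagonal are removed (x1x2 and x2x1 are the same term), about 5,000 remain. That is, the number of second-order terms is roughly n^2 / 2, which gives 100 x 100 / 2 = 5,000 for n = 100.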
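The count above can be checked by listing the terms directly. The sketch below treats each second-order term xi * xj as an unordered pair of feature indices with i <= j (so squares such as x1 * x1 are included) and counts them with the standard library; the variable names are illustrative.

```python
from itertools import combinations_with_replacement

n = 100  # number of original features x1 ... x100

# Each second-order term x_i * x_j corresponds to an index pair (i, j) with i <= j
quadratic_terms = list(combinations_with_replacement(range(1, n + 1), 2))

print(quadratic_terms[:3])   # [(1, 1), (1, 2), (1, 3)]
print(len(quadratic_terms))  # 5050
print(n * (n + 1) // 2)      # 5050, the closed form of the same count
```

The exact count is n(n + 1) / 2 = 5,050, which is where the "about 5,000" and "closer to n squared over two" figures come from.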

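Including every second-order term in the hypothesis is therefore not a good idea. A logistic regression hypothesis with that many terms may overfit the training set and is expensive to compute. One option is to include only a subset of the terms. For example, using only the squared term of each of the 100 features cuts the number of features sharply. A polynomial with only these 100 second-order terms is not expressive enough, however, and cannot classify the training set well. In fact, a hypothesis with these squared terms plus the original first-order terms can fit interesting shapes such as axis-aligned ellipses, but it still cannot fit a more complex data set like the one shown.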

So 5000 features seems like a lot. If you were to include the cubic, or third-order, polynomial terms, such as x1x2x3, x1 squared x2, x10x11x17 and so on, you can imagine there are gonna be a lot of these features. In fact, there are going to be order n cubed such features, and if n is 100, you can compute that you end up with on the order of about 170,000 such cubic features. So including these higher-order polynomial features when your original feature set n is large really dramatically blows up your feature space, and this doesn't seem like a good way to come up with additional features with which to build nonlinear classifiers when n is large.

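Even 5,000 features is a lot. Adding the third-order terms requires roughly 170,000 terms when the number of features n is 100. The number of terms grows rapidly as the polynomial order increases, so this is not a good approach. In other words, when the number of features n is large, adding higher-order polynomial terms is not a good way to build a non-linear classifier.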
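The growth from about 5,000 to about 170,000 terms follows from a standard counting formula: the number of distinct products of exactly d features chosen from n, with repetition allowed, is C(n + d - 1, d). The sketch below evaluates it for n = 100; the helper name num_monomials is hypothetical, and math.comb requires Python 3.8 or later.

```python
from math import comb

def num_monomials(n, degree):
    """Count distinct products of exactly `degree` features out of n (repetition allowed)."""
    return comb(n + degree - 1, degree)

for degree in (1, 2, 3):
    print(degree, num_monomials(100, degree))
# 1 100
# 2 5050
# 3 171700
```

The third-order count of 171,700 matches the "about 170,000" quoted in the lecture and grows on the order of n cubed.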

For many machine learning problems, n will be pretty large. Here's an example. Let's consider the problem of computer vision. And suppose you want to use machine learning to train a classifier to examine an image and tell us whether or not the image is a car. Many people wonder why computer vision could be difficult. I mean when you and I look at this picture it is so obvious what this is. You wonder how is it that a learning algorithm could possibly fail to know what this picture is.

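In many machine learning problems the number of features n is very large. Computer vision is one example. Suppose a classifier is trained to examine an image and decide whether or not it shows a car. Many people wonder why computer vision is difficult. A person can tell at a glance what the image shows, so how could a learning algorithm fail to recognize it?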

To understand why computer vision is hard, let's zoom into a small part of the image like that area where the little red rectangle is. It turns out that where you and I see a car, the computer sees that. What it sees is this matrix, or this grid, of pixel intensity values that tells us the brightness of each pixel in the image. So the computer vision problem is to look at this matrix of pixel intensity values, and tell us that these numbers represent the door handle of a car.

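To understand why computer vision is hard, zoom in on the part of the image marked by the red rectangle. Where a person sees a car, the computer sees something entirely different: a matrix, or grid, of numbers. Each number is the brightness of one pixel in the image. The computer vision problem is to look at this matrix of pixel brightness values and recognize that it represents the door handle of a car.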

Concretely, when we use machine learning to build a car detector, what we do is we come up with a labeled training set, with, let's say, a few labeled examples of cars and a few labeled examples of things that are not cars, then we give our training set to the learning algorithm to train a classifier and then, you know, we may test it and show the new image and ask, "What is this new thing?". And hopefully it will recognize that that is a car.

For example, consider building a learning algorithm that detects cars. First, collect a labeled training set: examples labeled "car" and examples labeled "not a car". Next, the learning algorithm trains a classifier on the training set. The trained classifier is then shown a new image. When asked "What is this?", it will hopefully answer "car".

To understand why we need nonlinear hypotheses, let's take a look at some of the images of cars and maybe non-cars that we might feed to our learning algorithm. Let's pick a couple of pixel locations in our images, so that's pixel one location and pixel two location.

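To see why non-linear hypotheses are needed, here are some images of cars and non-cars that might be fed to the learning algorithm. Pick a couple of pixel locations in the images. Here pixel 1 is on the side mirror and pixel 2 is on a wheel.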

And let's plot this car, you know, at the location, at a certain point, depending on the intensities of pixel one and pixel two. And let's do this with a few other images. So let's take a different example of the car and you know, look at the same two pixel locations and that image has a different intensity for pixel one and a different intensity for pixel two. So, it ends up at a different location on the figure. And then let's plot some negative examples as well. That's a non-car, that's a non-car.

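Plot each car example as a point whose coordinates are the brightness of pixel 1 and the brightness of pixel 2. Do the same for other car images: the same two pixel locations have different brightness values in each image, so each image lands at a different point. Negative examples, the images that are not cars, are plotted in the same way.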
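The mapping from an image to a point on the plot can be sketched in a few lines. Everything in the example below is made up for illustration: the tiny 3 x 3 "images", their brightness values, the labels, and the two chosen pixel locations are assumptions, not data from the lecture.

```python
# Toy 3x3 grayscale images (brightness 0-255) with labels; values are invented
images = [
    ([[30, 210, 40], [80, 120, 60], [50, 90, 190]], "+"),      # car
    ([[25, 180, 35], [70, 110, 65], [55, 85, 230]], "+"),      # car
    ([[200, 40, 190], [150, 60, 140], [170, 130, 20]], "-"),   # not a car
]

# The same two pixel locations (row, column) are read from every image
PIXEL_1 = (0, 1)
PIXEL_2 = (2, 2)

def to_point(image):
    """Return (brightness of pixel 1, brightness of pixel 2) for one image."""
    r1, c1 = PIXEL_1
    r2, c2 = PIXEL_2
    return (image[r1][c1], image[r2][c2])

points = [(to_point(image), label) for image, label in images]
for point, label in points:
    print(label, point)
```

Each image becomes one labeled point in a two-dimensional space, which is the picture the lecture builds up before moving to the full feature vector.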

And if we do this for more and more examples using the pluses to denote cars and minuses to denote non-cars, what we'll find is that the cars and non-cars end up lying in different regions of the space, and what we need therefore is some sort of non-linear hypothesis to try to separate out the two classes.

What is the dimension of the feature space? Suppose we were to use just 50 by 50 pixel images. That is, suppose our images were pretty small ones, just 50 pixels on the side. Then we would have 2500 pixels, and so the dimension of our feature space will be n equals 2500, where our feature vector x is a list of all the pixel intensities, you know, the brightness of pixel one, the brightness of pixel two, and so on down to the brightness of the last pixel, where, in a typical computer representation, each of these may be values between say 0 and 255 if it gives us the grayscale value. So we have n equals 2500, and that's if we were using grayscale images. If we were using RGB images with separate red, green and blue values, we would have n equals 7500. So, if we were to try to learn a nonlinear hypothesis by including all the quadratic features, that is all the terms of the form, you know, xi times xj, well, with the 2500 pixels we would end up with a total of three million features. And that's just too large to be reasonable; the computation would be very expensive to find and to represent all of these three million features per training example.

So, simple logistic regression together with adding in maybe the quadratic or the cubic features - that's just not a good way to learn complex nonlinear hypotheses when n is large because you just end up with too many features.

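As more examples are plotted, with + for "car" images and - for "not a car" images, the cars and non-cars end up in different regions of the space. What is needed, then, is a non-linear hypothesis that separates the two classes.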

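What is the dimension of the feature space? A small image of 50 by 50 pixels consists of 2,500 pixels. The feature space therefore has dimension n = 2,500, and the feature vector x is a list of all the pixel values: the brightness of pixel 1, the brightness of pixel 2, and so on down to the last pixel. A computer typically represents each pixel of a grayscale image as a brightness between 0 and 255. In an RGB image each pixel has separate red, green, and blue values, so there are three features per pixel and n = 7,500. If a non-linear hypothesis includes all the quadratic terms, the 2,500 pixels produce a polynomial with about 3 million terms. Three million features is too many, and the computation is very expensive, because all 3 million features have to be computed and represented for every single training example.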
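The arithmetic in this paragraph can be reproduced with a short sketch. The all-zero placeholder image and the variable names are assumptions for illustration; a real image would hold brightness values between 0 and 255.

```python
WIDTH, HEIGHT = 50, 50

# Placeholder grayscale image: a 50 x 50 grid of zero intensities
image = [[0] * WIDTH for _ in range(HEIGHT)]

# Unroll the grid row by row into a single feature vector x
x = [pixel for row in image for pixel in row]

n_gray = len(x)                           # one feature per pixel
n_rgb = 3 * n_gray                        # red, green, and blue values per pixel
n_quadratic = n_gray * (n_gray + 1) // 2  # all terms x_i * x_j with i <= j

print(n_gray, n_rgb, n_quadratic)  # 2500 7500 3126250
```

The exact number of quadratic terms for the grayscale case is 3,126,250, which the lecture rounds to three million.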

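So adding second-order or third-order terms to simple logistic regression is not a good idea. When the number of features n is large, it is not a good way to learn complex non-linear hypotheses, because it produces too many features.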

In the next few videos, I would like to tell you about neural networks, which turn out to be a much better way to learn complex nonlinear hypotheses, even when your input feature space is large, even when n is large. And along the way I'll also get to show you a couple of fun videos of historically important applications of neural networks, and I hope those videos that we'll see later will be fun for you to watch as well.

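The next few lectures explain neural networks, which are a much better way to learn complex non-linear hypotheses, even when the input feature space is very large. Along the way there will also be a few videos of historically important applications of neural networks, which will hopefully be fun to watch.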

Andrew Ng's machine learning video lecture

Summary - Why non-linear hypotheses call for neural networks

Machine learning already has linear regression and logistic regression. So why are neural networks needed?

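Many machine learning problems have a very large number of features. A computer vision algorithm that classifies photos sees a photo as a matrix of pixel brightness values. A 50 x 50 pixel grayscale image needs 2,500 features, and an RGB (Red, Green, Blue) color image needs 7,500. Using a second-order polynomial requires about 2,500 x 2,500 / 2 features, roughly 3 million for a single 50 x 50 image. In short, when there are many features, adding second-order terms to simple logistic regression to obtain a non-linear hypothesis is not a good idea.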

Neural networks are a better way to learn complex non-linear hypotheses, and they can be used even when the input feature space is very large.