brunch

You can make anything
by writing

C.S.Lewis

by 라인하트 Nov 15. 2020

앤드류 응의 머신러닝(12-4):SVM 커널 I

온라인 강의 플랫폼 코세라의 창립자인 앤드류 응 (Andrew Ng) 교수는 인공지능 업계의 거장입니다. 그가 스탠퍼드 대학에서 머신 러닝 입문자에게 한 강의를 그대로 코세라 온라인 강의 (Coursera.org)에서 무료로 배울 수 있습니다. 이 강의는 머신러닝 입문자들의 필수코스입니다. 인공지능과 머신러닝을 혼자 공부하면서 자연스럽게 만나게 되는 강의입니다.

Support Vector Machines

서포트 벡터 머신

Kernels (커널)

Kernels I (커널 I)

In this video, I'd like to start adapting support vector machines in order to develop complex nonlinear classifiers. The main technique for doing that is something called kernels. Let's see what this kernels are and how to use them.

이번 강의에서 복잡한 비선형 분류기를 개발하기 위한 서포트 벡터 머신은 커널이라는 기술을 사용합니다. 커널의 정의와 사용법을 설명합니다.

If you have a training set that looks like this, and you want to find a nonlinear decision boundary to distinguish the positive and negative examples, maybe a decision boundary that looks like that. One way to do so is to come up with a set of complex polynomial features, right? So, set of features that looks like this, so that you end up with a hypothesis X that predicts 1 if you know that theta 0 and plus theta 1 X1 plus dot dot dot all those polynomial features is greater than 0, and predict 0, otherwise.

그림과 같은 학습 셋에서 파지티브 예제와 네거티브 예제를 구분하기 위한 비선형 결정 경계를 찾습니다. 복잡한 다항식 피처 셋을 활용하여 파란색 타원형의 결정 경계를 그릴 수 있습니다. 다항식 가설은 다음과 같습니다.

And another way of writing this, to introduce a level of new notation that I'll use later, is that we can think of a hypothesis as computing a decision boundary using this. So, theta 0 plus theta 1 f1 plus theta 2, f2 plus theta 3, f3 plus and so on. Where I'm going to use this new denotation f1, f2, f3 and so on to denote these new sort of features that I'm computing, so f1 is just X1, f2 is equal to X2, f3 is equal to this one here. So, X1X2. So, f4 is equal to X1 squared where f5 is to be x2 squared and so on and we seen previously that coming up with these high order polynomials is one way to come up with lots more features, the question is, is there a different choice of features or is there better sort of features than this high order polynomials because you know it's not clear that this high order polynomial is what we want, and what we talked about computer vision talk about when the input is an image with lots of pixels. We also saw how using high order polynomials becomes very computationally expensive because there are a lot of these higher order polynomial terms. So, is there a different or a better choice of the features that we can use to plug into this sort of hypothesis form.

다항식을 표현하는 또 다른 방법이 있습니다. f1, f2, f3 등의 새로운 피처를 사용하여 새로운 표기법을 이용하여 결정 경계를 계산하는 다항식은 다음과 같습니다.

고차 다항식을 많은 피처를 가진 식으로 표현합니다. 예를 들면, 많은 픽셀로 이루어진 이미지를 분석하는 컴퓨터 비전 사례는 고차 다항식이 아닌 많은 피처를 사용합니다. 고차 다항식이 많을수록 연산 비용이 많이 들기 때문에 많은 피처를 가진 식으로 표현하는 것이 훨씬 더 나은 선택입니다.

So, here is one idea for how to define new features f1, f2, f3.On this line I am going to define only three new features, but for real problems we can get to define a much larger number. But here's what I'm going to do in this phase of features X1, X2, and I'm going to leave X0 out of this, the interceptor X0, but in this phase X1 X2, I'm going to just, you know, manually pick a few points, and then call these points l1, we are going to pick a different point, let's call that l2 and let's pick the third one and call this one l3, and for now let's just say that I'm going to choose these three points manually. I'm going to call these three points landmarks, so landmark one, two, three. What I'm going to do is define my new features as follows, given an example X, let me define my first feature f1 to be some measure of the similarity between my training example X and my first landmark and this specific formula that I'm going to use to measure similarity is going to be this is E to the minus the length of X minus l1, squared, divided by two sigma squared. So, depending on whether or not you watched the previous optional video, this notation, you know, this is the length of the vector W. And so, this thing here, this X minus l1, this is actually just the euclidean distance squared, is the euclidean distance between the point x and the landmark l1. We will see more about this later. That's my first feature,

여기서 새로운 피처 f1, f2, f3를 정의합니다. 실제로는 훨씬 더 많은 피처가 있지만 3 개의 새로운 Feature만을 정의합니다. 피처 x1과 x2 가 있고, x0는 제외합니다. 수평축이 x1이고 수직축이 x2인 평면에 임의의 점 3개를 찍습니다. 3 개의 점은 l^(1), l^(2), l^(3)로 명명합니다. 여기 3 개의 점을 랜드마크라고 명명합니다. 랜드마크 l^(1), 랜드마크 l^(2), 랜드마크 l^(3)입니다. 이제 새로운 피처를 정의합니다.

바로 전 강의에서 배운 ||w|| 은 벡터 w의 길이입니다. 여기에서 ||x-l^(1)||^2 은 단순히 유클리드 거리를 제곱한 것입니다. x와 랜드마크 l^(1) 사이의 유클리드 거리입니다. 이것은 나중에 다시 한번 다룰 것입니다. 첫 번째 피처 f1은 첫 번째 피처 x가 l^(1)와 얼마나 유사한지 또는 가까운지를 정의하는 유사도 함수입니다.

And my second feature f2 is going to be, you know, similarity function that measures how similar X is to l2 and the game is going to be defined as the following function. This is Exp to the minus of the square of the euclidean distance between X and the second landmark, that is what the enumerator is and then divided by 2 sigma squared and similarly f3 is, you know, similarity between X and l3, which is equal to, again, similar formula.

두 번째 피처 f2는 x가 l^(2)와 얼마나 유사한지 또는 가까운 지를 정의하는 유사도 함수입니다. 세 번째 피처 f3도 l^(3)와 얼마나 유사한지 또는 가까운 지를 정의하는 유사도 함수입니다.

And what this similarity function is, the mathematical term for this, is that this is going to be a kernel function. And the specific kernel I'm using here, this is actually called a Gaussian kernel. And so this formula, this particular choice of similarity function is called a Gaussian kernel. But the way the terminology goes is that, you know, in the abstract these different similarity functions are called kernels and we can have different similarity functions and the specific example I'm giving here is called the Gaussian kernel. We'll see other examples of other kernels. But for now just think of these as similarity functions. And so, instead of writing similarity between X and l, sometimes we also write this a kernel denoted you know, lower case k between x and one of my landmarks all right.

유사도 함수에 대한 수학 용어는 커널 함수(Kernel)입니다. 여기서는 유사도 함수는 가우시안 커널을 사용합니다. 다른 종류의 커널도 살펴보겠지만, 지금은 유사도 함수로 생각합니다.

유사도 함수를 나타내는 similarity() 대신에 커널 k()로 표현하기도 합니다.

So let's see what a kernels actually do and why these sorts of similarity functions, why these expressions might make sense. So let's take my first landmark. My landmark l1, which is one of those points I chose on my figure just now. So the similarity of the kernel between x and l1 is given by this expression. Just to make sure, you know, we are on the same page about what the numerator term is, the numerator can also be written as a sum from J equals 1 through N on sort of the distance. So this is the component wise distance between the vector X and the vector l. And again for the purpose of these slides I'm ignoring X0. So just ignoring the intercept term X0, which is always equal to 1. So, you know, this is how you compute the kernel with similarity between X and a landmark.

So let's see what this function does. Suppose X is close to one of the landmarks. Then this euclidean distance formula and the numerator will be close to 0, right. So, that is this term here, the distance was great, the distance using X and 0 will be close to zero, and so f1, this is a simple feature, will be approximately E to the minus 0 and then the numerator squared over 2 is equal to squared so that E to the 0, E to minus 0, E to 0 is going to be close to one. And I'll put the approximation symbol here because the distance may not be exactly 0, but if X is closer to landmark this term will be close to 0 and so f1 would be close 1.

이제 커널의 역할을 살펴보고 유사도 함수라 불리는 이유를 살펴봅니다. 그럼 첫 번째 랜드마크 l^(1)를 살펴보겠습니다. 여기 학습 예제 x와 랜드마크 l^(1) 사이의 커널 유사성을 나타낸 공식이 있습니다.

여기서 x0는 무시합니다. 이것이 학습 예제 x와 랜드마트 l사이의 거리 또는 유사성을 정의하는 커널을 계산하는 법입니다.

이 함수의 역할을 정리합니다. 학습 예제 x가 랜드마크 중 하나와 가깝다고 가정합니다. 그러면 유클리드 거리 공식의 분자는 0에 가까워집니다. x와 l 사이의 거리를 계산하는 식은 x - l입니다. 두 점이 가깝다는 의미는 0에 가깝다는 것입니다. 따라서, 어떤 수의 0승은 무조건 1이므로 f1은 1입니다.

거리가 정확히 0이 아닐 수 있으므로 기 때문에 근사 기호를 사용하였습니다. x가 랜드마크에 가까울수록 f1의 값은 1에 가까워 집니다.

Conversely, if X is far from 1 then this first feature f1 will be E to the minus of some large number squared, divided divided by two sigma squared and E to the minus of a large number is going to be close to 0. So what these features do is they measure how similar X is from one of your landmarks and the feature f is going to be close to one when X is close to your landmark and is going to be 0 or close to zero when X is far from your landmark.

반대로. x가 랜드마크 l에서 멀어진다면. 첫 번째 피처 f1은 0에 가깝습니다.

따라서, 커널은 x가 랜드마크와 얼마나 가까운 지 또는 얼마나 유사한 지를 측정합니다. 피처 f는 학습 예제 x가 랜드마크에 가까울 때는 1에 가까워지고, 학습 예제 x가 랜드마크에서 멀어질 때 0에 가까워집니다.

Each of these landmarks. On the previous line, I drew three landmarks, l1, l2, l3. Each of these landmarks, defines a new feature f1, f2 and f3. That is, given the the training example X, we can now compute three new features: f1, f2, and f3, given, you know, the three landmarks that I wrote just now. But first, let's look at this exponentiation function, let's look at this similarity function and plot in some figures and just, you know, understand better what this really looks like.

랜드마크 l1, l2, l3 가 있습니다. 각 랜드마크는 새로운 피처 f1, f2, f3를 정의합니다. 즉, 학습 예제 x가 있을 때 새로운 피처 f1, f2, f2를 계산할 수 있습니다. 하지만, 먼저 이 지수 함수의 모야을 보고 생김새를 정리합니다.

For this example, let's say I have two features X1 and X2. And let's say my first landmark, l1 is at a location, 3 5. So and let's say I set sigma squared equals one for now. If I plot what this feature looks like, what I get is this figure. So the vertical axis, the height of the surface is the value of f1 and down here on the horizontal axis are, if I have some training example, and there is x1 and there is x2.

Given a certain training example, the training example here which shows the value of x1 and x2 at a height above the surface, shows the corresponding value of f1 and down below this is the same figure I had showed, using a quantifiable plot, with x1 on horizontal axis, x2 on horizontal axis and so, this figure on the bottom is just a contour plot of the 3D surface.

여기 x1과 x2 2개의 피처가 있고, 첫 번째 랜드마크 l^(1) = [3;5]이고, σ^2 = 1입니다. 따라서, 피처 f1 은 그림과 같습니다. 수직축은 f1, 수평축은 x1과 x2로 이루어진 3차원 평면입니다. 여기에 학습 예제 x가 있습니다. 학습 예제 x의 x1과 x2의 값 그리고 f1의 값을 표시합니다. 맨 아래 그림은 3차원 그래프를 2차원으로 도식화한 것입니다.

You notice that when X is equal to 3 5 exactly, then we the f1 takes on the value 1, because that's at the maximum and X moves away as X goes further away then this feature takes on values that are close to 0. And so, this is really a feature, f1 measures, you know, how close X is to the first landmark and if varies between 0 and one depending on how close X is to the first landmark l1.

x = [3;5] 일 때 랜드마크 l^(1)와의 거리는 0이므로 f1의 값은 1입니다. x=[5;9] 일 때 랜드마트 l과의 거리가 멀어서 f1의 값은 0에 근사합니다. f1은 x가 랜드마크 i^(1)에 얼마나 가까운 지 또는 얼마나 유사한 지를 0과 1 사이의 값으로 나타냅니다.

Now the other was due on this slide is show the effects of varying this parameter sigma squared. So, sigma squared is the parameter of the Gaussian kernel and as you vary it, you get slightly different effects. Let's set sigma squared to be equal to 0.5 and see what we get. We set sigma square to 0.5, what you find is that the kernel looks similar, except for the width of the bump becomes narrower. The contours shrink a bit too. So if sigma squared equals to 0.5 then as you start from X equals 3 5 and as you move away, then the feature f1 falls to zero much more rapidly and conversely, if you has increase since where three in that case and as I

move away from, you know l. So this point here is really l, right, that's l1 is at location 3 5, right. So it's shown up here. And if sigma squared is large, then as you move away from l1, the value of the feature falls away much more slowly. So, given this definition of the features,

이제 파라마터 σ^2의 값을 바꾸면서 그래프의 모양을 살펴봅니다. σ^2는 가우시안 커널의 파라미터이고 값에 따라 다른 모양을 가집니다. σ^2를 0.5로 설정하면 폭이 좁아지고 등고선도 작아집니다. σ^2 = 0.5 일 때 x = [3;5]에서 시작하면 f1은 훨씬 더 빠르게 0으로 떨어집니다. 여기 꼭대기가 l1 = (3,5)입니다. σ^2=3 일 때 폭은 넓어지고 등고선은 커집니다. σ^2이 크다면 l^(1)에서 멀어질 때 훨씬 더 느리게 0으로 떨어집니다. 이것이 σ^2의 특성입니다.

Let's see what source of hypothesis we can learn. Given the training example X, we are going to compute these features f1, f2, f3 and a hypothesis is going to predict one when theta 0 plus theta 1 f1 plus theta 2 f2, and so on is greater than or equal to 0. For this particular example, let's say that I've already found a learning algorithm and let's say that, you know, somehow I ended up with these values of the parameter. So if theta 0 equals minus 0.5, theta 1 equals 1, theta 2 equals 1, and theta 3 equals 0 And what I want to do is consider what happens if we have a training example that takes has location at this magenta dot, right where I just drew this dot over here.

So let's say I have a training example X, what would my hypothesis predict? Well, If I look at this formula. Because my training example X is close to l1, we have that f1 is going to be close to 1 the because my training example X is far from l2 and l3 I have that, you know, f2 would be close to 0 and f3 will be close to 0. So, if I look at that formula, I have theta 0 plus theta 1 times 1 plus theta 2 times some value. Not exactly 0, but let's say close to 0. Then plus theta 3 times something close to 0. And this is going to be equal to plugging in these values now. So, that gives minus 0.5 plus 1 times 1 which is 1, and so on. Which is equal to 0.5 which is greater than or equal to 0. So, at this point, we're going to predict Y equals 1, because that's greater than or equal to zero.

이제 가설의 의미를 파악해 봅니다. 학습 예제 x가 주어지면 f1, f2, f3를 계산합니다. 가설이 θ0 + θ1f1 + θ2f2 + θ3f3 >= 0 일 때 y = 1을 예측할 것입니다. 특정 학습 알고리즘이 파라미터 θ의 값을 다음과 같이 계산합니다. θ0 = -0.5, θ1 = 1, θ2 = 1, θ3 = 0입니다. 왼쪽 그림의 분홍색 점은 훈련 예제입니다. 어떤 결과를 예측할지 보겠습니다.

학습 예제 x는 l^(1)과 가까워서 f1 은 1에 근사합니다. 학습 예제는 l^(2)와 l^(3)와 멀기 때문에 f2와 f3는 0에 근사합니다. 가설 θ0 + θ1f1 + θ2f2 + θ3f3에 θ 값을 대입하면,

따라서, y = 1이라고 예측합니다.

Now let's take a different point. Now lets' say I take a different point, I'm going to draw this one in a different color, in cyan say, for a point out there, if that were my training example X, then if you make a similar computation, you find that f1, f2, Ff3 are all going to be close to 0. And so, we have theta 0 plus theta 1, f1, plus so on and this will be about equal to minus 0.5, because theta 0 is minus 0.5 and f1, f2, f3 are all zero. So this will be minus 0.5, this is less than zero. And so, at this point out there, we're going to predict Y equals zero.

이제 다른 학습 예제를 봅시다. 이 점은 다른 색으로 표시합니다. 하늘색의 점을 x라고 가정합니다. 이 점은 f1, f2, f3가 모두 0에 근사합니다.

따라서, y = 0이라고 예측합니다.

And if you do this yourself for a range of different points, be sure to convince yourself that if you have a training example that's close to L2, say, then at this point we'll also predict Y equals one.

모든 학습 셋에 대해 같은 작업을 반복합니다. l^(2)에 가까운 학습 예제는 y = 1이라고 예측합니다. 직접 계산해 보시기 바랍니다.

And in fact, what you end up doing is, you know, if you look around this boundary, this space, what we'll find is that for points near l1 and l2 we end up predicting positive. And for points far away from l1 and l2, that's for points far away from these two landmarks, we end up predicting that the class is equal to 0. As so, what we end up doing, is that the decision boundary of this hypothesis would end up looking something like this where inside this red decision boundary would predict Y equals 1 and outside we predict Y equals 0. And so this is how with this definition of the landmarks and of the kernel function. We can learn pretty complex non-linear decision boundary, like what I just drew where we predict positive when we're close to either one of the two landmarks.

And we predict negative when we're very far away from any of the landmarks. And so this is part of the idea of kernels of and how we use them with the support vector machine, which is that we define these extra features using landmarks and similarity functions to learn more complex nonlinear classifiers.

그리고, 이 경계를 둘러보면, l^(1)과 l^(2) 근처의 점들은 파지티브 예제라고 예측하고, l^(1)과 l^(2)에서 멀리 떨어진 점들은 클래스가 0인 네거티브 예제라고 예측합니다. 이것이 가설의 결정 경계입니다. 빨간색 결정 경계의 내부는 1이고 외부는 0입니다. 이것이 랜드마크와 커널 함수의 정의를 사용하는 방법입니다. 두 랜드마크 중에 하나에 가까울 때 파지티브 값을 예측합니다. 따라서, 꽤 복잡한 비선형 경계를 배울 수 있습니다.

어떤 랜드마크에서 아주 멀리 떨어져 있을 때 네거티브 값을 예측합니다. 이것이 커널에 대한 아이디어의 일부이고 서포트 벡터 머신에서 커널을 사용하는 방법입니다. 더 복잡한 비선형 분류기를 학습하기 위해 랜드마크 및 유사도 함수를 사용하여 추가 피처를 정의합니다.

So hopefully that gives you a sense of the idea of kernels and how we could use it to define new features for the Support Vector Machine. But there are a couple of questions that we haven't answered yet. One is, how do we get these landmarks? How do we choose these landmarks? And another is, what other similarity functions, if any, can we use other than the one we talked about, which is called the Gaussian kernel. In the next video we give answers to these questions and put everything together to show how support vector machines with kernels can be a powerful way to learn complex nonlinear functions.

커널에 대한 개념과 서포트 벡터 머신의 새로운 피처를 정의하는 방법을 설명했습니다. 그러나, 아직 답변하지 않은 질문이 있습니다. 바로 랜드마크는 어떻게 얻을 수 있는지에 대한 것입니다. 또 다른 유사도 함수가 있다면 가우시안 커널 외에 다른 것을 사용할 수 있을 까요? 다음 강의에서 이런 질문에 대한 답을 제시할 것입니다. 모든 것을 결합하여 커널이 있는 서포트 벡터 머신이 복잡한 비선형 함수를 학습하는 강력한 방법을 설명할 것입니다.