by 라인하트, Nov 12, 2020

Andrew Ng's Machine Learning (12-2): SVM Large Margin Classifier

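Professor Andrew Ng, co-founder of the online learning platform Coursera, is a leading figure in the artificial intelligence field. The introductory machine learning course he taught at Stanford University is available for free on Coursera (Coursera.org). It is an essential course for machine learning beginners, and one that anyone studying AI and machine learning on their own will naturally come across.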

Support Vector Machines

Large Margin Classification

Large Margin Intuition

Sometimes people talk about support vector machines as large margin classifiers. In this video I'd like to tell you what that means, and this will also give us a useful picture of what an SVM hypothesis may look like.

A support vector machine is a large margin classifier. This lecture explains what that means and what an SVM hypothesis looks like.

Here's my cost function for the support vector machine, where on the left I've plotted my cost 1 of z function that I use for positive examples, and on the right I've plotted my cost 0 of z function, with z on the horizontal axis. Now, let's think about what it takes to make these cost functions small. If you have a positive example, so if y is equal to 1, then cost 1 of z is zero only when z is greater than or equal to 1. So in other words, if you have a positive example, we really want theta transpose x to be greater than or equal to 1. And conversely, if y is equal to zero, look at this cost 0 of z function: it's only in this region, where z is less than or equal to -1, that the cost is zero.

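Here is the cost function of the support vector machine.

The left plot shows Cost1(z) and the right plot shows Cost0(z), with z on the horizontal axis. What does it take to make the cost small? For a positive example (y = 1), Cost1(z) is zero only when z is greater than or equal to 1, so we want θ^T x >= 1. For a negative example (y = 0), Cost0(z) is zero only when z is less than or equal to -1, so we want θ^T x <= -1.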
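The two cost curves can be sketched in a few lines of Python. This is a minimal sketch that assumes the sloped part of each curve has slope 1, so that cost1(z) = max(0, 1 - z) and cost0(z) = max(0, 1 + z). The lecture only fixes where each curve reaches zero, not the slope, and the function names are illustrative.

```python
def cost1(z):
    """Cost for a positive example (y = 1): zero once z >= 1."""
    # Assumed slope of 1 on the sloped part; the lecture does not fix the slope.
    return max(0.0, 1.0 - z)


def cost0(z):
    """Cost for a negative example (y = 0): zero once z <= -1."""
    # Mirror image of cost1, again with an assumed slope of 1.
    return max(0.0, 1.0 + z)


if __name__ == "__main__":
    for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"z = {z:+.1f}  cost1 = {cost1(z):.1f}  cost0 = {cost0(z):.1f}")
```

Printing the table makes the asymmetry visible: cost1 stays positive until z reaches 1, and cost0 stays positive until z drops to -1.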

And this is an interesting property of the support vector machine, which is that if you have a positive example, so if y is equal to one, then all we really need is that theta transpose x is greater than or equal to zero. That would mean that we classify correctly, because if theta transpose x is greater than zero, our hypothesis will predict one. And similarly, if you have a negative example, then really all you want is that theta transpose x is less than zero, and that will make sure we got the example right. But the support vector machine wants a bit more than that. It says, you know, don't just barely get the example right. So don't just have it a little bit bigger than zero. What I really want is for this to be quite a lot bigger than zero, say greater than or equal to one, and I want this to be much less than zero, maybe less than or equal to -1.

And so this builds in an extra safety factor or safety margin factor into the support vector machine. Logistic regression does something similar too of course, but let's see what happens or let's see what the consequences of this are, in the context of the support vector machine.

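In other words, as in logistic regression, a positive example (y = 1) only needs θ^T x >= 0: the hypothesis then predicts 1 and classifies the example correctly. Likewise, a negative example (y = 0) is classified correctly whenever θ^T x < 0. But the support vector machine does not settle for barely getting an example right. It wants θ^T x to be not just slightly above 0 but well above it, say greater than or equal to 1, and not just slightly below 0 but well below it, say less than or equal to -1.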

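In this way the support vector machine builds in an extra safety factor, or safety margin factor. Logistic regression does something similar, of course, but the consequences in the support vector machine are worth a closer look.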
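The gap between "classified correctly" and "classified with a safety margin" can be illustrated with a small sketch. The helper names and the sample values of z = θ^T x are made up for illustration.

```python
def predicts_positive(z):
    """Decision rule: predict y = 1 when z = theta^T x is at least 0."""
    return z >= 0.0


def meets_margin(z, y):
    """True when the example clears the safety margin the SVM asks for."""
    return z >= 1.0 if y == 1 else z <= -1.0


if __name__ == "__main__":
    # Hypothetical values of z = theta^T x for three positive examples.
    for z in (0.2, 1.0, 3.5):
        print(f"z = {z}: predicted positive = {predicts_positive(z)}, "
              f"margin met = {meets_margin(z, 1)}")
```

An example with z = 0.2 is on the correct side of the boundary, yet it still fails the margin condition, which is exactly the case the SVM cost penalizes.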

Concretely, what I'd like to do next is consider a case where we set this constant C to be a very large value, so let's imagine we set C to a very large value, maybe a hundred thousand, some huge number. Let's see what the support vector machine will do. If C is very, very large, then when minimizing this optimization objective, we're going to be highly motivated to choose a value so that this first term is equal to zero.

So let's try to understand the optimization problem in the context of what it would take to make this first term in the objective equal to zero, because, you know, maybe we'll set C to some huge constant, and this should give us additional intuition about what sort of hypotheses a support vector machine learns. So we saw already that whenever you have a training example with a label of y = 1, if you want to make that first term zero, what you need is to find a value of theta so that theta transpose x(i) is greater than or equal to 1. And similarly, whenever we have an example with label zero, in order to make sure that the cost, cost 0 of z, is zero, we need that theta transpose x(i) is less than or equal to -1.

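Consider the case where the constant C is very large, for example 100,000. When C is that large, minimizing the optimization objective strongly favors parameters that make the first term equal to zero, so only the second term, the regularization term, remains. To make the first term zero, every positive example (y = 1) needs z = θ^T x^(i) >= 1 and every negative example (y = 0) needs z = θ^T x^(i) <= -1. If this is unclear, take another look at the cost function graphs.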
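To see why a huge C pushes the first term to zero, it helps to evaluate the objective directly. The sketch below assumes the course's SVM objective, C times the sum of per-example costs plus half the sum of squared parameters, with hinge-style costs of slope 1 and an unregularized intercept θ0. The data and parameter values are made up.

```python
def cost1(z):
    # Assumed hinge-style cost for y = 1 (slope 1): zero once z >= 1.
    return max(0.0, 1.0 - z)


def cost0(z):
    # Assumed hinge-style cost for y = 0 (slope 1): zero once z <= -1.
    return max(0.0, 1.0 + z)


def svm_objective(theta, xs, ys, C):
    """C * (sum of per-example costs) + 0.5 * sum of theta_j^2 for j >= 1.

    theta[0] is the intercept and is left out of the regularization term.
    Each x in xs is a list of features without the leading 1.
    """
    data_term = 0.0
    for x, y in zip(xs, ys):
        z = theta[0] + sum(t * v for t, v in zip(theta[1:], x))
        data_term += cost1(z) if y == 1 else cost0(z)
    reg_term = 0.5 * sum(t * t for t in theta[1:])
    return C * data_term + reg_term


if __name__ == "__main__":
    xs, ys = [[-2.0], [2.0]], [0, 1]  # made-up one-feature data
    C = 100_000.0
    # Both margin constraints hold, so only the regularization term is left.
    print(svm_objective([0.0, 0.5], xs, ys, C))
    # Smaller theta, but both examples violate the margin and C punishes that.
    print(svm_objective([0.0, 0.25], xs, ys, C))
```

The second parameter vector has a smaller regularization term, but with C at 100,000 its margin violations dominate the objective, so the minimizer prefers the first.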

So, if we think of our optimization problem as now really choosing parameters so that this first term is equal to zero, what we're left with is the following optimization problem. We're going to minimize C times zero, because we're going to choose parameters so that the first term is equal to zero, plus one half and then, you know, that second term. This first term is C times zero, so let's just cross that out, because I know that's going to be zero. And this will be subject to the constraint that theta transpose x(i) is greater than or equal to one if y(i) is equal to one, and theta transpose x(i) is less than or equal to minus one whenever you have a negative example. It turns out that when you solve this optimization problem, when you minimize this as a function of the parameters theta, you get a very interesting decision boundary.

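So when the first term of the optimization problem is zero, what remains is the second term, the regularization term. It is minimized subject to the constraints θ^T x^(i) >= 1 whenever y^(i) = 1 and θ^T x^(i) <= -1 whenever y^(i) = 0. Minimizing this as a function of the parameters θ yields a very interesting decision boundary.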
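Written out, the step described above looks roughly as follows. The full objective is the one from the course's previous lecture, restated here as an assumption because this post does not show it; m is the number of training examples and n the number of features.

```latex
% Full SVM objective (assumed from the previous lecture of the course)
\min_{\theta} \; C \sum_{i=1}^{m} \left[ y^{(i)} \, \mathrm{cost}_1\!\left(\theta^T x^{(i)}\right)
  + \left(1 - y^{(i)}\right) \mathrm{cost}_0\!\left(\theta^T x^{(i)}\right) \right]
  + \frac{1}{2} \sum_{j=1}^{n} \theta_j^2

% With C very large, the first term is driven to zero, which leaves
\min_{\theta} \; \frac{1}{2} \sum_{j=1}^{n} \theta_j^2
\quad \text{s.t.} \quad
\theta^T x^{(i)} \ge 1 \ \text{if } y^{(i)} = 1, \qquad
\theta^T x^{(i)} \le -1 \ \text{if } y^{(i)} = 0
```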

Concretely, if you look at a data set like this with positive and negative examples, this data is linearly separable, and by that I mean that there exists, you know, a straight line, although there are many different straight lines, that can separate the positive and negative examples perfectly. For example, here is one decision boundary that separates the positive and negative examples, but somehow that doesn't look like a very natural one, right? Or, drawing an even worse one, here's another decision boundary that separates the positive and negative examples, but just barely. Neither of those seems like a particularly good choice.

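Concretely, the figure shows a data set of positive and negative examples. The data is linearly separable: a straight line can separate the positive and negative examples perfectly, and that line serves as a decision boundary. The magenta line looks very unnatural, and the green line only barely separates the classes. Neither seems like a good choice.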

The support vector machine will instead choose this decision boundary, which I'm drawing in black. And that seems like a much better decision boundary than either of the ones that I drew in magenta or in green. The black line seems like a more robust separator; it does a better job of separating the positive and negative examples. And mathematically, what that does is, this black decision boundary has a larger distance. That distance is called the margin. When I draw these two extra blue lines, we see that the black decision boundary has some larger minimum distance from any of my training examples, whereas the magenta and the green lines come awfully close to the training examples, and that seems to do a less good job of separating the positive and negative classes than my black line.

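The support vector machine chooses the black line as its decision boundary instead of the others. The black line looks like a much better decision boundary: it is more robust and separates the positive and negative examples better. Mathematically, the black decision boundary has a larger distance to the training examples, and this distance is called the margin. The two blue lines drawn on either side of the black decision boundary show it. The black line stays far from the training examples, whereas the magenta and green lines pass very close to them. So the black line does a better job of separating the positive and negative examples.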

And so this distance is called the margin of the support vector machine, and this gives the SVM a certain robustness, because it tries to separate the data with as large a margin as possible. So the support vector machine is sometimes also called a large margin classifier, and this is actually a consequence of the optimization problem we wrote down on the previous slide.

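The width of the band around the black line is therefore called the margin of the support vector machine. The SVM separates the data with as large a margin as possible, which is why it is also called a large margin classifier. This is a consequence of the SVM optimization problem.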
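The idea of a margin as the minimum distance from the boundary to any training example can be checked numerically. The sketch below uses the standard point-to-line distance |θ0 + w·x| / ||w||; the four data points and the two candidate boundaries are made up, standing in for the black line and a barely-separating line from the figure.

```python
import math


def margin(theta, points, labels):
    """Smallest signed distance from any example to the line theta0 + w . x = 0.

    The result is positive only if every example lies on its correct side.
    """
    w = theta[1:]
    norm = math.hypot(*w)
    distances = []
    for x, y in zip(points, labels):
        z = theta[0] + sum(wj * xj for wj, xj in zip(w, x))
        sign = 1.0 if y == 1 else -1.0
        distances.append(sign * z / norm)
    return min(distances)


# Made-up, linearly separable data with two features.
POINTS = [(3.0, 3.0), (4.0, 3.0), (1.0, 1.0), (0.0, 1.0)]
LABELS = [1, 1, 0, 0]

WIDE = [-4.0, 1.0, 1.0]    # line x1 + x2 = 4, runs midway between the classes
NARROW = [-1.5, 1.0, 0.0]  # line x1 = 1.5, passes close to a negative example

if __name__ == "__main__":
    print("wide boundary margin:  ", margin(WIDE, POINTS, LABELS))
    print("narrow boundary margin:", margin(NARROW, POINTS, LABELS))
```

Both lines separate the data perfectly, but the first keeps every example much farther away, which is the property the SVM is said to prefer.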

I know that you might be wondering how it is that the optimization problem I wrote down on the previous slide leads to this large margin classifier. I know I haven't explained that yet. And in the next video I'm going to sketch a little bit of the intuition about why that optimization problem gives us this large margin classifier. But this is a useful feature to keep in mind if you are trying to understand what sorts of hypotheses an SVM will choose. That is, trying to separate the positive and negative examples with as big a margin as possible.

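How the optimization problem leads to a large margin classifier has not been explained yet. The next lecture sketches why it does. For now, it is a useful property to keep in mind when thinking about which hypotheses an SVM chooses: it tries to separate the positive and negative examples with as large a margin as possible.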

I want to say one last thing about large margin classifiers in this intuition. We wrote out this large margin classification setting for the case when C, that regularization constant, was very large; I think I set that to a hundred thousand or something. So given a dataset like this, maybe we'll choose that decision boundary that separates the positive and negative examples with a large margin.

Now, the SVM is actually slightly more sophisticated than this large margin view might suggest. In particular, if all you're doing is using a large margin classifier, then your learning algorithm can be sensitive to outliers. So let's just add an extra positive example like that shown on the screen. If I had one such example, then it seems as if, to separate the data with a large margin, maybe I'll end up learning a decision boundary like that, right? That is the magenta line, and it's really not clear that, based on a single outlier, a single example, it's actually a good idea to change my decision boundary from the black one over to the magenta one.

So, if the regularization parameter C were very large, then this is actually what the SVM will do: it will change the decision boundary from the black to the magenta one. But if C were reasonably small, if you were to use a C that is not too large, then you still end up with this black decision boundary.

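One last point about the large margin classifier. The large margin picture was derived for the case where C, the regularization constant, is very large: given the data set, the SVM then chooses the decision boundary that separates the positive and negative examples with a large margin.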

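However, the SVM is slightly more sophisticated than the large margin view suggests. A pure large margin classifier is sensitive to outliers, examples that lie far from the rest of their class. If one extra positive example is added as in the figure, separating the data with a large margin would move the decision boundary from the black line to the magenta line. It is not clear that changing the decision boundary because of a single outlier is a good idea.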

So if the regularization parameter C is very large, the SVM does move the decision boundary from black to magenta. If C is reasonably small, not too large, the SVM still keeps the black decision boundary.
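The effect of C on a single outlier can be illustrated by comparing the objective value of two hand-picked candidates. This is only a sketch under assumptions: it uses hinge-style costs with slope 1, made-up one-feature data (so each "boundary" is a threshold on x), and it compares two fixed candidates rather than solving the optimization.

```python
def cost1(z):
    # Assumed hinge-style cost for y = 1 (slope 1): zero once z >= 1.
    return max(0.0, 1.0 - z)


def cost0(z):
    # Assumed hinge-style cost for y = 0 (slope 1): zero once z <= -1.
    return max(0.0, 1.0 + z)


def objective(theta, xs, ys, C):
    """C * (sum of costs) + 0.5 * theta1^2 for one feature; theta0 is not regularized."""
    theta0, theta1 = theta
    data_term = 0.0
    for x, y in zip(xs, ys):
        z = theta0 + theta1 * x
        data_term += cost1(z) if y == 1 else cost0(z)
    return C * data_term + 0.5 * theta1 ** 2


# Made-up data: negatives on the left, positives on the right,
# plus one positive outlier at x = -1 sitting next to the negatives.
XS = [-3.0, -2.0, 2.0, 3.0, -1.0]
YS = [0, 0, 1, 1, 1]

BLACK = (0.0, 0.5)    # threshold at x = 0, ignores the outlier
MAGENTA = (3.0, 2.0)  # threshold at x = -1.5, squeezed in to fit the outlier

if __name__ == "__main__":
    for C in (100_000.0, 1.0):
        black = objective(BLACK, XS, YS, C)
        magenta = objective(MAGENTA, XS, YS, C)
        winner = "magenta" if magenta < black else "black"
        print(f"C = {C:g}: black = {black:g}, magenta = {magenta:g}, "
              f"lower objective: {winner}")
```

With the large C, the one margin violation of the black candidate is so expensive that the magenta candidate wins; with C = 1, paying for that single violation is cheaper than the larger parameter the magenta candidate needs.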

And of course, if the data were not linearly separable, so if you had some positive examples in here, or some negative examples in here, then the SVM will also do the right thing. And so this picture of a large margin classifier is really the picture that gives better intuition only for the case when the regularization parameter C is very large. Just to remind you, C plays a role similar to one over lambda, where lambda is the regularization parameter we had previously. And so it's only if one over lambda is very large, or equivalently if lambda is very small, that you end up with things like this magenta decision boundary. But in practice, when applying support vector machines, when C is not very, very large like that, it can do a better job of ignoring the few outliers like here, and also do fine and do reasonable things even if your data is not linearly separable.

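Of course, even when the data is not linearly separable, with some positive examples among the negatives or some negative examples among the positives, the SVM still does the right thing. The large margin picture gives good intuition only for the case where the parameter C is very large. C plays a role similar to 1/λ, where λ is the regularization parameter used previously. It is only when 1/λ is very large, that is, when λ is very small, that something like the magenta decision boundary appears. In practice, when C is not that large, the support vector machine does a better job of ignoring a few outliers, and it behaves reasonably even when the data is not linearly separable.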
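The relationship between C and λ can be made explicit with a short derivation. Here A(θ) and B(θ) are shorthand introduced for this sketch, standing for the data-fit term and the regularization term; they are not notation from this post.

```latex
% A = data-fit term, B = regularization term (shorthand for this sketch)
\text{regularized form:} \quad \min_{\theta} \; A(\theta) + \lambda B(\theta)
\qquad
\text{SVM form:} \quad \min_{\theta} \; C \, A(\theta) + B(\theta)

% With C = 1/\lambda and \lambda > 0:
C \, A(\theta) + B(\theta) = \frac{1}{\lambda} \left( A(\theta) + \lambda B(\theta) \right)
% Scaling an objective by a positive constant does not move its minimizer,
% so a large C corresponds to a small \lambda (weak regularization).
```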

But when we talk about bias and variance in the context of support vector machines, which we'll do a little bit later, hopefully all of these trade-offs involving the regularization parameter will become clearer at that time. So I hope that gives some intuition about how this support vector machine functions as a large margin classifier that tries to separate the data with a large margin. Technically, this view is true only when the parameter C is very large, but it is a useful way to think about support vector machines.

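When bias and variance are discussed in the context of SVMs later, the trade-offs involving the regularization parameter should become clearer. Strictly speaking, the picture of the support vector machine separating the data with a large margin holds only when the parameter C is very large.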

There was one missing step in this video, which is: why is it that the optimization problem we wrote down on these slides actually leads to the large margin classifier? I didn't do that in this video. In the next video I will sketch a little bit more of the math behind that, to explain the reasoning of how the optimization problem we wrote out results in a large margin classifier.

One step was left out of this lecture: why does the optimization problem written here actually lead to a large margin classifier? This lecture does not explain it. The next lecture covers the math showing how the optimization problem results in a large margin classifier.

Andrew Ng's Machine Learning video lecture

Summary

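In a support vector machine, for a positive example (y = 1), Cost1(z) is zero only when z is greater than or equal to 1. For a negative example (y = 0), Cost0(z) is zero only when z is less than or equal to -1. Here z = θ^T x.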

To build intuition for the support vector machine, set the constant C to a very large value such as 100,000. The SVM optimization problem then simplifies as follows.

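\[
\min_{\theta} \; \frac{1}{2} \sum_{j=1}^{n} \theta_j^2
\quad \text{subject to} \quad
\begin{cases}
\theta^T x^{(i)} \ge 1 & \text{if } y^{(i)} = 1 \\
\theta^T x^{(i)} \le -1 & \text{if } y^{(i)} = 0
\end{cases}
\]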

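Binary classification with this objective produces a wide band, like a broad road, around the black decision boundary. The width of this band is called the margin of the support vector machine. Because the SVM tries to separate the data with as large a margin as possible, it gains a certain robustness. This is why the support vector machine is called a large margin classifier.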

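Of course, even when the data is not linearly separable, with some positive examples among the negatives or some negative examples among the positives, the SVM still does the right thing. When C is not too large, the support vector machine can ignore a few outliers and produce a better result, and it behaves reasonably even when the data is not linearly separable.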