by 라인하트 Nov 11. 2020

Andrew Ng's Machine Learning (12-1): Support Vector Machine Optimization

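Professor Andrew Ng, co-founder of the online learning platform Coursera, is a leading figure in the artificial intelligence field. The machine learning course he taught to beginners at Stanford University is available free of charge on Coursera (Coursera.org). It is an essential course for newcomers to machine learning, and one you naturally come across when studying artificial intelligence and machine learning on your own.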

Support Vector Machines

Large Margin Classification

Optimization Objective

By now, you've seen a range of different learning algorithms. With supervised learning, the performance of many algorithms will be pretty similar, and what matters less will often be whether you use learning algorithm A or learning algorithm B. What matters more will often be things like the amount of data you train these algorithms on, as well as your skill in applying them: things like your choice of the features you design to give to the learning algorithms, how you choose the regularization parameter, and things like that.

But there's one more algorithm that is very powerful and very widely used both within industry and academia, and that's called the support vector machine. Compared to both logistic regression and neural networks, the support vector machine, or SVM, sometimes gives a cleaner, and sometimes more powerful, way of learning complex non-linear functions. So let's take the next videos to talk about that. Later in this course, I will do a quick survey of a range of different supervised learning algorithms, just to very briefly describe them. But the support vector machine, given its popularity and how powerful it is, will be the last of the supervised learning algorithms that I'll spend a significant amount of time on in this course. As with our development of other learning algorithms, we're going to start by talking about the optimization objective.

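So far, we have covered a variety of learning algorithms. Because many supervised learning algorithms perform similarly, the amount of data and your skill in applying the algorithm often matter more than which algorithm you choose. Skill here means things like designing the features you feed to the learning algorithm and choosing the regularization parameter.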

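One algorithm that is very widely used in both industry and academia is the support vector machine (SVM). Compared with logistic regression and neural networks, the SVM sometimes gives a cleaner and sometimes more powerful way of learning complex non-linear functions. It is the last supervised learning algorithm that the course covers in depth; the remaining supervised learning algorithms are only surveyed briefly later in the course. This lecture covers the optimization objective.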

So, let's get started on this algorithm. In order to describe the support vector machine, I'm actually going to start with logistic regression, and show how we can modify it a bit and get what is essentially the support vector machine. So in logistic regression, we have our familiar form of the hypothesis there and the sigmoid activation function shown on the right. And in order to explain some of the math, I'm going to use z to denote theta transpose x.

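To describe the support vector machine, we start from logistic regression and show how to modify it. Here are the logistic regression hypothesis hθ(x) = g(θ^Tx) = 1/(1+e^(-θ^Tx)) and the sigmoid activation function g(z), where z = θ^Tx.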

Now let's think about what we would like logistic regression to do. If we have an example with y equal to one, and by this I mean an example in either the training set or the test set or the cross-validation set, then we're sort of hoping that h of x will be close to one. Right, we're hoping to correctly classify that example. And what having h of x close to one means is that theta transpose x must be much larger than 0. So that's the greater than, greater than sign, which means much, much greater than 0. And that's because it's when z, that is theta transpose x, is much bigger than 0, far to the right over here, that the output of logistic regression becomes close to one.

Conversely, if we have an example where y is equal to zero, then what we're hoping for is that the hypothesis will output a value close to zero. And that corresponds to theta transpose x, or z, being much less than zero, because that corresponds to the hypothesis outputting a value close to zero.

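Let's summarize what we want logistic regression to do. For an example with y = 1, whether it belongs to the training set, the cross-validation set, or the test set, the example is classified correctly when the hypothesis hθ(x) is close to 1.

That is, θ^Tx must be much greater than 0 (θ^Tx >> 0); the '>>' sign means "much greater than". The logistic regression output hθ(x) = g(z) is then close to 1.

Conversely, for an example with y = 0, the example is classified correctly when hθ(x) is close to 0. That requires θ^Tx to be much less than 0 (θ^Tx << 0), in which case the output hθ(x) = g(z) is close to 0.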
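As a quick numerical check of the behavior described above, here is a minimal Python sketch of the sigmoid function g(z). The function name `sigmoid` and the sample values of z are chosen only for illustration.

```python
import math


def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


# g(z) approaches 1 when z >> 0 and approaches 0 when z << 0.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"z = {z:6.1f}  ->  g(z) = {sigmoid(z):.5f}")
```

With z = θ^Tx, this is why a confident y = 1 prediction needs θ^Tx far to the right of 0, and a confident y = 0 prediction needs it far to the left.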

If you look at the cost function of logistic regression, what you'll find is that each example (x, y) contributes a term like this to the overall cost function, right? So for the overall cost function, we will also have a sum over all the training examples and the 1 over m term, but this expression here is the term that a single training example contributes to the overall objective function. Now if I take the definition for the full form of my hypothesis and plug it in over here, then what I get is that each training example contributes this term, ignoring the 1 over m, to my overall cost function for logistic regression.

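For each example (x, y) in the training set, the logistic regression cost is:

-(y log hθ(x) + (1 - y) log(1 - hθ(x)))

This is the cost for a single training example. The overall cost function sums this over all examples and divides by m to take the average. Here we drop both the sum and the 1/m factor and look at a single example.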
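The per-example cost and its average over the training set can be sketched as follows. This is a minimal illustration assuming plain Python lists for θ and x, with x[0] = 1 as the intercept feature; the names `example_cost` and `total_cost` and the toy data are hypothetical, and the regularization term is left out.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def example_cost(theta, x, y):
    """Cost contributed by one example: -y*log(h) - (1 - y)*log(1 - h)."""
    z = sum(t * xi for t, xi in zip(theta, x))  # z = theta^T x
    h = sigmoid(z)
    # Fine for moderate z; for very large |z| the log argument can round to 0.
    return -y * math.log(h) - (1 - y) * math.log(1.0 - h)


def total_cost(theta, X, Y):
    """Unregularized cost: the average of the per-example costs (the 1/m sum)."""
    return sum(example_cost(theta, x, y) for x, y in zip(X, Y)) / len(X)


# Toy data, made up for illustration.
X = [[1.0, 2.0], [1.0, -1.0]]
Y = [1, 0]
theta = [0.0, 1.0]
print(total_cost(theta, X, Y))
```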

Now let's consider the two cases of when y is equal to one and when y is equal to zero. In the first case, let's suppose that y is equal to 1. In that case, only this first term in the objective matters, because this one minus y term would be equal to zero if y is equal to one. So when y is equal to one in our example (x, y), what we get is this term: minus log of one over one plus e to the negative z, where, similar to the last slide, I'm using z to denote theta transpose x. And of course in the cost I should have this minus y that we just had, but if y is equal to one then that's equal to one, so I just simplify it away in the expression that I have written down here. And if we plot this function as a function of z, what you find is that you get this curve shown on the lower left of the slide.

And thus, we also see that when z is large, that is, when theta transpose x is large, that corresponds to a value of z that gives us a very, very small contribution to the cost function. And this kind of explains why, when logistic regression sees a positive example, with y = 1, it tries to set theta transpose x to be very large, because that corresponds to this term in the cost function being small.

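Consider the two cases y = 1 and y = 0. When y = 1, only the first term of the logistic regression cost function matters; the second term is 0 because of the (1 - y) factor.

Plotting the cost as a function of z gives the graph on the lower left. When z is large, that is, when θ^Tx is large, the corresponding cost is very small. This is why logistic regression tries to make θ^Tx very large for a positive example with y = 1: the cost -log(1/(1+e^(-z))) is then small.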
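To see the shape of the y = 1 curve without the plot, the cost can be tabulated for a few values of z. This is only a sketch; the function name `logistic_cost_y1` and the sample points are illustrative choices.

```python
import math


def logistic_cost_y1(z):
    """Cost of a y = 1 example: -log(1 / (1 + e^(-z))) = log(1 + e^(-z))."""
    return math.log(1.0 + math.exp(-z))


# The cost shrinks toward 0 as z grows and grows roughly linearly as z becomes negative.
for z in (-3.0, -1.0, 0.0, 1.0, 3.0, 6.0):
    print(f"z = {z:5.1f}  ->  cost = {logistic_cost_y1(z):.4f}")
```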

Now, to build the support vector machine, here's what we're going to do. We're going to take this cost function, this minus log of 1 over 1 plus e to the negative z, and modify it a little bit. Let me take this point 1 over here, and let me draw the cost function we're going to use. The new cost function is going to be flat from here on out, and then we draw something that grows as a straight line, similar to logistic regression. But this is going to be a straight line in this portion. So the curve that I just drew in magenta is a pretty close approximation to the cost function used by logistic regression, except it is now made up of two line segments: there's this flat portion on the right, and then there's this straight line portion on the left. And don't worry too much about the slope of the straight line portion. It doesn't matter that much. But that's the new cost function we're going to use for when y is equal to one, and you can imagine it should do something pretty similar to logistic regression. But it turns out that this will give the support vector machine computational advantages and give us, later on, an easier optimization problem that will be easier for software to solve. We just talked about the case of y equals one.

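Now let's build up the support vector machine. We take the cost function -log(1/(1+e^(-z))) and modify it slightly. Taking z = 1 as the reference point, we draw the new cost function in magenta. The new cost function is flat to the right of z = 1 and, to the left, is a straight line that tracks the logistic regression curve. The magenta line is a close approximation to the cost function used by logistic regression; the difference is that it is made up of two line segments. The exact slope of the straight-line portion does not matter much. This is the new cost function used when y = 1, and it does something very similar to logistic regression. It gives the support vector machine a computational advantage and leads to an optimization problem that is easier for software to solve. So far we have covered the case y = 1.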

The other case is if y is equal to zero. In that case, if you look at the cost, then only the second term will apply because the first term goes away, right? If y is equal to zero, then you have a zero here, so you're left only with the second term of the expression above. And so the cost of an example, or its contribution to the cost function, is going to be given by this term over here. And if you plot that as a function of z, with z on the horizontal axis, you end up with this curve. And for the support vector machine, once again, we're going to replace this blue line with something similar: a new cost that is flat out here, 0 out here, and that then grows as a straight line, like so.

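Now consider the case y = 0. When y = 0, the first term of the cost function J(θ) becomes 0 and only the second term remains.

We plot this as a function of z. For the support vector machine, the new cost (magenta) is flat at 0 along the horizontal axis up to z = -1, and from there it grows as a straight line that approximates the blue logistic regression curve.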

So let me give these two functions names. This function on the left I'm going to call cost subscript 1 of z, and this function on the right I'm going to call cost subscript 0 of z. And the subscript just refers to the cost corresponding to when y is equal to 1, versus when y is equal to zero. Armed with these definitions, we're now ready to build a support vector machine.

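Let's name the two functions. The function on the left is the cost when y = 1, called Cost1(z), and the function on the right is the cost when y = 0, called Cost0(z). We are now ready to build the support vector machine.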
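The two piecewise-linear costs can be sketched in Python as below. The kinks at z = 1 and z = -1 come from the lecture; the unit slope is an assumption made here for illustration (it matches the common hinge-loss form), since the lecture only says the slope does not matter much.

```python
def cost1(z, slope=1.0):
    """Cost for a y = 1 example: 0 for z >= 1, a straight line rising to the left of z = 1.

    The slope value is an assumption; the lecture leaves it unspecified.
    """
    return slope * max(0.0, 1.0 - z)


def cost0(z, slope=1.0):
    """Cost for a y = 0 example: 0 for z <= -1, a straight line rising to the right of z = -1."""
    return slope * max(0.0, 1.0 + z)


for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"z = {z:5.1f}  cost1 = {cost1(z):.1f}  cost0 = {cost0(z):.1f}")
```

Unlike the logistic curves, these costs are exactly zero once z passes the kink, so a y = 1 example only stops contributing once θ^Tx ≥ 1, and a y = 0 example once θ^Tx ≤ -1.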

Here's the cost function, J of theta, that we have for logistic regression. In case this equation looks a bit unfamiliar, it's because previously we had a minus sign outside, but here what I did was I instead moved the minus signs inside these expressions, so it just makes it look a little different. For the support vector machine, what we're going to do is essentially take this and replace it with cost 1 of z, that is, cost 1 of theta transpose x. And we're going to take this and replace it with cost 0 of z, that is, cost 0 of theta transpose x. Where the cost 1 function is what we had on the previous slide that looks like this. And the cost 0 function, again what we had on the previous slide, looks like this. So what we have for the support vector machine is a minimization problem of one over m times the sum of y(i) times cost 1 of theta transpose x(i), plus one minus y(i) times cost 0 of theta transpose x(i), and then plus my usual regularization term.

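Here is the cost function J(θ) for logistic regression. It may look slightly unfamiliar because the minus sign that used to sit outside has been moved inside each expression. For the support vector machine, we replace -log hθ(x^(i)) in the first term with Cost1(θ^Tx^(i)), and -log(1 - hθ(x^(i))) in the second term with Cost0(θ^Tx^(i)). Cost1() and Cost0() are the functions from the previous slide. Written for the support vector machine, the problem is to minimize over θ:

(1/m) Σ_{i=1..m} [ y^(i) Cost1(θ^Tx^(i)) + (1 - y^(i)) Cost0(θ^Tx^(i)) ] + (λ/2m) Σ_{j=1..n} θ_j^2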

Like so. Now, by convention, for the support vector machine, we actually write things slightly differently. We re-parameterize this just very slightly differently. First, we're going to get rid of the 1 over m terms, and this just happens to be a slightly different convention that people use for support vector machines compared to logistic regression. But here's what I mean. We're just going to get rid of these 1 over m terms, and this should give me the same optimal value of theta, right? Because 1 over m is just a constant, so whether I solve this minimization problem with 1 over m in front or not, I should end up with the same optimal value for theta. To give you an example, suppose I had a minimization problem: minimize over a real number u of (u minus five) squared plus one. Well, the minimum of this happens to be at u equals five. Now if I were to take this objective function and multiply it by 10, my minimization problem is min over u of 10 times (u minus five) squared plus 10. Well, the value of u that minimizes this is still u equals five, right? So multiplying something that you're minimizing by some constant, 10 in this case, does not change the value of u that minimizes the function. So in the same way, by crossing out the m, all I'm doing is multiplying my objective function by some constant m, and it doesn't change the value of theta that achieves the minimum.

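By convention, the support vector machine objective is written slightly differently. First, the 1/m terms are removed; this is simply the convention people use for SVMs. Because 1/m is a constant, the optimal value of θ is the same whether or not the objective is multiplied by 1/m. For example, (u-5)^2 + 1 is minimized at u = 5, and 10((u-5)^2 + 1) = 10(u-5)^2 + 10 is also minimized at u = 5. In other words, multiplying the objective by a positive constant does not change the value of u that minimizes it. In the same way, removing 1/m amounts to multiplying the objective by the constant m, so the θ that achieves the minimum is unchanged.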
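The (u-5)^2 + 1 example can be checked numerically. The sketch below uses a simple grid search, chosen here only for illustration (it is not how an SVM is optimized); `argmin_on_grid` is a hypothetical helper name.

```python
def f(u):
    """Objective from the lecture example: (u - 5)^2 + 1."""
    return (u - 5.0) ** 2 + 1.0


def argmin_on_grid(func, candidates):
    """Return the candidate value at which func is smallest."""
    return min(candidates, key=func)


grid = [i / 10 for i in range(0, 101)]  # 0.0, 0.1, ..., 10.0

u_star = argmin_on_grid(f, grid)
u_star_scaled = argmin_on_grid(lambda u: 10.0 * f(u), grid)

# Scaling by a positive constant changes the minimum value but not where it occurs.
print(u_star, f(u_star))
print(u_star_scaled, 10.0 * f(u_star_scaled))
```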

The second bit of notational change, which is, again, just the more standard convention when using SVMs instead of logistic regression, is the following. For logistic regression, we have two terms in the objective function. The first is this term, which is the cost that comes from the training set, and the second is this term, which is the regularization term. And the way we controlled the trade-off between these was by saying that what we want to minimize is A plus my regularization parameter lambda times some other term B, where I'm using A to denote this first term and B to denote the second term, without the lambda. So with this parameterization as A plus lambda B, by setting different values for the regularization parameter lambda, we could trade off the relative weight between how much we want to fit the training set well, that is, minimizing A, versus how much we care about keeping the values of the parameters small, that is, minimizing B. For the support vector machine, just by convention, we're going to use a different parameter. So instead of using lambda here to control the relative weight between the first and second terms, we're going to use a different parameter, which by convention is called C, and instead minimize C times A plus B.

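The second notational change is, again, simply the more standard convention when using SVMs rather than logistic regression. The logistic regression objective has two terms: the first is the cost from the training set and the second is the regularization term. Call the training cost A and the regularization term, without λ, B, so the objective is A + λB. By setting different values of the regularization parameter λ, we trade off how well we fit the training set (minimizing A) against keeping the parameter values small (minimizing B). For the support vector machine, by convention, a different parameter called C controls this relative weight: instead of A + λB, we minimize CA + B, where C plays a role similar to 1/λ.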

So for logistic regression, if we set a very large value of lambda, that means we give B a very high weight. Here, if we set C to be a very small value, then that corresponds to giving B a much larger weight than A. So this is just a different way of controlling the trade-off, just a different way of prioritizing how much we care about optimizing the first term versus how much we care about optimizing the second term. And if you want, you can think of the parameter C as playing a role similar to 1 over lambda. And it's not that these two equations or these two expressions will be equal when C equals 1 over lambda; that's not the case. It's rather that if C is equal to 1 over lambda, then these two optimization objectives should give you the same optimal value for theta. So, just filling that in, I'm going to cross out lambda here and write in the constant C there. So that gives us our overall optimization objective function for the support vector machine. And if you minimize that function, then what you have is the parameters learned by the SVM.

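In logistic regression, a very large λ gives B a very high weight. Here, setting C to a very small value likewise gives B a much larger weight than A. This is just a different way of controlling the trade-off between how much we care about optimizing the first term and how much we care about optimizing the second. The parameter C plays a role similar to 1/λ. That does not mean the two expressions A + λB and CA + B are equal. Rather, if C = 1/λ, the two optimization objectives give the same optimal value of θ. So we cross out λ and write the constant C instead, which gives the overall optimization objective for the support vector machine. Minimizing this function yields the parameters learned by the SVM.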
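The claim that A + λB and CA + B share a minimizer when C = 1/λ can be checked on a toy problem. In this sketch A(θ) = (θ - 3)^2 and B(θ) = θ^2 are made-up stand-ins for the training cost and the regularization term, θ is a single number, and the grid search is used only for illustration.

```python
def A(theta):
    """Stand-in for the training-set cost (hypothetical)."""
    return (theta - 3.0) ** 2


def B(theta):
    """Stand-in for the regularization term without lambda (hypothetical)."""
    return theta ** 2


lam = 0.5
C = 1.0 / lam

grid = [i / 100 for i in range(0, 301)]  # 0.00, 0.01, ..., 3.00

theta_lambda = min(grid, key=lambda t: A(t) + lam * B(t))
theta_c = min(grid, key=lambda t: C * A(t) + B(t))

# Same minimizer, different objective values: CA + B is (A + lam*B) scaled by C.
print(theta_lambda, A(theta_lambda) + lam * B(theta_lambda))
print(theta_c, C * A(theta_c) + B(theta_c))
```

The two objective values differ by the factor C, which is the sense in which the expressions are not equal even though they lead to the same θ.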

Finally, unlike logistic regression, the support vector machine doesn't output a probability. Instead, what we have is this cost function that we minimize to get the parameters theta, and what a support vector machine does is just make a prediction of y being equal to one or zero directly. So the hypothesis will predict one if theta transpose x is greater than or equal to zero, and it will predict zero otherwise. So having learned the parameters theta, this is the form of the hypothesis for the support vector machine. So that was a mathematical definition of what a support vector machine does.

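Finally, unlike logistic regression, the support vector machine does not output a probability. The support vector machine minimizes the following cost function over θ:

C Σ_{i=1..m} [ y^(i) Cost1(θ^Tx^(i)) + (1 - y^(i)) Cost0(θ^Tx^(i)) ] + (1/2) Σ_{j=1..n} θ_j^2

Minimizing this cost gives the parameters θ. The support vector machine then predicts y = 1 or y = 0 directly: the hypothesis hθ(x) predicts 1 if θ^Tx >= 0, and 0 otherwise.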
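Putting the pieces together, the objective and the prediction rule might be sketched as follows. This assumes unit-slope Cost1 and Cost0, plain Python lists, x[0] = 1 as the intercept feature, and a regularization sum that skips θ_0; the function names and toy data are hypothetical, and no optimizer is included.

```python
def cost1(z):
    return max(0.0, 1.0 - z)  # unit slope is an assumption


def cost0(z):
    return max(0.0, 1.0 + z)  # unit slope is an assumption


def dot(theta, x):
    return sum(t * xi for t, xi in zip(theta, x))


def svm_objective(theta, X, Y, C):
    """C * sum of per-example costs + (1/2) * sum of theta_j^2 for j >= 1."""
    data_term = sum(
        y * cost1(dot(theta, x)) + (1 - y) * cost0(dot(theta, x))
        for x, y in zip(X, Y)
    )
    reg_term = 0.5 * sum(t * t for t in theta[1:])  # theta[0] not regularized (assumed)
    return C * data_term + reg_term


def predict(theta, x):
    """Predict 1 if theta^T x >= 0, else 0; no probability is produced."""
    return 1 if dot(theta, x) >= 0 else 0


# Toy data, made up for illustration.
X = [[1.0, 2.0], [1.0, -2.0]]
Y = [1, 0]
theta = [0.0, 1.0]
print(svm_objective(theta, X, Y, C=1.0))
print([predict(theta, x) for x in X])
```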

In the next few videos, let's try to get some intuition about what this optimization objective leads to and what sort of hypotheses an SVM will learn, and we'll also talk about how to modify this just a little bit to learn complex nonlinear functions.

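The next few lectures build intuition for what this optimization objective leads to and what sort of hypotheses an SVM learns. They also explain how to modify the SVM slightly to learn complex non-linear functions.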

Andrew Ng's Machine Learning video lecture

Summary

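We have now covered many supervised learning algorithms. Because many of them perform similarly, which algorithm you use often matters less than the amount of data it is trained on and your skill in applying it. Skill here means designing the features to feed to the learning algorithm and choosing the regularization parameter.

Even so, there is one more algorithm worth studying in depth: the support vector machine (SVM). The support vector machine starts from the logistic regression cost function.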