brunch

You can make anything
by writing

C.S.Lewis

by 라인하트 Nov 20. 2020

앤드류 응의 머신러닝(12-5):SVM 커널 II

온라인 강의 플랫폼 코세라의 창립자인 앤드류 응 (Andrew Ng) 교수는 인공지능 업계의 거장입니다. 그가 스탠퍼드 대학에서 머신 러닝 입문자에게 한 강의를 그대로 코세라 온라인 강의 (Coursera.org)에서 무료로 배울 수 있습니다. 이 강의는 머신러닝 입문자들의 필수코스입니다. 인공지능과 머신러닝을 혼자 공부하면서 자연스럽게 만나게 되는 강의입니다.

Support Vector Machines

서포트 벡터 머신

Kernels (커넬)

Kernels II (커널 II)

In the last video, we started to talk about the kernels idea and how it can be used to define new features for the support vector machine. In this video, I'd like to throw in some of the missing details and, also, say a few words about how to use these ideas in practice. Such as, how they pertain to, for example, the bias variance trade-off in support vector machines.

지난 강의에서 커널과 서포트 벡터 머신을 위한 새로운 피처를 정의하는 방법을 설명했습니다. 이번 강의는 설명하지 않은 나머지 것들을 설명하고 실제로 사용하는 방법에 대해 설명할 것입니다. 예를 들어, 서포트 벡터 머신에서 편향과 분산 사이의 트레이드오프를 적용하는 법을 다룹니다.

In the last video, I talked about the process of picking a few landmarks. You know, l1, l2, l3 and that allowed us to define the similarity function also called the kernel or in this example if you have this similarity function this is a Gaussian kernel. And that allowed us to build this form of a hypothesis function.

지난 강의에 몇 가지 랜드마크를 선택하는 과정에 대해 설명했습니다. l^(1), l^(2), l^(3)를 통해 커널이라 불리는 유사도 함수를 정의했습니다. 유사도 함수인 가우스 커널로 구현한 가설을 다음과 같습니다.

But where do we get these landmarks from? Where do we get l1, l2, l3 from? And it seems, also, that for complex learning problems, maybe we want a lot more landmarks than just three of them that we might choose by hand. So in practice this is how the landmarks are chosen which is that given the machine learning problem. We have some data set of some some positive and negative examples. So, this is the idea here which is that we're gonna take the examples and for every training example that we have, we are just going to call it. We're just going to put landmarks as exactly the same locations as the training examples. So if I have one training example if that is x1, well then I'm going to choose this is my first landmark to be at xactly the same location as my first training example. And if I have a different training example x2. Well we're going to set the second landmark to be the location of my second training example. On the figure on the right, I used red and blue dots just as illustration, the color of this figure, the color of the dots on the figure on the right is not significant. But what I'm going to end up with using this method is I'm going to end up with m landmarks of l1, l2 down to l(m) if I have m training examples with one landmark per location of my per location of each of my training examples. And this is nice because it is saying that my features are basically going to measure how close an example is to one of the things I saw in my training set.

그렇다면 랜드마크는 어떻게 결정할까요? 복잡한 학습 문제는 3 개 이상의 랜드마크가 필요합니다. 실제로 머신러닝 문제에 따라 랜드마크를 선택하는 방법을 설명합니다. 여기 파지티브 예제 셋과 네거티브 예제 셋이 있습니다. 여기 있는 모든 학습 예제와 똑같은 위치에 랜드마크를 배치합니다. 학습 예제 x^(1)과 완전히 같은 위치에 첫 번째 랜드마크 l^(1)을 배치합니다. 학습 예제 x^(2)와 완전히 같은 위치에 두 번째 랜드마크 l^(2)를 배치합니다. 왼쪽 그림과 달리 오른쪽 그림은 빨간색 점과 파란색 점을 사용하였습니다. 점의 색상은 중요하지 않습니다. 학습 예제 x^(m)과 완전히 같은 위치에 m번째 랜드마크 l^(m)를 배치합니다. 따라서, l^(1), l^(2),..., ^(m)까지 랜드마크를 설정합니다. 각 학습 예제마다 하나의 위치가 지정되고, 각 위치마다 하나의 랜드마크를 지정합니다. 피처는 기본적으로 한 학습 예제가 학습 셋 안에 모든 예제와 얼마나 가까운지를 측정합니다.

So, just to write this outline a little more concretely, given m training examples, I'm going to choose the the location of my landmarks to be exactly near the locations of my m training examples. When you are given example x, and in this example x can be something in the training set, it can be something in the cross validation set, or it can be something in the test set. Given an example x we are going to compute, you know, these features as so f1, f2, and so on. Where l1 is actually equal to x1 and so on. And these then give me a feature vector. So let me write f as the feature vector. I'm going to take these f1, f2 and so on, and just group them into feature vector. Take those down to fm. And, you know, just by convention. If we want, we can add an extra feature f0, which is always equal to 1. So this plays a role similar to what we had previously. For x0, which was our interceptor.

구체적으로 m개의 학습 예제가 주어졌을 때, m 개의 학습 예제의 위치와 정확히 같은 위치에 랜드마크의 위치를 결정합니다. 학습 예제 x는 학습 셋, 교차 검증 셋 또는 테스트 셋 중에 하나일 수 있습니다. 학습 예제 x가 있을 때 피처 f1, f2,..., fm을 계산합니다. f1 = similarity(x,ㅣ^(1)) 에서 l^(1) = x^(1)입니다. 피처 벡터 f = [f0, f1; f2;...; fm]를 만들 수 있습니다. 관례에 따라 f0를 추가합니다. 인터셉트 항 x0에 대한 피처 f0는 항상 1입니다.

So, for example, if we have a training example x(i), y(i), the features we would compute for this training example will be as follows: given x(i), we will then map it to, you know, f1(i). Which is the similarity. I'm going to abbreviate as SIM instead of writing out the whole word similarity, right? And f2(i) equals the similarity between x(i) and l2, and so on, down to fm(i) equals the similarity between x(i) and l(m). And somewhere in the middle. Somewhere in this list, you know, at the i-th component, I will actually have one feature component which is f subscript i(i), which is going to be the similarity between x and l(i). Where l(i) is equal to x(i), and so you know fi(i) is just going to be the similarity between x and itself. And if you're using the Gaussian kernel this is actually e to the minus 0 over 2 sigma squared and so, this will be equal to 1 and that's okay. So one of my features for this training example is going to be equal to 1.

예를 들어, 훈련 예제 (x^(i), y^(i))가 있을 때 Feature는 다음과 같이 계산할 수 있습니다. x^(i)에 대해 유사도 함수 f1^(i)를 매핑합니다. 여기서는 Similarity를 sim으로 축약할 것입니다.

그리고, 학습 예제의 피처 중에 하나는 반드시 1입니다.

And then similar to what I have above. I can take all of these m features and group them into a feature vector. So instead of representing my example, using, you know, x(i) which is this what R(n) plus R(n) one dimensional vector. Depending on whether you can set terms, is either R(n) or R(n) plus 1. We can now instead represent my training example using this feature vector f. I am going to write this f superscript i. Which is going to be taking all of these things and stacking them into a vector. So, f1(i) down to fm(i) and if you want and well, usually we'll also add this f0(i), where f0(i) is equal to 1. And so this vector here gives me my new feature vector with which to represent my training example. So given these kernels and similarity functions, here's how we use a support vector machine.

여기 m개의 피처를 모두 모아서 피처 벡터로 그룹화할 수 있습니다. 학습 예제 x^(i)는 R^(n) 또는 R^(n+1) 차원 벡터입니다. Feature 벡터 f를 사용하여 학습 예제를 나타낼 수 있습니다. f^(i) = [f0^(i); f1^(i); f2^(i);...; fm^(i)]입니다. 여기서 f0^(i)도 추가합니다. f0^(i)는 1입니다. f^(i)는 새로운 학습 에제를 나태는 Featuer 벡터입니다. 이런 커널과 유사도 함수를 고려할 때 서포트 벡터 머신을 사용하는 방법입니다.

If you already have a learning set of parameters theta, then if you given a value of x and you want to make a prediction. What we do is we compute the features f, which is now an R(m) plus 1 dimensional feature vector. And we have m here because we have m training examples and thus m landmarks and what we do is we predict 1 if theta transpose f is greater than or equal to 0. Right. So, if theta transpose f, of course, that's just equal to theta 0, f0 plus theta 1, f1 plus dot dot dot, plus theta m f(m). And so my parameter vector theta is also now going to be an m plus 1 dimensional vector. And we have m here because where the number of landmarks is equal to the training set size. So m was the training set size and now, the parameter vector theta is going to be m plus one dimensional.

이미 파라미터 θ가 있는 경우 x의 값이 주어지면 R^(m+1) 차원의 벡터 피처 f를 계산하여 예측합니다. m 개의 학습 예제와 m 개의 랜드마크가 있을 때 θ^Tf >= 0 면 y = 1을 예측합니다. θ^Tf = θ0f0 + θ1f1 + θ2f2 +... + θmfm입니다. 그래서, 파라미터 벡터 θ는 R^(m+1) 차원 벡터가 될 것입니다. 여기서는 n이 아니라 m을 사용하였습니다. 왜냐하면 랜드 마크의 수가 훈련 세트의 수와 같기 때문입니다. m은 훈련 세트의 수이고 파라미터 벡터 θ는 R^(m+1) 차원 벡터입니다.

So that's how you make a prediction if you already have a setting for the parameter's theta. How do you get the parameter's theta? Well you do that using the SVM learning algorithm, and specifically what you do is you would solve this minimization problem. You've minimized the parameter's theta of C times this cost function which we had before. Only now, instead of looking there instead of making predictions using theta transpose x(i) using our original features, x(i). Instead we've taken the features x(i) and replace them with a new features so we are using theta transpose f(i) to make a prediction on the i'f training examples and we see that, you know, in both places here and it's by solving this minimization problem that you get the parameters for your Support Vector Machine.

지금까지 파라미터 θ 의 값을 알고 있을 때 y의 값을 예측하는 방법이었습니다. 그렇다면 파라미터 θ는 어떻게 계산할까요? SVM 학습 알고리즘은 최소화 문제를 해결하기 위해 비용 함수가 있습니다.

파라미터 θ의 값을 최소화하는 비용 함수 식에서 원래 피처 x^(i)를 활용한 θ^Tx^(i)를 사용하는 대신에 새로운 피처 f^(i)를 활용한 θ^Tf^(i)를 사용합니다.

여기에서 최소화 문제를 해결함으로써 서포트 벡터 머신의 파라미터를 얻을 수 있습니다.

And one last detail is because this optimization problem we really have n equals m features. That is here. The number of features we have. Really, the effective number of features we have is dimension of f. So that n is actually going to be equal to m. So, if you want to, you can think of this as a sum, this really is a sum from j equals 1 through m. And then one way to think about this, is you can think of it as n being equal to m, because if f isn't a new feature, then we have m plus 1 features, with the plus 1 coming from the interceptor. And here, we still do sum from j equal 1 through n, because similar to our earlier videos on regularization, we still do not regularize the parameter theta zero, which is why this is a sum for j equals 1 through m instead of j equals zero though m. So that's the support vector machine learning algorithm. That's one sort of, mathematical detail aside that I should mention, which is that in the way the support vector machine is implemented, this last term is actually done a little bit differently.

마지막으로 최적화 문제는 n과 m 이 같습니다. m은 피처의 수이고 피처 벡터 f의 차원입니다. 실제로 n = m입니다. j = 1에서 m까지의 합입니다. f0를 고려하면 m + 1 Feature의 수입니다. 정규화 항에서 j = 1에서 n까지입니다. 파라미터 θ0는 정규화하지 않기 때문입니다. 이것이 j에 대한 합이 0부터 m까지 대신에 1부터 m까지인 이유입니다. 이것은 서포트 벡터 머신 학습 알고리즘입니다. 수학적으로 서포트 벡터 머신이 구현되는 방식입니다. 마지막 항은 실제로 약간 다르게 수행됩니다.

So you don't really need to know about this last detail in order to use support vector machines, and in fact the equations that are written down here should give you all the intuitions that should need. But in the way the support vector machine is implemented, you know, that term, the sum of j of theta j squared right? Another way to write this is this can be written as theta transpose theta if we ignore the parameter theta 0. So theta 1 down to theta m. Ignoring theta 0. Then this sum of theta j squared that this can also be written theta transpose theta. And what most support vector machine implementations do is actually replace this theta transpose theta, will instead, theta transpose times some matrix inside, that depends on the kernel you use, times theta.

서포트 벡터 머신을 사용하기 위한 마지막 정규화 항에 대한 부분은 실제로 알 필요가 없습니다. 여기에 적힌 방정식이 감각적으로 이해할 수 있는 직관을 제공합니다. 서포트 벡터 머신을 구현할 때 j에 대한 합산 Σθj^j 이 맞나요? 이것과 같은 결과를 도출하는 또 다른 방법은 파라미터 θ0를 무시한다면, 파라미터 벡터 θ = [θ1; θ2;...; θm]를 이용하여 θ^Tθ로 정의할 수 있습니다. 즉,

대부분의 서포트 벡터 머신 구현은 θ^Tθ를 사용합니다. θ^Tθ는 실제로 사용하는 커널에 따라 달라집니다.

And so this gives us a slightly different distance metric. We'll use a slightly different measure instead of minimizing exactly the norm of theta squared means that minimize something slightly similar to it. That's like a rescale version of the parameter vector theta that depends on the kernel. But this is kind of a mathematical detail. That allows the support vector machine software to run much more efficiently. And the reason the support vector machine does this is with this modification. It allows it to scale to much bigger training sets.

θ^TMθ는 약간 다른 거리 측정법을 제공합니다. ||θ||^2를 정확히 최소화하는 대신에 약간 다른 측정값을 사용하여 약간 유사한 무엇인가를 최소화합니다. θ^TMθ은 커널에 따른 파라미터 θ의 리스케일 버전입니다. 수학적으로 θ^TMθ 을 설명할 수 있습니다. 서포트 벡터 머신 소프트웨어를 더 효율적으로 활용할 수 있는 방법입니다. 서포트 벡터 머신은 수정을 할 수 있는 방법이기도 합니다.

Because for example, if you have a training set with 10,000 training examples. Then, you know, the way we define landmarks, we end up with 10,000 landmarks. And so theta becomes 10,000 dimensional. And maybe that works, but when m becomes really, really big then solving for all of these parameters, you know, if m were 50,000 or a 100,000 then solving for all of these parameters can become expensive for the support vector machine optimization software, thus solving the minimization problem that I drew here. So kind of as mathematical detail, which again you really don't need to know about. It actually modifies that last term a little bit to optimize something slightly different than just minimizing the norm squared of theta squared, of theta. But if you want, you can feel free to think of this as an kind of a n implementational detail that does change the objective a bit, but is done primarily for reasons of computational efficiency, so usually you don't really have to worry about this.

예를 들어, 더 큰 학습 세트로 확장할 수 있습니다. m =10,000개의 학습 예제가 있는 학습 셋이 있다고 가정합니다. 랜드마크도 10,000개를 설정합니다. 파라미터 벡터 θ는 10,000차원입니다. m이 정말 커지면 m = 50,000 또는 m = 100,000 일 때 모든 파라미터를 계산하면 서포트 벡터 머신 최적화 소프트웨어 비용이 많이 들 것입니다. 여기서 최소화 문제를 해결합니다. 중요하지 않으므로 수학적인 세부사항을 알 필요는 없습니다. 실제로 마지막 항을 약간 수정하여 ||θ||^2를 최소화하는 것과 약간 다른 것을 최적화합니다. 그러나 θ^TMθ을 목표를 약간 변경하는 구현의 한 방법으로 생각할 수 있지만, 주로 계산의 효율적으로 하기 위해 사용합니다. 일반적으로 이것에 대해 걱정할 필요는 없습니다.

And by the way, in case your wondering why we don't apply the kernel's idea to other algorithms as well like logistic regression, it turns out that if you want, you can actually apply the kernel's idea and define the source of features using landmarks and so on for logistic regression. But the computational tricks that apply for support vector machines don't generalize well to other algorithms like logistic regression. And so, using kernels with logistic regression is going too very slow, whereas, because of computational tricks, like that embodied and how it modifies this and the details of how the support vector machine software is implemented, support vector machines and kernels tend go particularly well together. Whereas, logistic regression and kernels, you know, you can do it, but this would run very slowly. And it won't be able to take advantage of advanced optimization techniques that people have figured out for the particular case of running a support vector machine with a kernel.

그런데 왜 커널을 로지스틱 회귀와 같은 다른 알고리즘에 적용하지 않을까요? 로지스틱 회귀에 커널 아이디어를 적용하고 랜드마크를 사용하여 피처를 정의할 수 있습니다. 그러나 서포트 벡터 머신에 적용되는 계산 방식은 로지스틱 회귀와 같은 다른 알고리즘에 일반화하지 않습니다. 로지스틱 회귀와 함께 커널을 사용하는 것은 계산이 너무 느리지만 서포트 벡터 머신과 함께 커널을 사용하는 것은 계산이 빠르기 때문입니다. 커널을 사용하여 서포트 벡터 머신을 실행하는 특정 사례에 맞는 고급 최적화 기술을 활용할 수 없습니다.

But all this pertains only to how you actually implement software to minimize the cost function. I will say more about that in the next video, but you really don't need to know about how to write software to minimize this cost function because you can find very good off the shelf software for doing so. And just as, you know, I wouldn't recommend writing code to invert a matrix or to compute a square root, I actually do not recommend writing software to minimize this cost function yourself, but instead to use off the shelf software packages that people have developed and so those software packages already embody these numerical optimization tricks, so you don't really have to worry about them.

그러나 이 모든 것은 비용 함수를 최소화기 위한 소프트웨어를 구현하는 방법에 국한된 것입니다. 다음 강의에서 더 자세히 설명하겠지만, 비용 함수를 최소화하기 위해 소프트웨어를 구현하는 방법을 알 필요는 없습니다. 이미 만들어진 좋은 소프트웨어가 있기 때문입니다. 역행렬을 구하거나 제곱근을 계산하는 코드를 작성하는 것은 권장하지 않습니다. 비용 함수를 최소화하기 위한 소프트웨어를 만드는 대신에 이미 만들어진 좋은 소프트웨어 패키지를 사용하는 것이 좋습니다. 이런 패키지는 수치 최적화 수법을 구현하여 실제로 고민할 필요는 없습니다.

But one other thing that is worth knowing about is when you're applying a support vector machine, how do you choose the parameters of the support vector machine? And the last thing I want to do in this video is say a little word about the bias and variance trade offs when using a support vector machine. When using an SVM, one of the things you need to choose is the parameter C which was in the optimization objective, and you recall that C played a role similar to 1 over lambda, where lambda was the regularization parameter we had for logistic regression. So, if you have a large value of C, this corresponds to what we have back in logistic regression, of a small value of lambda meaning of not using much regularization. And if you do that, you tend to have a hypothesis with lower bias and higher variance. Whereas if you use a smaller value of C then this corresponds to when we are using logistic regression with a large value of lambda and that corresponds to a hypothesis with higher bias and lower variance. And so, hypothesis with large C has a higher variance, and is more prone to overfitting, whereas hypothesis with small C has higher bias and is thus more prone to underfitting. So this parameter C is one of the parameters we need to choose.

하지만, 서포트 벡터 머신을 적용할 때 파라미터를 어떻게 선택해야 할까요? 이번 강의에서 마지막으로 다룰 것은 서포트 벡터 머신을 사용할 때 편향과 분산의 트레이드오프입니다. SVM을 사용할 때 최적화 목표에 있는 파라미터 C를 선택합니다. C는 1/λ과 유사한 역할을 수행합니다. 여기서 λ는 로지스틱 회귀의 정규화 파라미터입니다. 따라서 C값이 크면 로지스틱 회귀에서 얻은 값과 대응하여 정규화를 위한 λ의 값이 작다는 것을 의미합니다. 편향이 낮고 분산이 더 높은 가설을 갖는 경향이 있습니다. 반면에 C값이 작으면 로지스틱 회귀를 사용할 때 해당하는 편향이 높고 분산이 낮은 가설에 해당합니다. C가 크면 분산이 커서 과적합이 되는 경향이 있고, C가 작으면 편향이 높고 분산이 낮아서 가설은 과소 적합이 됩니다. 피라미터 C는 공식에서 선택해야 하는 파라미터 중에 하나입니다.

The other one is the parameter sigma squared, which appeared in the Gaussian kernel. So if the Gaussian kernel sigma squared is large, then in the similarity function, which was this you know E to the minus x minus landmark varies squared over 2 sigma squared. In this one of the example; If I have only one feature, x1, if I have a landmark there at that location, if sigma squared is large, then, you know, the Gaussian kernel would tend to fall off relatively slowly and so this would be my feature f(i), and so this would be smoother function that varies more smoothly, and so this will give you a hypothesis with higher bias and lower variance, because the Gaussian kernel that falls off smoothly, you tend to get a hypothesis that varies slowly, or varies smoothly as you change the input x.

또 다른 파라미터는 가우시안 커널에 있는 σ^2입니다.

가우시안 터널의 σ^2의 값이 크면 유사도 함수 exp()에서 (|| x - l^(1)||^2 인 분모 값에 영향을 미칩니다. 예를 들면, 학습 예제가 피처 x1의 특징만 있다면 그 위치에 랜드마크가 있고 σ^2의 값이 크면 상대적으로 f1는 느리고 매끄럽게 변하는 부드러운 곡선의 함수입니다. 따라서 더 높은 편향과 낮은 분산을 가진 가설입니다. 왜냐하면 매끄럽게 떨어지는 가우스 터널은 느리게 변하는 가설을 얻는 경향이 있기 때문입니다.

Whereas in contrast, if sigma squared was small and if that's my landmark given my 1 feature x1, you know, my Gaussian kernel, my similarity function, will vary more abruptly. And in both cases I'd pick out 1, and so if sigma squared is small, then my features vary less smoothly. So if it's just higher slopes or higher derivatives here. And using this, you end up fitting hypotheses of lower bias and you can have higher variance. And if you look at this week's points exercise, you actually get to play around with some of these ideas yourself and see these effects yourself.

대조적으로 σ^2d이 작고 학습 예제가 피처 x1의 특징만 있다면 그 위치에 랜드마크가 있고 σ^2의 값이 크면 상대적으로 f1는 더 높은 기울기로 급하게 변하는 곡선의 함수입니다. 따러서, 더 낮은 편향과 더 높은 분산을 가진 가설을 얻습니다. 실제 옥타브 실습에서 이러한 효과를 직접 확인할 것입니다.

So, that was the support vector machine with kernels algorithm. And hopefully this discussion of bias and variance will give you some sense of how you can expect this algorithm to behave as well.

그래서, 지금까지 커널 알고리즘이 있는 서포트 벡터 머신입니다. 편향과 분산에 대한 논의가 알고리 즘이 어떻게 동작하는지에 대한 이해를 줄 것입니다.