brunch

You can make anything
by writing

C.S.Lewis

by 라인하트 Dec 11. 2020

앤드류 응의 머신러닝(16-4): 협업 필터링 알고리즘

온라인 강의 플랫폼 코세라의 창립자인 앤드류 응 (Andrew Ng) 교수는 인공지능 업계의 거장입니다. 그가 스탠퍼드 대학에서 머신 러닝 입문자에게 한 강의를 그대로 코세라 온라인 강의 (Coursera.org)에서 무료로 배울 수 있습니다. 이 강의는 머신러닝 입문자들의 필수코스입니다. 인공지능과 머신러닝을 혼자 공부하면서 자연스럽게 만나게 되는 강의입니다.

Recommender Systems

(추천 시스템)

Collaborative Filtering (협업 필터링)

Collaborative Filtering Algorithm (협업 필터링 알고리즘)

In the last couple videos, we talked about the ideas of how, first, if you're given features for movies, you can use that to learn parameters data for users. And second, if you're given parameters for the users, you can use that to learn features for the movies. In this video we're going to take those ideas and put them together to come up with a collaborative filtering algorithm.

지난 강의에서 영화 추천 알고리즘을 배웠습니다. 콘텐츠 기반 추천 알고리즘은 영화의 특성을 나타내는 피처가 있을 떄 사용자의 취향을 나타내는 파라미터를 추정하는 방법이고, 협업 필터링 알고리즘은 사용자의 취향을 나타내는 파라미터가 있을 때 영화의 특성을 나타내는 피처를 추정하는 방법입니다. 이번 강의에서 두 아이디어를 합쳐서 좀 더 나은 협업 필터링 알고리즘을 만듭니다. 영화의 특징에 대한 피처 벡터를 영화 프로파일이라고 하고, 사용자의 영화 취향을 사용자 프로파일이라 합니다.

So one of the things we worked out earlier is that if you have features for the movies then you can solve this minimization problem to find the parameters theta for your users. And then we also worked that out, if you are given the parameters theta, you can also use that to estimate the features x, and you can do that by solving this minimization problem. So one thing you could do is actually go back and forth. Maybe randomly initialize the parameters and then solve for theta, solve for x, solve for theta, solve for x. But, it turns out that there is a more efficient algorithm that doesn't need to go back and forth between the x's and the thetas, but that can solve for theta and x simultaneously.

수학적으로 다시 정리합니다. 영화 프로파일 x^(i)이 있을 때 사용자 프로파일 θ^(j)를 추정할 수 있고, 사용자 프로파일 θ^(j)가 있을 때 영화 프로파일 x^(i)를 추정할 수 있습니다. 실제 평가에 근사한 추정을 위해 최소화 문제를 반복합니다. 파라미터의 값을 무작위로 초기하고 θ를 추정하고, x를 추정하고, 다시 θ를 추정하고, x를 추정합니다. 하지만, 가장 효율적인 방법은 이렇게 반복하는 것이 아니라 동시에 x와 θ를 추정하는 것입니다.

And here it is. What we are going to do, is basically take both of these optimization objectives, and put them into the same objective. So I'm going to define the new optimization objective j, which is a cost function, that is a function of my features x and a function of my parameters theta. And, it's basically the two optimization objectives I had on top, but I put together.

여기 수학 공식이 있습니다. 기본적으로 두 가지 최적화 목표를 하나의 식으로 만들었습니다. 비용 함수 J의 최적화 목표를 정의합니다. 이것은 피처 x의 함수이자 파라미터 θ의 함수입니다.

So, in order to explain this, first, I want to point out that this term over here, this squared error term, is the same as this squared error term and the summations look a little bit different, but let's see what the summations are really doing.

The first summation is sum over all users J and then sum over all movies rated by that user. So, this is really summing over all pairs IJ, that correspond to a movie that was rated by a user. Sum over J says, for every user, the sum of all the movies rated by that user. This summation down here, just does things in the opposite order. This says for every movie I, sum over all the users J that have rated that movie and so, you know these summations, both of these are just summations over all pairs ij for which r of i J is equal to 1. It's just something over all the user movie pairs for which you have a rating and so those two terms up there is just exactly this first term, and I've just written the summation here explicitly, where I'm just saying the sum of all pairs IJ, such that RIJ is equal to 1. So what we're going to do is define a combined optimization objective that we want to minimize in order to solve simultaneously for x and theta.

비용 함수 J의 첫 항은 오차의 제곱항입니다. 첫 번째 식의 오차의 제곱항과 두 번째 식의 오차의 제곱항을 합친 것입니다. 첫 번째 식의 오차의 제곱항은 모든 사용자 j의 합계이자 해당 사용자가 평가한 모든 영화의 합계입니다. r(i, j) = 1이므로 사용자 j가 평가를 한 모든 영화 i를 합산한 것입니다. 두 번째 식의 오차의 제곱항은 모든 영화의 합계이자 해당 영화를 평가한 모든 사용자의 합계입니다. 두 가지 모두 r(i, j) =1 일 때 사용자 j와 영화 i의 쌍에 대한 합계입니다. 평가가 이루어진 모든 사용자와 모든 영화의 쌍입니다. 따라서, 이것을 Σ는 (i, j):r(i, j)=1와 같습니다. 따라서, 최소화하려는 최적화 목표는 다음과 같이 정의합니다.

And then the other terms in the optimization objective are this, which is a regularization in terms of theta. So that came down here and the final piece is this term which is my optimization objective for the x's and that became this. And this optimization objective j actually has an interesting property that if you were to hold the x's constant and just minimize with respect to the thetas then you'd be solving exactly this problem, whereas if you were to do the opposite, if you were to hold the thetas constant, and minimize j only with respect to the x's, then it becomes equivalent to this. Because either this term or this term is constant if you're minimizing only the respective x's or only respective thetas.

비용 함수 J의 두 번째 항은 두 번째 식의 두 번째 항으로 파라미터 θ에 대한 정규화 항입니다. 비용 함수 J의 세 번째 항은 첫 번째 식의 두 번째 항으로 파리 머터 x에 대한 정규화 항입니다. 상수 x를 고정하고 θ를 최소화하면서 최적화 목표 J를 구하는 것과 상수 θ를 고정하고 x를 최소화하면서 최적화 목표 J를 구하는 것은 동일합니다. 왜냐하면 두 항은 상수로 값을 일정하게 유지하기 때문입니다.

So here's an optimization objective that puts together my cost functions in terms of x and in terms of theta. And in order to come up with just one optimization problem, what we're going to do, is treat this cost function, as a function of my features x and of my user pro user parameters data and just minimize this whole thing, as a function of both the Xs and a function of the thetas. And really the only difference between this and the older algorithm is that, instead of going back and forth, previously we talked about minimizing with respect to theta then minimizing with respect to x, whereas minimizing with respect to theta, minimizing with respect to x and so on. In this new version instead of sequentially going between the 2 sets of parameters x and theta, what we are going to do is just minimize with respect to both sets of parameters simultaneously.

여기에 θ와 x를 최적화하는 목표가 있습니다. 최적화 문제를 단일 공식으로 만들기 위해 콘텐츠의 특성 피처 x와 사용자 프로파일 파라미터 θ를 활용해 비용 함수 J를 최소화합니다. 새로운 알고리즘과 과거 알고리즘과 지금 알고리즘의 차이는 x를 최소화하고 나서 θ를 최소화하는 작업을 반복적으로 수행하는 것이 아니라 x와 θ를 동시에 최소화합니다.

Finally one last detail is that when we're learning the features this way. Previously we have been using this convention that we have a feature x0 equals one that corresponds to an interceptor. When we are using this sort of formalism where we're are actually learning the features, we are actually going to do away with this convention. And so the features we are going to learn x, will be in Rn. Whereas previously we had features x and Rn + 1 including the intercept term. By getting rid of x0 we now just have x in Rn. And so similarly, because the parameters theta is in the same dimension, we now also have theta in RN because if there's no x0, then there's no need parameter theta 0 as well. And the reason we do away with this convention is because we're now learning all the features, right? So there is no need to hard code the feature that is always equal to one. Because if the algorithm really wants a feature that is always equal to 1, it can choose to learn one for itself. So if the algorithm chooses, it can set the feature X1 equals 1. So there's no need to hard code the feature of 001, the algorithm now has the flexibility to just learn it by itself.

마지막으로 고려해야 할 세부사항은 인터셉터 x0입니다. 실제로 추천 알고리즘에서 피처 x0를 사용하지 않습니다. 피처 x 0가 없으면 가 R^(n) 벡터로 설명하고 x0를 포함하면 R^(n+1) 벡터로 설명합니다. 여기서 x0를 제거하면 피처 x는 R^(n) 차원 벡터입니다. 마찬가지로 파라미터 θ도 동일 차원이기 때문에 θ^0는 필요하지 않습니다. R^(n) 차원 벡터입니다. 여기서 이 규칙을 제거하는 이유는 알고리즘이 모든 피처를 학습하기 때문입니다. 항상 1의 값을 가지는 인터셉터항을 유지할 필요가 없습니다. 알고리즘이 항상 1과 같은 피처나 파라미터를 원한다면 자체적으로 학습할 것이기 때문입니다. 따라서, 알고리즘이 필요하다면 x1을 1로 설정할 수 있습니다. 따라서 알고리즘은 스스로 학습할 수 있는 유연성을 가졌습니다.

So, putting everything together, here is our collaborative filtering algorithm first we are going to initialize x and theta to small random values. And this is a little bit like neural network training, where there we were also initializing all the parameters of a neural network to small random values. Next we're then going to minimize the cost function using great intercepts or one of the advance optimization algorithms. So, if you take derivatives you find that the great intercept like these and so this term here is the partial derivative of the cost function, I'm not going to write that out, with respect to the feature value Xik and similarly this term here is also a partial derivative value of the cost function with respect to the parameter theta that we're minimizing.

지금까지 이야기한 모든 것을 조합하면 협력 필터링 알고리즘을 개발할 수 있습니다. 첫 번째로 x와 θ를 임의의 작은 값으로 초기화합니다. 이것은 인공 신경망에서 학습하는 것과 약간 비슷합니다. 인공 신경망의 무든 파라미터를 임의의 작은 값으로 초기화했습니다. 두 번째로 고급 최적화 알고리증 하나를 사용하여 비용 함수 J를 최소화합니다. 피처 파라미터 x에 대한 비용 함수 J의 편미분입니다 또한, 최소화하기 위한 파라미터 θ에 대한 비용 함수 J의 편미분 값입니다.

And just as a reminder, in this formula that we no longer have this X0 equals 1 and so we have that x is in Rn and theta is a Rn. In this new formalism, we're regularizing every one of our perimeters theta, you know, every one of our parameters Xn. There's no longer the special case theta zero, which was regularized differently, or which was not regularized compared to the parameters theta 1 down to theta. So there is now no longer a theta 0, which is why in these updates, I did not break out a special case for k equals 0. So we then use gradient descent to minimize the cost function j with respect to the features x and with respect to the parameters theta.

상기하자면, 협업 필터링 알고리즘 공식에서 x0 = 1은 없고, x는 R^(n) 차원입니다. 따라서, θ0 = 1은 없고, θ는 R^(n) 차원입니다. 경사 하강 업데이트에서 k = 0인 특별한 경우가 없습니다. 경사 하강법을 활용하여 피처 x와 관련한 비용 함수 J를 최소화하고, 파라미터 θ와 관련한 비용 함수 J를 최소화합니다.

And finally, given a user, if a user has some parameters, theta, and if there's a movie with some sort of learned features x, we would then predict that that movie would be given a star rating by that user of theta transpose j. Or just to fill those in, then we're saying that if user J has not yet rated movie I, then what we do is predict that user J is going to rate movie I according to theta J transpose Xi. So that's the collaborative filtering algorithm and if you implement this algorithm you actually get a pretty decent algorithm that will simultaneously learn good features for hopefully all the movies as well as learn parameters for all the users and hopefully give pretty good predictions for how different users will rate different movies that they have not yet rated.

마지막으로 사용자 j의 영화 취향을 나타내는 파라미터 θ^(j)가 있고 영화의 특성을 나타내는 피처 x^(i)를 학습할 수 있다면, 해당 영화를 사용자가 어떻게 평가할지를 예측할 수 있습니다. 또는 사용자 j가 아직 영화 i를 평가하지 않았다면 (θ^(j))^Tx^(i)로 영화 i를 선호도를 예측할 수 있습니다. 이것이 협업 필터링 알고리즘입니다. 협업 필터링 알고리즘은 모든 영화의 피처를 학습하고 모든 사용자에 대한 파라미터를 학습하고, 다른 사용자들이 아직 평가하지 않은 다른 영화에 대한 꽤 좋은 예측을 할 수 있습니다.

앤드류 응의 머신러닝 동영상 강의

정리하며

새로운 알고리즘과 과거 알고리즘과 지금 알고리즘의 차이는 x를 최소화하고 나서 θ를 최소화하는 작업을 반복적으로 수행하는 것이 아니라 x와 θ를 동시에 최소화합니다.

마지막으로 고려해야 할 세부사항은 인터셉터 x0입니다. 추천 알고리즘이 x0 = 1과 같은 피처가 필요하다면 자체적으로 학습할 것입니다. 아마도 알고리즘은 필요할 경우 x1을 1로 설정할 수 있습니다. 따라서 알고리즘은 스스로 학습할 수 있는 유연성을 가졌으므로 x0 피처는 고려하지 않습니다.

따라서, 협업 필터링 알고리즘을 개발하는 방법은 다음과 같습니다.

1) 파라미터 x와 θ를 임의의 작은 값으로 초기화합니다.

2) 고급 최적화 알고리증 하나를 사용하여 비용 함수 J를 최소화합니다.

경사 하강 업데이트에서 k = 0인 특별한 경우가 없습니다. 경사 하강법을 활용하여 피처 x와 관련한 비용 함수 J를 최소화하고, 파라미터 θ와 관련한 비용 함수 J를 최소화합니다.

3) 사용자 j의 영화 취향을 나타내는 파라미터 θ^(j)가 있고 영화의 특성을 나타내는 피처 x^(i)를 학습할 수 있다면, 영화에 대한 사용자 j의 평가를 예측할 수 있습니다. 사용자 j가 아직 영화 i를 평가하지 않았다면 (θ^(j))^Tx^(i)로 영화 i를 선호도를 예측할 수 있습니다.