by 라인하트 Dec 13. 2020

Andrew Ng's Machine Learning (16-6): Mean Normalization for Collaborative Filtering

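Professor Andrew Ng, co-founder of the online learning platform Coursera, is a leading figure in artificial intelligence. The introductory machine learning course he taught at Stanford University is available for free on Coursera (Coursera.org). It is an essential course for machine learning beginners, and one that people studying AI and machine learning on their own naturally come across.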

Recommender Systems

Low Rank Matrix Factorization

Implementation Detail: Mean Normalization

By now you've seen all of the main pieces of the recommender system algorithm or the collaborative filtering algorithm. In this video I want to just share one last implementational detail, namely mean normalization, which can sometimes just make the algorithm work a little bit better.

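So far we have covered the main parts of the recommender system algorithm, that is, collaborative filtering. This lecture explains one last implementation detail, mean normalization, which sometimes makes the algorithm work a little better.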

To motivate the idea of mean normalization, let's consider an example of where there's a user that has not rated any movies. So, in addition to our four users, Alice, Bob, Carol, and Dave, I've added a fifth user, Eve, who hasn't rated any movies. Let's see what our collaborative filtering algorithm will do on this user.

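To see why mean normalization helps, suppose there is a user who has not rated any movies. In addition to the four users Alice, Bob, Carol, and Dave, we add a fifth user, Eve, who has rated nothing. Let's look at what the collaborative filtering algorithm does for Eve.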

Let's say that n is equal to 2 and so we're going to learn two features and we are going to have to learn a parameter vector theta 5, which is going to be in R^2, remember this is now vectors in R^n not R^(n+1), we'll learn the parameter vector theta 5 for our user number 5, Eve. So if we look in the first term in this optimization objective, well the user Eve hasn't rated any movies, so there are no movies for which r(i, j) is equal to one for the user Eve and so this first term plays no role at all in determining theta 5 because there are no movies that Eve has rated. And so the only term that affects theta 5 is this term. And so we're saying that we want to choose vector theta 5 so that the last regularization term is as small as possible. In other words we want to minimize this lambda over 2 times theta 5 subscript 1 squared plus theta 5 subscript 2 squared. So that's the component of the regularization term that corresponds to user 5, and of course if your goal is to minimize this term, then what you're going to end up with is just theta 5 equals 0 0. Because a regularization term is encouraging us to set parameters close to 0 and if there is no data to try to pull the parameters away from 0, because this first term doesn't affect theta 5, we just end up with theta 5 equals the vector of all zeros. And so when we go to predict how user 5 would rate any movie, we have that theta 5 transpose xi, for any i, that's just going to be equal to zero. Because theta 5 is 0 for any value of x, this inner product is going to be equal to 0. And what we're going to have therefore, is that we're going to predict that Eve is going to rate every single movie with zero stars.

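Let the number of features be n = 2. The parameter vector θ^(5) for the fifth user, Eve, is then a vector in R^2. Note that it is a vector in R^n, not R^(n+1). Looking at the first term of the optimization objective, Eve has not rated any movie, so there is no movie i with r(i, 5) = 1, and the first term plays no role in determining θ^(5). The only term that affects θ^(5) is the regularization term, so θ^(5) is chosen to make that term as small as possible. The part of the regularization term that belongs to the fifth user is: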

λ/2 * ((θ1^(5))^2 + (θ2^(5))^2)

This is the regularization term for the fifth user, and minimizing it gives θ^(5) = [0; 0]. Regularization pushes parameters toward 0, and because the first term does not involve θ^(5), there is no data to pull them away from 0. So when we predict how the fifth user would rate movie i, the prediction (θ^(5))^T x^(i) is 0 for every i. The algorithm ends up predicting that Eve gives every movie 0 stars.
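The argument above can be written out as a short derivation. This is only a sketch: it assumes the squared-error form of the collaborative filtering objective used earlier in the course, restricted to the terms that involve θ^(5), and it assumes λ > 0.

```latex
\[
J\big(\theta^{(5)}\big)
  = \underbrace{\frac{1}{2}\sum_{i:\,r(i,5)=1}
      \Big(\big(\theta^{(5)}\big)^{T} x^{(i)} - y^{(i,5)}\Big)^{2}}_{=\,0\ \text{(empty sum: Eve rated nothing)}}
  + \frac{\lambda}{2}\sum_{k=1}^{2}\Big(\theta^{(5)}_{k}\Big)^{2}
\]
\[
\frac{\partial J}{\partial \theta^{(5)}_{k}} = \lambda\,\theta^{(5)}_{k} = 0
\;\Longrightarrow\;
\theta^{(5)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix},
\qquad
\big(\theta^{(5)}\big)^{T} x^{(i)} = 0 \ \text{for every } i .
\]
```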

But this doesn't seem very useful, does it? I mean if you look at the different movies, Love at Last, this first movie, a couple people rated it 5 stars. And for even the Swords vs. Karate, someone rated it 5 stars. So some people do like some movies. It seems not useful to just predict that Eve is going to rate everything 0 stars. And in fact if we're predicting that Eve is going to rate everything 0 stars, we also don't have any good way of recommending any movies to her, because you know all of these movies are getting exactly the same predicted rating for Eve so there's no one movie with a higher predicted rating that we could recommend to her, so, that's not very good.

This is not useful. A couple of people gave the first movie, Love at Last, 5 stars, and someone gave Swords vs. Karate 5 stars, so some people do like some movies. Predicting that Eve will give every movie 0 stars tells us nothing. In fact, with such predictions we cannot recommend any movie to Eve, because every movie gets exactly the same predicted rating and no movie stands out with a higher one. So this is not a good approach.

The idea of mean normalization will let us fix this problem. So here's how it works. As before let me group all of my movie ratings into this matrix Y, so just take all of these ratings and group them into matrix Y. And this column over here of all question marks corresponds to Eve's not having rated any movies. Now to perform mean normalization what I'm going to do is compute the average rating that each movie obtained. And I'm going to store that in a vector that we'll call mu. So the first movie got two 5-star and two 0-star ratings, so the average of that is a 2.5-star rating. The second movie had an average of 2.5-stars and so on. And the final movie that has 0, 0, 5, 0. And the average of 0, 0, 5, 0, that averages out to an average of 1.25 rating.

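Mean normalization fixes this problem. Here is how it works. As before, group all movie ratings into the matrix Y. The fifth column of Y is all question marks, because Eve has not rated any movie. To mean-normalize, compute the average rating each movie received and store it in a vector μ. The first movie received 5, 5, 0, 0, so its average is 2.5. The second movie received 5 and 0, so its average is also 2.5. The last movie received 0, 0, 5, 0, so its average is 1.25.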
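A minimal sketch of computing μ in plain Python. The matrix below is an assumption reconstructed from the numbers quoted in this post: rows 1, 2, and 5 follow the ratings stated above, while rows 3 and 4 are illustrative values chosen so that their means match the 2 and 2.25 quoted later. `None` stands for a question mark, and `movie_means` is a hypothetical helper name.

```python
# Ratings matrix Y: rows are movies, columns are users
# (Alice, Bob, Carol, Dave, Eve). None marks a missing rating ("?").
# Rows 3 and 4 are assumed values, not taken from this post.
Y = [
    [5, 5, 0, 0, None],        # movie 1: ratings 5, 5, 0, 0
    [5, None, None, 0, None],  # movie 2: ratings 5, 0
    [None, 4, 0, None, None],  # movie 3 (assumed)
    [0, 0, 5, 4, None],        # movie 4 (assumed)
    [0, 0, 5, 0, None],        # movie 5: ratings 0, 0, 5, 0
]


def movie_means(ratings):
    """Average each row over the entries that were actually rated."""
    means = []
    for row in ratings:
        rated = [r for r in row if r is not None]
        # A movie with no ratings gets mean 0.0 here (an arbitrary choice).
        means.append(sum(rated) / len(rated) if rated else 0.0)
    return means


mu = movie_means(Y)
print(mu)  # [2.5, 2.5, 2.0, 2.25, 1.25]
```

Missing ratings are skipped rather than counted as zeros, which is why Eve's empty column does not change any movie's mean.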

And what I'm going to do is look at all the movie ratings and I'm going to subtract off the mean rating. So this first element 5 I'm going to subtract off 2.5 and that gives me 2.5. And the second element 5 subtract off of 2.5, get a 2.5. And then the 0, 0, subtract off 2.5 and you get -2.5, -2.5. In other words, what I'm going to do is take my matrix of movie ratings, take this wide matrix, and subtract from each row the average rating for that movie. So, what I'm doing is just normalizing each movie to have an average rating of zero. And so just one last example. If you look at this last row, 0 0 5 0. We're going to subtract 1.25, and so I end up with these values over here.

Next, subtract each movie's mean from all of its ratings. The entry Y^(1,1) becomes 5 - 2.5 = 2.5, and Y^(1,2) likewise becomes 2.5. The entries Y^(1,3) and Y^(1,4) become 0 - 2.5 = -2.5. In other words, from each row of the ratings matrix Y we subtract that movie's average rating, so that each row (each movie) of the new matrix has mean 0. For the last row, subtracting 1.25 from 0, 0, 5, 0 gives -1.25, -1.25, 3.75, -1.25 (the values in the red box on the slide).
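The subtraction step can be sketched as follows, using only the first and last rows quoted in the lecture. `mean_normalize` is a hypothetical helper name, and `None` again stands for a question mark.

```python
def mean_normalize(ratings):
    """Subtract each movie's (row's) mean from its rated entries.

    Returns the normalized matrix and the list of per-movie means.
    Missing ratings (None) stay missing.
    """
    normalized, mu = [], []
    for row in ratings:
        rated = [r for r in row if r is not None]
        mean = sum(rated) / len(rated) if rated else 0.0
        mu.append(mean)
        normalized.append([None if r is None else r - mean for r in row])
    return normalized, mu


# First and last rows of Y; the fifth column is Eve, who rated nothing.
Y = [
    [5, 5, 0, 0, None],
    [0, 0, 5, 0, None],
]

Y_norm, mu = mean_normalize(Y)
print(Y_norm[0])  # [2.5, 2.5, -2.5, -2.5, None]
print(Y_norm[1])  # [-1.25, -1.25, 3.75, -1.25, None]
```

Each normalized row now sums to zero over its rated entries, which is what "average rating of zero" means here.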

So now and of course the question marks stay a question mark. So each movie in this new matrix Y has an average rating of 0. What I'm going to do then, is take this set of ratings and use it with my collaborative filtering algorithm. So I'm going to pretend that this was the data that I had gotten from my users, or pretend that these are the actual ratings I had gotten from the users, and I'm going to use this as my data set with which to learn my parameters theta J and my features XI - from these mean normalized movie ratings. When I want to make predictions of movie ratings, what I'm going to do is the following: for user J on movie I, I'm gonna predict theta J transpose XI, where X and theta are the parameters that I've learned from this mean normalized data set.

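The question marks stay question marks, and each movie in the new matrix Y has an average rating of 0. We then run collaborative filtering on this new matrix, treating it as if it were the rating data actually collected from users. The algorithm learns the user parameters θ^(j) and the movie features x^(i) from it. At this point the prediction for user j on movie i is: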

(θ^(j))^Tx^(i)

Here θ^(j) and x^(i) are the values learned from the new matrix Y, the mean-normalized data set.

But, because on the data set I had subtracted off the means, in order to make a prediction on movie i, I'm going to need to add back in the mean, and so I'm going to add back in mu i. And so that's going to be my prediction, where in my training data I subtracted off all the means and so when we make predictions we need to add back in these means mu i for movie i. And so specifically if you look at user 5, which is Eve, the same argument as the previous slide still applies in the sense that Eve had not rated any movies and so the learned parameter for user 5 is still going to be equal to 0, 0. And so what we're going to get then is that on a particular movie i we're going to predict for Eve theta 5 transpose xi plus add back in mu i and so this first component is going to be equal to zero, if theta five is equal to zero. And so on movie i, we are going to end up predicting mu i. And, this actually makes sense. It means that on movie 1 we're going to predict Eve rates it 2.5. On movie 2 we're gonna predict Eve rates it 2.5. On movie 3 we're gonna predict Eve rates it at 2 and so on. This actually makes sense, because it says that if Eve hasn't rated any movies and we just don't know anything about this new user Eve, what we're going to do is just predict for each of the movies the average rating that those movies got.

However, the mean was subtracted from every rating in the normalized data set, so to predict a rating for movie i on the original scale we have to add the mean μi back in. The prediction for user j on movie i therefore becomes:

(θ^(j))^Tx^(i) + μi

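In particular, the fifth user Eve, who has rated nothing, still gets θ^(5) = [0; 0]. For her, (θ^(5))^T x^(i) is 0 for every movie, but μi remains. The predicted ratings for Eve are therefore 2.5 for movie 1, 2.5 for movie 2, 2 for movie 3, 2.25 for movie 4, and 1.25 for movie 5. This makes sense: even though Eve has rated nothing and we know nothing about her, the algorithm predicts the average rating each movie received and can recommend movies on that basis.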
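A small sketch of the prediction rule (θ^(j))^T x^(i) + μi for Eve. The feature vectors in `X` are made-up placeholders (the post gives no learned values), and `predict` is a hypothetical helper name. Because θ^(5) is all zeros, the features do not affect the result.

```python
def predict(theta_j, x_i, mu_i):
    """Predicted rating: inner product of theta_j and x_i, plus the movie mean."""
    return sum(t * x for t, x in zip(theta_j, x_i)) + mu_i


# Per-movie means quoted in the lecture.
mu = [2.5, 2.5, 2.0, 2.25, 1.25]

# Hypothetical learned features for the five movies (n = 2); values are made up.
X = [[0.9, 0.1], [1.0, 0.0], [0.8, 0.2], [0.1, 1.0], [0.0, 0.9]]

# Eve rated nothing, so regularization drives her parameters to zero.
theta_eve = [0.0, 0.0]

predictions = [predict(theta_eve, x_i, mu_i) for x_i, mu_i in zip(X, mu)]
print(predictions)  # [2.5, 2.5, 2.0, 2.25, 1.25]
```

The predictions now differ from movie to movie, so the movies can be ranked for Eve, which was impossible when every prediction was 0.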

Finally, as an aside, in this video we talked about mean normalization, where we normalized each row of the matrix y, to have mean 0. In case you have some movies with no ratings, so it is analogous to a user who hasn't rated anything, but in case you have some movies with no ratings, you can also play with versions of the algorithm, where you normalize the different columns to have means zero, instead of normalizing the rows to have mean zero, although that's maybe less important, because if you really have a movie with no rating, maybe you just shouldn't recommend that movie to anyone, anyway. And so, taking care of the case of a user who hasn't rated anything might be more important than taking care of the case of a movie that hasn't gotten a single rating.

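Finally, this lecture normalized each row of the matrix Y to have mean 0. A movie with no ratings is analogous to a user who has rated nothing. If you have such movies, you can try a version of the algorithm that normalizes the columns to have mean 0 instead of the rows. This is probably less important, though, because a movie with no ratings at all perhaps should not be recommended to anyone anyway. Handling a user who has rated nothing is likely more important than handling a movie that has not received a single rating.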
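The column-wise variant mentioned in this aside might look like the sketch below. The lecture only says this variant can be tried, so the details are an assumption: the small matrix is made up, with its second movie left unrated, and `user_means` is a hypothetical helper name.

```python
def user_means(ratings):
    """Average each column (user) over the movies that user actually rated."""
    n_users = len(ratings[0])
    means = []
    for j in range(n_users):
        rated = [row[j] for row in ratings if row[j] is not None]
        means.append(sum(rated) / len(rated) if rated else 0.0)
    return means


# Made-up example: rows are movies, columns are users; movie 2 has no ratings.
Y = [
    [5, 5, 0, 0],
    [None, None, None, None],
    [0, 0, 5, 4],
]

mu_user = user_means(Y)
print(mu_user)  # [2.5, 2.5, 2.5, 2.0]
```

With this normalization, an unrated movie whose features end up at zero would be predicted at each user's own average rating rather than at 0.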

So to summarize, that's how you can do mean normalization as a sort of pre-processing step for collaborative filtering. Depending on your data set, this might sometimes make your implementation work just a little bit better.

To summarize, mean normalization is a pre-processing step for collaborative filtering. Depending on the data set, it can make an implementation work a little better.

Andrew Ng's Machine Learning video lecture

Summary

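For a user who has rated no movies (or a movie that nobody has rated), the learned parameters are all 0, so every prediction is 0 and the algorithm cannot make a recommendation. Mean normalization is a pre-processing step that makes recommendation possible in this situation: average the ratings of each movie, subtract that mean from the existing ratings, and at prediction time add the mean back to get a rating on the original scale.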

(θ^(j))^Tx^(i) + μi

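Even when a user has rated no movies, the algorithm recommends movies based on the average rating each movie received. Similarly, for a movie with no ratings, normalizing the columns instead lets the prediction fall back on averages of the ratings that users have given to other movies.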