by 라인하트 Dec 08. 2020

Andrew Ng's Machine Learning (16-1): Recommender Systems Overview

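Professor Andrew Ng, co-founder of the online course platform Coursera, is a leading figure in the artificial intelligence field. The introductory machine learning course he taught at Stanford University is available for free on Coursera (Coursera.org). It is an essential course for machine learning beginners, and one you naturally come across when studying artificial intelligence and machine learning on your own.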

Recommender Systems

Predicting Movie Ratings

Problem Formulation

In this next set of videos, I would like to tell you about recommender systems. I had two motivations for wanting to talk about recommender systems. The first is just that it is an important application of machine learning. Over the last few years, I have occasionally visited different technology companies here in Silicon Valley, and I often talk to people working on machine learning applications there. I've asked people what the most important applications of machine learning are, or which machine learning applications they would most like to see improve in performance. One of the most frequent answers I heard was that there are many groups in Silicon Valley now trying to build better recommender systems. If you think about what websites like Amazon, Netflix, eBay, or iTunes Genius, made by Apple, do, there are many websites or systems that try to recommend new products to you. Amazon recommends new books to you, Netflix tries to recommend new movies to you, and so on. These recommender systems look at what books you may have purchased in the past, or what movies you have rated in the past, and they are responsible today for a substantial fraction of Amazon's revenue. For a company like Netflix, the recommendations made to users are also responsible for a substantial fraction of the movies watched by those users.

And so an improvement in the performance of a recommender system can have a substantial and immediate impact on the bottom line of many of these companies. Recommender systems are kind of a funny problem within academic machine learning: if you go to an academic machine learning conference, the problem of recommender systems actually receives relatively little attention, or at least it's a smaller fraction of what goes on within academia. But if you look at what's happening at many technology companies, the ability to build these systems seems to be a high priority. And that's one of the reasons why I want to talk about them in this class.

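Starting with this lecture, we cover recommender systems. There are two reasons for covering them. The first is that they are an important application of machine learning. Over the past few years, when visiting companies in Silicon Valley, Professor Ng often talked with people who use machine learning and asked them which applications of machine learning matter most, or which ones they would most like to see perform better. One of the most frequent answers was recommender systems. What do websites such as Amazon, Netflix, and eBay do? What does iTunes Genius, made by Apple, do? Amazon recommends new books and Netflix recommends new movies; these websites recommend new products to you. A recommender system looks at which books you bought in the past or which movies you rated in the past. Today, recommender systems account for a substantial fraction of Amazon's revenue, and a substantial fraction of the movies that Netflix users watch are movies the recommender system suggested.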

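An improvement in the performance of a recommender system therefore has a real and immediate impact on these companies' bottom line. Recommender systems receive relatively little attention in academic machine learning, but technology companies treat them as a high priority, and many companies want the ability to build them. This is one of the reasons for covering recommender systems in this course.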

The second reason that I want to talk about recommender systems is that, as we approach the last few sets of videos of this class, I wanted to share with you some of the big ideas in machine learning. We've already seen in this class that features are important for machine learning: the features you choose will have a big effect on the performance of your learning algorithm. So there's this big idea in machine learning, which is that for some problems, maybe not all problems, there are algorithms that can try to automatically learn a good set of features for you. So rather than trying to hand-design or hand-code the features, which is mostly what we've been doing so far, there are a few settings where you might be able to have an algorithm just learn what features to use, and recommender systems are just one example of that sort of setting. There are many others, but through recommender systems we will be able to go a little bit into this idea of learning the features, and you'll be able to see at least one example of this, I think, big idea in machine learning as well.

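The second reason for covering recommender systems is to explain one of the big ideas in machine learning in the final part of this course. The choice of features matters because it strongly affects the performance of a learning algorithm. For some machine learning problems, though not all, there are algorithms that can automatically learn a good set of features. Instead of designing or coding the features by hand, which is mostly what the course has done so far, there are a few settings in which an algorithm can learn which features to use. Recommender systems are one example of such a setting, and they illustrate the idea of learning features, one of the big ideas in machine learning.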

So, without further ado, let's get started and talk about the recommender system problem formulation. As my running example, I'm going to use the modern problem of predicting movie ratings. So, here's the problem. Imagine that you're a website or a company that sells or rents out movies, or what have you. Amazon, Netflix, and I think iTunes are all examples of companies that do this. Let's say you let your users rate different movies using a 1 to 5 star rating, so users may rate something one, two, three, four, or five stars. In order to make this example just a little bit nicer, I'm going to allow 0 to 5 stars as well, because that makes some of the math come out nicer, although most of these websites use the 1 to 5 star scale. So here I have 5 movies: Love at Last, Romance Forever, Cute Puppies of Love, Nonstop Car Chases, and Swords vs. Karate. And we have 4 users, whom we're calling Alice, Bob, Carol, and Dave, with initials A, B, C, and D; we'll call them users 1, 2, 3, and 4. So, let's say Alice really likes Love at Last and rates it 5 stars, and likes Romance Forever and rates it 5 stars. She did not watch Cute Puppies of Love and did not rate it, so we don't have a rating for that, and Alice really did not like Nonstop Car Chases or Swords vs. Karate. A different user, Bob, user two, maybe rated a different set of movies: maybe he likes Love at Last, did not watch Romance Forever, and gave the others a 4, a 0, and a 0. Maybe our third user rates this one 0, did not watch that one, and gives the rest 0, 5, and 5. And, you know, let's just fill in some of the numbers.

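Let's begin with the recommender system problem, using the task of predicting movie ratings as the example. Companies and systems that sell or rent movies, such as Amazon, Netflix, and iTunes, face this problem. Users rate movies with stars. Most of these websites use a 1 to 5 star scale, but this example also allows 0 stars, so a user gives a movie 0, 1, 2, 3, 4, or 5 stars, because that makes some of the math work out more neatly.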

There are 5 movies: Love at Last, Romance Forever, Cute Puppies of Love, Nonstop Car Chases, and Swords vs. Karate. There are 4 users: Alice, Bob, Carol, and Dave, written with the initials A, B, C, and D and numbered 1, 2, 3, and 4. Alice really likes Love at Last and gave it 5 stars, and she likes Romance Forever and gave it 5 stars as well. She has not watched Cute Puppies of Love and did not rate it, so there is no rating for it. Alice really dislikes Nonstop Car Chases and Swords vs. Karate. Another user, Bob (user 2), likes Love at Last and gave it 5 stars, has not watched Romance Forever, and gave the remaining movies 4, 0, and 0. Carol gave Love at Last 0, has not watched Romance Forever, and gave the remaining movies 0, 5, and 5. Dave gave Love at Last and Romance Forever 0 each, has not watched Cute Puppies of Love or Swords vs. Karate, and gave Nonstop Car Chases 4.

And so, just to introduce a bit of notation that we'll be using throughout, I'm going to use n_u to denote the number of users. So in this example, n_u will be equal to 4. The u subscript stands for users. And n_m I'm going to use to denote the number of movies, so here I have five movies, so n_m equals 5. For this example, I have loosely 3 maybe romantic or romantic comedy movies and 2 action movies, and if you look at this small example, it looks like Alice and Bob are giving high ratings to these romantic comedies or movies about love and giving very low ratings to the action movies, and for Carol and Dave, it's the opposite, right? Carol and Dave, users three and four, really like the action movies and give them high ratings, but don't like the romance and love-type movies as much. Specifically, in the recommender system problem, we are given the following data. We have these values r(i, j), and r(i, j) is 1 if user j has rated movie i. Our users rate only some of the movies, so we don't have ratings for the others. And whenever r(i, j) is equal to 1, that is, whenever user j has rated movie i, we also get this number y(i, j), which is the rating given by user j to movie i. And so y(i, j) would be a number from zero to five, depending on the star rating, zero to five stars, that the user gave that particular movie.

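Here is the notation.

n_u is the number of users; the subscript u stands for users. In this example, n_u = 4.

n_m is the number of movies; the subscript m stands for movies. In this example, n_m = 5.

This example has 3 romantic comedy movies and 2 action movies. Alice and Bob give the romantic comedies high ratings and the action movies very low ratings. Carol and Dave are the opposite: they really like the action movies and rate them highly, but do not care much for the romantic comedies.

In the recommender system problem, the data consists of the following.

r(i, j) indicates whether user j has rated movie i. It is 1 if the user has rated the movie and 0 otherwise.

y^(i, j) is the star rating that user j gave to movie i, a number from 0 to 5.

Users cannot rate movies they have not watched, so they rate only some of the movies. Therefore, y^(i, j) is defined only when r(i, j) = 1.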
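The notation can be made concrete with a small NumPy sketch of the example table. This is only an illustration: the array names `Y` and `R`, the use of NaN for an unrated entry, and the row and column ordering are assumptions, and Alice's two action-movie ratings are entered as 0 to match "really did not like".

```python
import numpy as np

nan = np.nan  # stands in for "not rated" (the '?' entries)

# Rows are movies i, columns are users j (Alice, Bob, Carol, Dave).
Y = np.array([
    [5,   5,   0,   0],    # Love at Last
    [5,   nan, nan, 0],    # Romance Forever
    [nan, 4,   0,   nan],  # Cute Puppies of Love
    [0,   0,   5,   4],    # Nonstop Car Chases
    [0,   0,   5,   nan],  # Swords vs. Karate
])

# r(i, j) = 1 where user j has rated movie i, 0 otherwise.
R = (~np.isnan(Y)).astype(int)

# n_m movies and n_u users follow from the shape of the table.
n_m, n_u = Y.shape

print("n_u =", n_u, "n_m =", n_m)
print(R)
print(int(R.sum()), "of", R.size, "entries are rated")
```

Note that the code indexes from 0, so r(3, 1) in the lecture's notation (Cute Puppies of Love, Alice) is `R[2, 0]` here.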

So, the recommender system problem is this: given this data, that is, given these r(i, j)'s and y(i, j)'s, look through the data, look at all the movie ratings that are missing, and try to predict what the values of the question marks should be. In this particular example, I have a very small number of movies and a very small number of users, and so most users have rated most movies, but in realistic settings each of your users may have rated only a minuscule fraction of your movies. Looking at this data, if Alice and Bob both like the romantic movies, maybe we think that Alice would have given this a 5. Maybe we think Bob would have given this a 4.5 or some high value, whereas we think maybe Carol and Dave would have given these very low ratings. And Dave, well, if Dave really likes action movies, maybe he would have given Swords vs. Karate a 4 rating or maybe a 5 rating, okay? And so, our job in developing a recommender system is to come up with a learning algorithm that can automatically fill in these missing values for us, so that we can look at, say, the movies that the user has not yet watched and recommend new movies to that user to watch. You try to predict what else might be interesting to a user. So that's the formalism of the recommender system problem. In the next video we'll start to develop a learning algorithm to address this problem.

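The recommender system problem, then, is to look through the r(i, j) and y^(i, j) data and predict the value of every missing rating, the '?' entries. In this example the numbers of movies and users are very small, so most users have rated most movies. In practice, each user rates only a tiny fraction of the movies. In this data, if Alice and Bob both like romantic movies, Alice would probably give Cute Puppies of Love a 5 and Bob would give Romance Forever about 4.5. Carol and Dave give romantic comedies very low ratings, so their '?' entries for those movies would probably be 0. If Dave likes action movies, he would give Swords vs. Karate a 4 or a 5. Developing a recommender system therefore means building a learning algorithm that automatically fills in the missing '?' values so that the system can recommend movies the user has not yet watched. The role of the learning algorithm is to predict and recommend movies the user is likely to enjoy. The next lecture develops a learning algorithm to solve this problem.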
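The hand reasoning above (Alice likes romance, so guess a high rating for an unseen romance movie) can be mimicked with a deliberately naive sketch. This is not the learning algorithm the next lecture develops. It assumes hand-labelled genres, taken from the remark that the example has three romance movies and two action movies, and it fills each '?' with that user's average rating for the same genre. The names `genres` and `guess_missing` are hypothetical.

```python
import numpy as np

movies = ["Love at Last", "Romance Forever", "Cute Puppies of Love",
          "Nonstop Car Chases", "Swords vs. Karate"]
users = ["Alice", "Bob", "Carol", "Dave"]

# Hand-labelled genres: an assumption for illustration only.
genres = np.array(["romance", "romance", "romance", "action", "action"])

nan = np.nan  # stands in for a missing rating ('?')
# Rows are movies, columns are users.
Y = np.array([
    [5,   5,   0,   0],
    [5,   nan, nan, 0],
    [nan, 4,   0,   nan],
    [0,   0,   5,   4],
    [0,   0,   5,   nan],
])


def guess_missing(Y, genres):
    """Guess each missing rating as the user's mean rating within the same genre.

    Assumes every user has rated at least one movie in each genre, which
    holds for this small table.
    """
    guesses = {}
    for i, j in zip(*np.where(np.isnan(Y))):
        # Movies of the same genre that user j has actually rated.
        rated_same_genre = (genres == genres[i]) & ~np.isnan(Y[:, j])
        guesses[(movies[i], users[j])] = float(Y[rated_same_genre, j].mean())
    return guesses


guesses = guess_missing(Y, genres)
for (movie, user), value in guesses.items():
    print(f"{user} / {movie}: {value:.1f}")
```

On this table the guesses line up with the intuition in the text, but only because the genres were supplied by hand. Learning such features automatically is the idea the lecture is leading up to.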

Andrew Ng's Machine Learning video lecture

Summary

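There are two reasons for covering recommender systems. The first is that they are among the machine learning applications that Silicon Valley companies care about most. Websites such as Amazon, Netflix, and eBay make recommendations: Amazon recommends new books and Netflix recommends new movies. An improvement in the performance of a recommender system has an immediate impact on these companies' revenue. The second is that recommender systems are an example of a setting in which an algorithm can automatically learn a good set of features. So far the features have had to be designed or coded by hand, but in some settings an algorithm can learn which features to use. This is one of the big ideas in machine learning.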

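Here is the notation needed to understand recommender systems.

n_u is the number of users

n_m is the number of movies

r(i, j) = 1 if user j has rated movie i

y^(i, j) is the rating that user j gave to movie i, a value from 0 to 5

In short, developing a recommender system means building a learning algorithm that automatically fills in these missing values so that the system can recommend movies the user has not yet watched.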