by 라인하트 Nov 05. 2020

Andrew Ng's Machine Learning (11-1): Improving a Spam Classifier

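Professor Andrew Ng, co-founder of the online course platform Coursera, is a leading figure in the artificial intelligence field. The lectures he gave to machine learning beginners at Stanford University are available for free as an online course on Coursera (Coursera.org). The course is essential for newcomers to machine learning, and anyone studying AI and machine learning on their own naturally comes across it.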

Advice for Applying Machine Learning

Machine Learning System Design

Prioritizing What to Work On (Improving a Spam Classifier)

In the next few videos I'd like to talk about machine learning system design. These videos will touch on the main issues that you may face when designing a complex machine learning system, and will actually try to give advice on how to strategize putting together a complex machine learning system. In case this next set of videos seems a little disjointed, that's because these videos will touch on a range of the different issues that you may come across when designing complex learning systems. And even though the next set of videos may seem somewhat less mathematical, I think that this material may turn out to be very useful, and potentially a huge time saver when you're building big machine learning systems. Concretely, I'd like to begin with the issue of prioritizing how to spend your time on what to work on, and I'll begin with an example on spam classification.

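The next few lectures cover machine learning system design. They address the main problems you may face when designing a complex machine learning system and offer advice on how to put one together. The material can save a lot of time when you build a large machine learning system.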

We start with the question of how to prioritize what to work on.

Let's say you want to build a spam classifier. Here are a couple of examples of obvious spam and non-spam emails. The one on the left tries to sell things. And notice how spammers will deliberately misspell words, like medicine with a 1 there, and mortgages. And on the right is maybe an obvious example of non-spam email, actually email from my younger brother. Let's say we have a labeled training set of some number of spam emails and some non-spam emails denoted with labels y equals 1 or 0. How do we build a classifier using supervised learning to distinguish between spam and non-spam?

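Here is a spam classification example. The email on the left is spam that tries to sell things, and the one on the right is an ordinary email, in fact one from my younger brother. Spammers deliberately misspell words, as in w4tches. Suppose we have a training set in which spam emails are labeled y = 1 and non-spam emails y = 0. How can we use supervised learning to build a classifier that tells spam from non-spam?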

In order to apply supervised learning, the first decision we must make is how do we want to represent x, that is the features of the email. Given the features x and the labels y in our training set, we can then train a classifier, for example using logistic regression. Here's one way to choose a set of features for our emails. We could come up with, say, a list of maybe a hundred words that we think are indicative of whether an email is spam or non-spam. For example, if a piece of email contains the word 'deal', maybe it's more likely to be spam; if it contains the word 'buy', maybe more likely to be spam; a word like 'discount' is more likely to be spam; whereas if a piece of email contains my name, Andrew, maybe that means the person actually knows who I am and that might mean it's less likely to be spam. And maybe for some reason I think the word "now" may be indicative of non-spam because I get a lot of urgent emails, and so on, and maybe we choose a hundred words or so.

To apply supervised learning, we first decide how to represent the features x of an email. Given the features x and labels y of the training set, we can train a classifier such as logistic regression. One way to choose a feature set is to make a list of a hundred or so words that indicate whether an email is spam. For example, an email containing 'deal', 'buy', or 'discount' is more likely to be spam. An email containing 'Andrew', on the other hand, suggests that the sender actually knows who I am, so it is less likely to be spam. And I think the word 'now' may indicate non-spam, because I receive a lot of urgent email. We pick about a hundred such words.

Given a piece of email, we can then take this piece of email and encode it into a feature vector as follows. I'm going to take my list of a hundred words and sort them in alphabetical order, say. It doesn't have to be sorted. But, you know, here's my list of words, discount and so on, until eventually I'll get down to now, and so on. And given a piece of email like that shown on the right, I'm going to check and see whether or not each of these words appears in the email, and then I'm going to define a feature vector x where, in this piece of email on the right, my name doesn't appear, so I'm gonna put a zero there. The word "buy" does appear, so I'm gonna put a one there, and I'm just gonna put ones or zeroes. I'm gonna put a one even though the word "buy" occurs twice. I'm not gonna count how many times the word occurs. The word "deal" appears, so I put a one there. The word "discount" doesn't appear, at least not in this little short email, and so on. The word "now" does appear, and so on. So I put ones and zeroes in this feature vector depending on whether or not a particular word appears.

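We encode an email into a feature vector as follows. Sort the list of 100 words alphabetically (sorting is not required, but it gives a fixed order). Take an email such as the one shown at the lower right, check whether each word in the list appears in it, and define the feature vector x accordingly. My name, 'Andrew', does not appear in this email, so that entry is 0. The word 'buy' appears, so that entry is 1. It is 1 even though 'buy' appears twice, because we do not count occurrences. 'deal' appears, so that entry is 1. 'discount' does not appear, at least not in this short email, so that entry is 0. 'now' appears, so that entry is 1. Each entry of the feature vector is thus 1 or 0 depending on whether the corresponding word appears.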
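The encoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five-word `VOCABULARY`, the regex tokenizer, and the sample email text are made up here to stand in for the 100-word list and the email on the slide.

```python
import re

# Hypothetical five-word vocabulary standing in for the 100-word list.
VOCABULARY = ["andrew", "buy", "deal", "discount", "now"]

def encode_email(text, vocabulary=VOCABULARY):
    """Return a 0/1 vector: entry j is 1 if vocabulary word j occurs at least once."""
    # A set drops repeat occurrences, so "buy" twice still yields a single 1.
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return [1 if word in words else 0 for word in vocabulary]

# Made-up email text: "buy" appears twice, "andrew" and "discount" never.
email = "Deal of the week! Buy now, buy today."
print(encode_email(email))  # [0, 1, 1, 0, 1]
```

Using presence rather than counts mirrors the lecture's choice; a count-based vector would be a different (also common) representation.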

And in this example my feature vector would have dimension one hundred, if I chose a hundred words to use for this representation, and each of my features xj will basically be 1 if a particular word, we'll call this word j, appears in the email, and xj would be zero otherwise. Okay. So that gives me a feature representation of a piece of email.

In this example the feature vector x is in R^100: we chose 100 words and assigned one feature xj to each, in order. If word j appears in the email, xj = 1; otherwise xj = 0. This gives a feature representation of the email's content.
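With 0/1 feature vectors in hand, a classifier such as the logistic regression mentioned earlier can be trained on them. The sketch below is a minimal pure-Python version using batch gradient descent; the toy data, the learning rate, the epoch count, and all function names are assumptions chosen for illustration, not values from the lecture.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, learning_rate=0.5, epochs=2000):
    """Batch gradient descent on the logistic loss; returns (weights, bias)."""
    n_features = len(X[0])
    m = len(X)
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this example: h(x) - y.
            error = sigmoid(bias + sum(w * x for w, x in zip(weights, xi))) - yi
            for j, x in enumerate(xi):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / m for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / m
    return weights, bias

def predict_spam_probability(x, weights, bias):
    return sigmoid(bias + sum(w * xj for w, xj in zip(weights, x)))

# Toy training set (made up). Columns: [andrew, buy, deal, discount, now]
X = [
    [0, 1, 1, 0, 1],  # spam
    [0, 1, 0, 1, 0],  # spam
    [0, 0, 1, 1, 0],  # spam
    [1, 0, 0, 0, 1],  # not spam
    [1, 0, 0, 0, 0],  # not spam
    [0, 0, 0, 0, 1],  # not spam
]
y = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic_regression(X, y)
for xi, yi in zip(X, y):
    print(yi, round(predict_spam_probability(xi, weights, bias), 3))
```

On six hand-made examples this only shows the mechanics; it says nothing about how well such a model would do on real email.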

By the way, even though I've described this process as manually picking a hundred words, in practice what's most commonly done is to look through a training set, and in the training set pick the most frequently occurring n words, where n is usually between ten thousand and fifty thousand, and use those as your features. So rather than manually picking a hundred words, here you look through the training examples and pick the most frequently occurring words, like ten thousand to fifty thousand words, and those form the features that you are going to use to represent your email for spam classification.

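I described this as picking 100 words by hand, but in practice the most common approach is to go through the training set and use the n most frequently occurring words as features, where n is usually between 10,000 and 50,000. So instead of hand-picking a hundred words, you look through the training examples, pick the 10,000 to 50,000 most frequent words, and use them as the features that represent an email for spam classification.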
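Picking the n most frequent words from a training set might look like the following sketch. The function name `build_vocabulary`, the regex tokenizer, and the three tiny training emails are assumptions for illustration; a real training set would be far larger, with n in the tens of thousands.

```python
import re
from collections import Counter

def build_vocabulary(training_emails, n):
    """Return the n most frequent words across all emails, sorted alphabetically."""
    counts = Counter()
    for text in training_emails:
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    # most_common(n) returns (word, count) pairs in descending order of count.
    return sorted(word for word, _ in counts.most_common(n))

# Made-up miniature training set.
training_emails = ["buy now buy", "deal now buy", "hello now"]
print(build_vocabulary(training_emails, 2))  # ['buy', 'now']
```

Sorting the result is optional, as the lecture notes; it only fixes which index j each word gets.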

Now, if you're building a spam classifier, one question that you may face is, what's the best use of your time in order to make your spam classifier have higher accuracy and lower error? One natural inclination is to go and collect lots of data. Right? And in fact there's this tendency to think that, well, the more data we have the better the algorithm will do. And in fact, in the email spam domain, there are actually pretty serious projects called Honey Pot Projects, which create fake email addresses and try to get these fake email addresses into the hands of spammers and use that to try to collect tons of spam email, and therefore, you know, get a lot of spam data to train learning algorithms.

When building a spam classifier, one question you face is how best to spend your time to raise its accuracy and lower its error. A natural inclination is to collect more data, since people tend to think that the more data there is, the better the algorithm will do. In the spam domain there are serious efforts called honeypot projects, which create fake email addresses, get them into the hands of spammers, and thereby collect large volumes of spam to train learning algorithms on.

But we've already seen in the previous sets of videos that getting lots of data will often help, but not all the time. But for most machine learning problems, there are a lot of other things you could usually imagine doing to improve performance.

As we saw in earlier lectures, however, getting lots of data often helps but not always. For most machine learning problems there are many other things you could do to improve performance.

For spam, one thing you might think of is to develop more sophisticated features on the email, maybe based on the email routing information. And this would be information contained in the email header. So, when spammers send email, very often they will try to obscure the origins of the email, and maybe use fake email headers, or send email through very unusual sets of computer servers, through very unusual routes, in order to get the spam to you. And some of this information will be reflected in the email header. And so one can imagine looking at the email headers and trying to develop more sophisticated features to capture this sort of email routing information to identify if something is spam.

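For spam, one idea is to develop more sophisticated features based on email routing information, which is contained in the email header. When spammers send email, they often try to obscure its origin and may use fake headers, or they send it through unusual servers and unusual routes, and some of this shows up in the header. So one can look at the headers and try to develop more sophisticated features that capture this routing information and help identify spam.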
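As a rough sketch of what header-based features could look like, the snippet below uses Python's standard `email` package to read a raw message. The two features chosen here (the number of `Received` hops, and whether the `From` domain matches the `Return-Path` domain) are hypothetical examples, not features named in the lecture, and the raw message is made up.

```python
from email import message_from_string
from email.utils import parseaddr

def _domain(header_value):
    """Extract the lowercase domain from an address header, or '' if absent."""
    address = parseaddr(header_value or "")[1]
    return address.rpartition("@")[2].lower()

def header_features(raw_email):
    """Return a dict of illustrative routing features from the email header."""
    msg = message_from_string(raw_email)
    received = msg.get_all("Received") or []  # one entry per relay hop
    from_domain = _domain(msg.get("From"))
    return_domain = _domain(msg.get("Return-Path"))
    return {
        "num_received_hops": len(received),
        "from_matches_return_path": int(bool(from_domain) and from_domain == return_domain),
    }

# Made-up raw message with two relay hops and mismatched sender domains.
raw = (
    "Return-Path: <bounce@bulk.example>\n"
    "Received: from relay1.example by mx.example; Thu, 5 Nov 2020 10:00:00 +0000\n"
    "Received: from unknown by relay1.example; Thu, 5 Nov 2020 09:59:00 +0000\n"
    "From: Deals <sales@shop.example>\n"
    "Subject: Buy now\n"
    "\n"
    "Body text\n"
)
print(header_features(raw))
```

Whether such features actually help is exactly the kind of question the lecture says should not be settled by gut feeling.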

Something else you might consider doing is to look at the email message body, that is the email text, and try to develop more sophisticated features. For example, should the word 'discount' and the word 'discounts' be treated as the same word, or should we treat the words 'deal' and 'Dealer' as the same word, maybe even though one is lower case and one is capitalized in this example?

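Another option is to look at the text of the message body and develop more sophisticated features there. For example, should 'discount' and 'discounts' be treated as the same word? Should 'deal' and 'Dealer' be treated as the same word, even though one is lower case and the other is capitalized?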
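One simple way to explore these questions is to normalize words before looking them up in the vocabulary. The sketch below lowercases each word and strips a trailing plural "s"; the rule is deliberately crude and is an assumption made for illustration (a real stemmer handles many more cases), and `normalize_word` is a hypothetical helper name.

```python
def normalize_word(word):
    """Lowercase a word and strip a simple plural 's' (crude, illustrative rule)."""
    word = word.lower()
    # Skip short words and words ending in "ss" (e.g. "boss") to limit damage.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return word

print(normalize_word("Discounts"))  # discount
print(normalize_word("discount"))   # discount
print(normalize_word("Dealer"))     # dealer
print(normalize_word("deal"))       # deal
```

Under this rule 'discount' and 'discounts' merge, while 'deal' and 'Dealer' stay distinct apart from case. Whether they should merge is the open design question raised above.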

Or do we want more complex features about punctuation, because maybe spam is using exclamation marks a lot more? I don't know. And along the same lines, maybe we also want to develop more sophisticated algorithms to detect and maybe to correct deliberate misspellings, like mortgage, medicine, watches. Because spammers actually do this, because if you have watches with a 4 in there then, well, with the simple technique that we talked about just now, the spam classifier might not equate this as the same thing as the word "watches," and so it may have a harder time realizing that something is spam with these deliberate misspellings. And this is why spammers do it.

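Do we need more complex features about punctuation, since spam may use exclamation marks far more often? I don't know. Along the same lines, we could develop more sophisticated algorithms to detect and correct deliberate misspellings such as m0rtgage, med1cine, and w4tches. Spammers really do this, because with the simple technique described above the classifier may not treat w4tches as the same word as 'watches'. Deliberate misspellings make it harder to recognize an email as spam.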
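Both ideas, a punctuation feature and undoing deliberate misspellings, can be prototyped cheaply. The sketch below is an assumption-laden illustration: the digit-to-letter table `DIGIT_TO_LETTER` is a guess at common substitutions, not a rule from the lecture, and it would miss other tricks spammers use.

```python
# Hypothetical table of digit-for-letter substitutions.
DIGIT_TO_LETTER = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "5": "s"})

def undo_digit_substitutions(word):
    """Map digits back to look-alike letters, leaving tokens without letters alone."""
    if not any(ch.isalpha() for ch in word):
        return word  # e.g. "100" or "$50" is a real number, not a disguised word
    return word.lower().translate(DIGIT_TO_LETTER)

def exclamation_count(text):
    """A single punctuation feature: how many exclamation marks the text contains."""
    return text.count("!")

for token in ["w4tches", "med1cine", "m0rtgage", "100"]:
    print(token, "->", undo_digit_substitutions(token))
print(exclamation_count("Buy now!!! Limited offer!"))  # 4
```

After this step, 'w4tches' maps to the same vocabulary entry as 'watches', which is the effect the paragraph above describes.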

While working on a machine learning problem, very often you can brainstorm lists of different things to try, like these. By the way, I've actually worked on the spam problem myself for a while. And I actually spent quite some time on it. And even though I kind of understand the spam problem, I actually know a bit about it, I would actually have a very hard time telling you which of these four options is the best use of your time. So what happens, frankly what happens far too often, is that a research group or product group will randomly fixate on one of these options. And sometimes that turns out not to be the most fruitful way to spend your time, depending, you know, on which of these options someone ends up randomly fixating on.

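When working on a machine learning problem, you can often brainstorm a list of things to try, like the ones above. I worked on the spam problem myself for a while and spent quite some time on it. Even though I know a fair bit about the problem, I would have a very hard time telling you which of these four options is the best use of your time. What happens far too often is that a research group or product group fixates on one of the options at random, and depending on which one it is, that sometimes turns out not to be the most fruitful use of time.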

By the way, in fact, if you even get to the stage where you brainstorm a list of different options to try, you're probably already ahead of the curve. Sadly, what most people do is instead of trying to list out the options of things you might try, what far too many people do is wake up one morning and, for some reason, just, you know, have a weird gut feeling that, "Oh let's have a huge honeypot project to go and collect tons more data" and for whatever strange reason just sort of wake up one morning and randomly fixate on one thing and just work on that for six months.

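In fact, if you even get to the stage of brainstorming a list of options to try, you are probably already ahead of the curve. Sadly, instead of listing options, far too many people wake up one morning with a strange gut feeling, say "Let's run a huge honeypot project to collect far more data", fixate on that one thing at random, and work on it for six months.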

But I think we can do better. And in particular what I'd like to do in the next video is tell you about the concept of error analysis and talk about how you can have a more systematic way to choose amongst the options of the many different things you might work on, and therefore be more likely to select what is actually a good way to spend your time, you know, for the next few days, or the next few weeks, or the next few months.

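I think we can do better. In the next lecture I explain the concept of error analysis and a more systematic way to choose among the many options, so that you are more likely to pick something that is actually a good use of your time over the next few days, weeks, or months.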

Andrew Ng's Machine Learning video lecture

Wrapping Up

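When developing an algorithm to classify spam email, the most common approach is to judge by the words in the email body. That is, make a list of a hundred or so words that indicate whether an email is spam, and tie each word to one feature. If the word assigned to xj appears in the email, xj = 1; otherwise xj = 0. With 100 words, the feature vector x is a vector in R^100.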

One way to improve a spam classifier is to use the information contained in the email header. Another is to develop features that treat 'discount' and 'discounts' as the same word, and that treat the deliberately misspelled 'w4tch' as the same word as 'watch'.

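Adding ideas like these can make a spam classifier more sophisticated. However, it is impossible to know in advance which of the many ideas will actually help. Many people settle on something like a honeypot project without any systematic justification and end up wasting time.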