by 라인하트 Nov 06. 2020

Andrew Ng's Machine Learning (11-2): Error Analysis for a Spam Classifier

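Professor Andrew Ng, co-founder of the online learning platform Coursera, is a leading figure in the AI field. The introductory machine learning course he taught at Stanford University is available for free on Coursera (Coursera.org). It is an essential course for machine learning beginners, and one that anyone studying AI and machine learning on their own is bound to come across.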

Advice for Applying Machine Learning


Machine Learning System Design


Error Analysis

In the last video I talked about how, when faced with a machine learning problem, there are often lots of different ideas for how to improve the algorithm. In this video, let's talk about the concept of error analysis, which will hopefully give you a way to make some of these decisions more systematically.

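The previous lecture described various ideas for improving a learning algorithm when faced with a machine learning problem such as classifying spam email. This lecture introduces error analysis, a way to decide more systematically what to work on next.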

If you're starting work on a machine learning problem, or building a machine learning application, it's often considered very good practice to start not by building a very complicated system with lots of complex features and so on, but instead by building a very simple algorithm that you can implement quickly. When I start on a learning problem, what I usually do is spend at most one day, literally at most 24 hours, trying to get something really quick and dirty running. Frankly, it's not at all a sophisticated system, but I implement it and then test it on my cross-validation data. Once you've done that, you can plot learning curves, which is what we talked about in the previous set of videos. Plot learning curves of the training and test errors to try to figure out whether your learning algorithm may be suffering from high bias, high variance, or something else, and use that to decide whether having more data, more features, and so on is likely to help.

The reason this is a good approach is that often, when you're just starting out on a learning problem, there's really no way to tell in advance whether you need more complex features, or more data, or something else. In the absence of evidence, in the absence of seeing a learning curve, it's just incredibly difficult to figure out where you should be spending your time. It's often by implementing even a very, very quick and dirty implementation, and by plotting learning curves, that you get help making these decisions. So if you like, you can think of this as a way of avoiding what's sometimes called premature optimization in computer programming. It's the idea that we should let evidence guide our decisions on where to spend our time, rather than use gut feeling, which is often wrong.

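When you build a machine learning application or tackle a machine learning problem, you do not need to start with a complicated system that has lots of complex features. Instead, start with a simple algorithm that you can implement quickly, within 24 hours at most. Test this unsophisticated algorithm on the cross-validation set, then plot learning curves. Plotting the training and test errors shows whether the algorithm suffers from high bias, high variance, or some other problem, and that in turn helps you decide whether more data or more features are likely to help.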

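This approach works because, when you first take on a learning problem, it is hard to know in advance what you will need. Without evidence such as a learning curve, it is very difficult to tell what would improve the algorithm or where your time is best spent. A quick implementation and its learning curves give you that evidence. You can think of this as a way of avoiding what programmers call premature optimization: letting evidence, rather than gut feeling (which is often wrong), guide your decisions.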
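The learning-curve step can be sketched in a few lines. This is only an illustration under stated assumptions: the toy quadratic data, the straight-line model, and the helper names `fit_line`, `mean_squared_error`, and `learning_curve` are all made up here and do not come from the lecture.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b on 1-D data (illustrative model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        # Only one distinct x value: fall back to predicting the mean.
        return 0.0, mean_y
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return a, mean_y - a * mean_x


def mean_squared_error(model, xs, ys):
    """Average squared difference between predictions and targets."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def learning_curve(train_x, train_y, cv_x, cv_y):
    """Return (m, training error, cross-validation error) for m = 1..len(train_x)."""
    rows = []
    for m in range(1, len(train_x) + 1):
        model = fit_line(train_x[:m], train_y[:m])
        rows.append((
            m,
            mean_squared_error(model, train_x[:m], train_y[:m]),
            mean_squared_error(model, cv_x, cv_y),
        ))
    return rows


# Toy data (assumption): a quadratic target that a straight line cannot fit well.
train_x = [0, 1, 2, 3, 4, 5, 6, 7]
train_y = [x * x for x in train_x]
cv_x = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5]
cv_y = [x * x for x in cv_x]

curve = learning_curve(train_x, train_y, cv_x, cv_y)
for m, train_err, cv_err in curve:
    print(f"m={m}  train={train_err:8.2f}  cv={cv_err:8.2f}")
```

Reading the printed table follows the rule of thumb from the earlier lectures: if both errors level off at a high value close to each other, high bias is the likely problem; a large gap between a low training error and a high cross-validation error points to high variance.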

In addition to plotting learning curves, one other thing that's often very useful to do is what's called error analysis. What I mean by that is that when building, say, a spam classifier, I will often look at my cross-validation set and manually look at the emails that my algorithm is making errors on. So look at the spam emails and non-spam emails that the algorithm is misclassifying, and see if you can spot any systematic patterns in what type of examples it is misclassifying. Often, this is the process that will inspire you to design new features, or tell you what the current shortcomings of the system are, and give you the inspiration you need to come up with improvements to it.

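Besides plotting learning curves, there is error analysis. When building a spam classifier, look through the cross-validation set and inspect by hand the emails the algorithm gets wrong. Go over the spam and non-spam emails it misclassified and look for systematic patterns in the kinds of email it fails on. Those patterns can inspire new features, and they expose the system's current shortcomings so that you can improve it.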
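Pulling the misclassified emails out of the cross-validation set for manual review might look like the sketch below. The keyword rule `toy_predict`, the four example emails, and the label convention (1 = spam, 0 = not spam) are hypothetical stand-ins for whatever quick classifier and data you actually have.

```python
def misclassified(examples, predict):
    """Return (text, true_label, predicted_label) for every wrong prediction."""
    errors = []
    for text, label in examples:
        predicted = predict(text)
        if predicted != label:
            errors.append((text, label, predicted))
    return errors


def toy_predict(text):
    """Hypothetical stand-in classifier: flags any mail containing 'free'."""
    return 1 if "free" in text.lower() else 0


# Hypothetical cross-validation examples: (email text, label), 1 = spam.
cv_set = [
    ("Free watches, buy now", 1),
    ("Please verify your password here", 1),
    ("Lunch is free tomorrow, see you there", 0),
    ("Meeting moved to 3pm", 0),
]

errors = misclassified(cv_set, toy_predict)
for text, label, predicted in errors:
    print(f"true={label} predicted={predicted}  {text}")
```

The list returned here is what you would then read through by hand, looking for the systematic patterns the lecture describes.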

Concretely, here's a specific example. Let's say you've built a spam classifier and you have 500 examples in your cross-validation set. And let's say in this example that the algorithm has a very high error rate, and it misclassifies 100 of these cross-validation examples. So what I do is manually examine these 100 errors and manually categorize them, based on things like what type of email it is, and what cues or what features you think might have helped the algorithm classify them correctly.

So, specifically, by what type of email it is: if I look through these 100 errors, I might find that the most common types of spam emails among these misclassified examples are maybe emails on pharma or pharmacies, trying to sell drugs; maybe emails that are trying to sell replicas, such as fake watches and other fake goods; and maybe some emails trying to steal passwords. These are also called phishing emails, and that's another big category of emails, and there may be other categories. So in terms of classifying what type of email it is, I would actually go through and count up my hundred emails. Maybe I find that 12 of them are pharma emails, and maybe 4 of them are emails trying to sell replicas, fake watches or something. And maybe I find that 53 of them are what's called phishing emails, basically emails trying to persuade you to give them your password. And 31 emails are other types of emails. It's by counting up the number of emails in these different categories that you might discover, for example, that the algorithm is doing particularly poorly on emails trying to steal passwords.

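Here is a concrete example. Suppose you have built a spam classifier and your cross-validation set has m_cv = 500 examples. The algorithm's error rate is very high: it misclassifies 100 of them. Examine those 100 emails by hand and categorize them, both by what type of email each one is and by what cues or features might have helped the algorithm classify it correctly.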

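Suppose the 100 emails fall into four broad types: 12 pharmacy emails selling drugs, 4 emails selling replicas such as fake watches, 53 phishing emails trying to steal passwords, and 31 others. These per-category counts are what you analyze next.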
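Once each error has a hand-assigned category, the tally itself is trivial. The sketch below simply reproduces the counts from the example; the category names and the `labels` list are illustrative placeholders for labels you would assign while reading the emails.

```python
from collections import Counter

# Hand-assigned category for each of the 100 misclassified emails
# (placeholder list that reproduces the counts in the example).
labels = ["pharma"] * 12 + ["replica"] * 4 + ["phishing"] * 53 + ["other"] * 31

counts = Counter(labels)
total = sum(counts.values())

# Largest category first: this is where the classifier is weakest.
for category, n in counts.most_common():
    print(f"{category:10s} {n:3d}  ({n / total:.0%})")
```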

In this example, the algorithm is doing particularly poorly on emails trying to steal passwords. That may suggest that it might be worth your effort to look more carefully at that type of email and see if you can come up with better features to categorize them correctly. Also, what I might do is look at what cues or what additional features might have helped the algorithm classify the emails. So let's say that some of our hypotheses about features that might help us classify emails better are: trying to detect deliberate misspellings, versus unusual email routing, versus unusual spamming punctuation, such as if people use a lot of exclamation marks. Once again I would manually go through, and let's say I find 5 cases of the first, 16 of the second, and 32 of the third, and a bunch of other types of emails as well. If this is what you get on your cross-validation set, then it really tells you that maybe deliberate misspelling is a sufficiently rare phenomenon that it's not worth all the time trying to write algorithms that detect it. But if you find that a lot of spammers are using unusual punctuation, then maybe that's a strong sign that it might actually be worth your while to spend the time to develop more sophisticated features based on the punctuation.

So this sort of error analysis, which is really the process of manually examining the mistakes that the algorithm makes, can often help guide you to the most fruitful avenues to pursue. This also explains why I often recommend implementing a quick and dirty implementation of an algorithm. What we really want to do is figure out what are the most difficult examples for an algorithm to classify. Very often, different learning algorithms will find similar categories of examples difficult. By having a quick and dirty implementation, that's often a quick way to let you identify some errors and quickly identify what the hard examples are, so that you can focus your effort on those.

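The spam classifier does especially poorly on phishing emails that try to steal passwords, so it is worth looking at that type more closely for features that would handle it better. You can also go through the misclassified emails again and tally which cues might have helped: say 5 cases of deliberate misspelling, 16 of unusual email routing, and 32 of unusual punctuation, plus other kinds. If those are the counts on the cross-validation set, deliberate misspellings are rare enough that writing code to detect them may not be worth the time. Designing more sophisticated punctuation-based features, on the other hand, probably is.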
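If unusual punctuation turns out to be a common cue, a first punctuation feature could be as simple as the counts below. This is only a sketch: the function name `punctuation_features` and the three statistics it returns are illustrative assumptions, not features named in the lecture.

```python
import string


def punctuation_features(text):
    """Simple counts that might expose spam-style punctuation (illustrative)."""
    n_chars = max(len(text), 1)  # avoid dividing by zero on empty text
    n_punct = sum(1 for ch in text if ch in string.punctuation)

    # Length of the longest unbroken run of '!' or '?' characters.
    longest_run = 0
    run = 0
    for ch in text:
        run = run + 1 if ch in "!?" else 0
        longest_run = max(longest_run, run)

    return {
        "exclamations": text.count("!"),
        "punct_ratio": n_punct / n_chars,
        "longest_run": longest_run,
    }


spam_like = punctuation_features("WIN NOW!!! Click here!!")
ordinary = punctuation_features("See you at noon.")
print(spam_like)
print(ordinary)
```

Whether features like these actually help is exactly the kind of question to settle by measuring cross-validation error before and after adding them.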

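Analyzing the examples the algorithm gets wrong is important, because examining the errors by hand points you toward the most fruitful directions. This is also why a quick and simple implementation is recommended: different learning algorithms tend to struggle with similar categories of examples, so even a rough implementation quickly reveals which examples are hard. You can then focus your effort on those.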

Lastly, when developing learning algorithms, one other useful tip is to make sure that you have a numerical evaluation of your learning algorithm. What I mean by that is, if you're developing a learning algorithm, it's often incredibly helpful to have a way of evaluating it that just gives you back a single real number, maybe accuracy, maybe error, but a single real number that tells you how well your learning algorithm is doing. I'll talk more about this specific concept in later videos, but here's a specific example. Let's say we're trying to decide whether or not we should treat words like discount, discounts, discounted, discounting as the same word. Maybe one way to do that is to just look at the first few characters in the word. If you just look at the first few characters of a word, then you figure out that maybe all of these words roughly have similar meanings.

In natural language processing, the way that this is done is actually using a type of software called stemming software. If you ever want to do this yourself, search on a web search engine for the Porter stemmer, and that would be one reasonable piece of software for doing this sort of stemming, which will let you treat all these words, discount, discounts, and so on, as the same word. But using stemming software that basically looks at the first few letters of a word, more or less, can help, but it can also hurt. It can hurt because, for example, the software may mistake the words universe and university as being the same thing, because these two words start off with the same letters.

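Finally, a useful tip when developing a learning algorithm is to evaluate it numerically: have a single real number, such as accuracy or error, that tells you how well the algorithm is doing. Later lectures cover this in detail, but here is a short example. Should 'discount', 'discounts', 'discounted', and 'discounting' be treated as the same word? One approach is to look only at the first few letters of each word, since words that share them tend to have roughly similar meanings.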

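In natural language processing, this is done with stemming software. A word consists of a stem and an ending, and stemming means stripping the ending to isolate the stem. The stem carries the core meaning, while the ending adds supplementary meaning. Searching the web for the Porter stemmer turns up several implementations, and this kind of software lets you treat related words as a single word. Stemming software that, roughly speaking, looks at the first few letters of a word can help, but it can occasionally cause problems. For example, it may treat universe and university as the same word, because both start with the same letters, 'univer'.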
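To see both the benefit and the risk, here is a deliberately crude stand-in for stemming that only keeps the first few letters of a word. It is not the Porter stemmer, which works by applying suffix-stripping rules; the function `prefix_stem` and the cutoff of 6 letters are assumptions chosen purely for illustration.

```python
def prefix_stem(word, length=6):
    """Crude stand-in for stemming: lowercase the word, keep its first letters."""
    return word.lower()[:length]


words = ["discount", "discounts", "discounted", "discounting",
         "universe", "university"]

# Group the words by their truncated form.
groups = {}
for w in words:
    groups.setdefault(prefix_stem(w), []).append(w)

for stem, members in groups.items():
    print(stem, "->", members)
```

The four discount variants collapse into one group, which is the helpful case, but universe and university also land in the same group, which is the kind of mistake the lecture warns about.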

So if you're trying to decide whether or not to use stemming software for a spam classifier, it's not always easy to tell. In particular, error analysis may not actually be helpful for deciding if this sort of stemming idea is a good idea. Instead, the best way to figure out whether stemming software helps your classifier is to have a way to very quickly just try it and see if it works. In order to do this, having a way to numerically evaluate your algorithm is going to be very helpful.

Concretely, maybe the most natural thing to do is to look at the cross-validation error of the algorithm with and without stemming. So, if you run your algorithm without stemming and end up with 5 percent classification error, and you rerun it with stemming and end up with 3 percent classification error, then this decrease in error very quickly allows you to decide that using stemming looks like a good idea. For this particular problem, there's a very natural single real number evaluation metric, namely the cross-validation error. We'll see later examples where coming up with this sort of single real number evaluation metric will need a little bit more work. But as we'll see in a later video, doing so would also let you make these decisions much more quickly, say, whether or not to use stemming.

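Deciding whether to use stemming software in a spam classifier is not easy, and error analysis may not help with this particular decision. The best way to find out is to try stemming quickly and see whether it works, and for that it is very useful to have a single number that evaluates the algorithm.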

For example, compare the algorithm's cross-validation error with and without stemming. If the error is 5% without stemming and 3% with it, you can conclude that using stemming looks like a good idea. For this problem the cross-validation error is a natural single real number evaluation metric. In other problems, coming up with such a metric takes more work, but once you have it, decisions such as whether to use stemming can be made much faster.
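The single number in question is just the fraction of cross-validation examples the classifier gets wrong. A minimal sketch follows; the label and prediction lists are fabricated only to reproduce the 5% and 3% figures from the example, and `classification_error` is a name chosen here for illustration.

```python
def classification_error(predictions, labels):
    """Fraction of examples where the prediction differs from the true label."""
    wrong = sum(1 for p, y in zip(predictions, labels) if p != y)
    return wrong / len(labels)


def flip_first(labels, k):
    """Helper for this demo only: copy the labels and flip the first k of them."""
    return [1 - y if i < k else y for i, y in enumerate(labels)]


# Placeholder cross-validation labels: 50 spam (1), 50 non-spam (0).
cv_labels = [1] * 50 + [0] * 50

# Pretend predictions from two variants of the classifier.
preds_without_stemming = flip_first(cv_labels, 5)  # 5 mistakes
preds_with_stemming = flip_first(cv_labels, 3)     # 3 mistakes

err_without = classification_error(preds_without_stemming, cv_labels)
err_with = classification_error(preds_with_stemming, cv_labels)
print(f"without stemming: {err_without:.1%}")
print(f"with stemming:    {err_with:.1%}")
```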

And, just as one more quick example, let's say that you're also trying to decide whether or not to distinguish between upper and lower case. So, take the word Mom with an upper case M versus mom with a lower case m: should that be treated as the same word or as different words? Should this be treated as the same feature, or as different features? Once again, we have a way to evaluate our algorithm. If I try this, if I stop distinguishing upper and lower case, maybe I end up with 3.2 percent error. I find that this therefore does worse than if I use only stemming. So this lets me very quickly decide whether to distinguish or not to distinguish between upper and lower case.

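Another example is deciding whether to distinguish upper and lower case. Should 'Mom' and 'mom' be treated as the same word, and therefore as the same feature? Evaluate the algorithm again. If you stop distinguishing case and end up with, say, 3.2% error, that is worse than using stemming alone. The number lets you decide quickly whether or not to distinguish upper and lower case.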
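With one number per variant, picking the next step reduces to a comparison. The sketch below just records the three error rates quoted in the example; the variant names are labels invented here.

```python
# Cross-validation error per variant, using the figures from the example.
cv_error = {
    "no stemming": 0.05,
    "stemming": 0.03,
    "stemming + ignore case": 0.032,
}

# The variant with the lowest cross-validation error wins.
best = min(cv_error, key=cv_error.get)

for name, err in sorted(cv_error.items(), key=lambda item: item[1]):
    marker = "  <- best" if name == best else ""
    print(f"{name:24s} {err:.1%}{marker}")
```

Here stemming alone comes out ahead, so in this example you would keep distinguishing upper and lower case.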

So when you're developing a learning algorithm, very often you'll be trying out lots of new ideas and lots of new versions of your learning algorithm. If every time you try out a new idea you end up manually examining a bunch of examples again to see if it got better or worse, that's going to make it really hard to make decisions like: do you use stemming or not? Do you distinguish upper and lower case or not? But by having a single real number evaluation metric, you can just look and see: did the error go up or did it go down? You can use that to try out new ideas much more rapidly and tell almost right away whether your new idea has improved or worsened the performance of the learning algorithm. This will often let you make much faster progress.

So the strongly recommended way to do error analysis is on the cross-validation set rather than the test set. There are people who will do this on the test set, even though that's definitely less mathematically appropriate, and certainly a less recommended thing to do than error analysis on your cross-validation set.

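While developing a learning algorithm you will often try new ideas and new versions. If you had to re-examine a batch of examples by hand every time to see whether things got better or worse, decisions would be very hard to make. Should you use stemming? Should you distinguish upper and lower case? With a single real number evaluation metric, you only need to check whether the error went up or down. You can try new ideas much faster and see almost immediately whether each one improves or worsens the algorithm's performance. Evaluating the algorithm with a single number is what makes fast progress possible.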

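The recommended way to do error analysis is on the cross-validation set, not the test set. Some people do it on the test set, but that is less mathematically appropriate and not recommended. Do your error analysis on the cross-validation set.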

So, to wrap up this video: when starting on a new machine learning problem, what I almost always recommend is to implement a quick and dirty implementation of your learning algorithm. I've almost never seen anyone spend too little time on this quick and dirty implementation. I've pretty much only ever seen people spend much too much time building their first, supposedly quick and dirty, implementation. So really, don't worry about it being too quick, and don't worry about it being too dirty. Just implement something as quickly as you can. Once you have the initial implementation, it is then a powerful tool for deciding where to spend your time next. First, you can look at the errors it makes, and do this sort of error analysis to see what mistakes it makes, and use that to inspire further development. Second, assuming your quick and dirty implementation incorporated a single real number evaluation metric, it can then be a vehicle for you to try out different ideas and quickly see whether they are improving the performance of your algorithm. That lets you decide much more quickly what things to fold in and what things to incorporate into your learning algorithm.

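That covers error analysis. When starting a new machine learning problem, the first thing to do is implement a simple learning algorithm. Almost no one spends too little time on this first implementation; people nearly always spend far too much. So do not worry about it being too simple or too rough. Just get something working as quickly as you can. That initial version is a powerful tool for deciding what to do next. First, you can inspect and analyze its errors for ideas that inspire further development. Second, it gives you a single real number evaluation metric, which lets you measure how each new idea affects the algorithm's performance and decide quickly which ideas to incorporate.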


Summary

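When you first build a machine learning application, you do not need to start with a complex system that has very complex features. Instead, build a very simple algorithm that you can implement quickly, and test it on the cross-validation set.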

First, plot learning curves. They show whether the algorithm suffers from high bias, high variance, or some other problem, and whether more data or more features are likely to help.

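Second, do error analysis by looking directly at the examples the algorithm gets wrong. For a spam classifier, for instance, go through the misclassified examples by hand and sort them by type. This can reveal ideas or features that address the algorithm's weaknesses.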

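For example, with 500 cross-validation examples and 100 errors, categorize the errors by hand and look for patterns. Suppose the tally of helpful cues comes to 5 deliberate misspellings, 16 cases of unusual email routing, and 32 cases of unusual punctuation. Deliberate misspellings are rare enough that a detector for them is probably not worth the effort, while the 32 punctuation cases suggest that punctuation-based features are worth developing first. Looking at the errors directly in this way lets you prioritize what to work on next.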

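Finally, use the cross-validation error as a single real number evaluation metric. For example, when classifying spam, is it a good idea to use stemming software, which groups words that have different endings but the same meaning? A numerical metric lets you decide based on evidence rather than gut feeling. Compare the cross-validation error with and without stemming: if it is 5% without and 3% with, the decision is easy. Should upper and lower case be distinguished? If ignoring case on top of stemming gives 3.2% error, that is worse than stemming alone at 3%, so keep distinguishing case and use only stemming.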

Error analysis should always be done on the cross-validation set rather than the test set. Some people do it on the test set, but that is not recommended.