by 라인하트 Nov 07. 2020

Andrew Ng's Machine Learning (11-3): Skewed Data

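Professor Andrew Ng, co-founder of the online course platform Coursera, is a leading figure in the AI field. The course he taught to machine learning beginners at Stanford University is available for free on Coursera (Coursera.org). It is an essential course for newcomers to machine learning, and one you naturally run into when studying AI and machine learning on your own.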

Advice for Applying Machine Learning

Handling Skewed Data

Error Metrics for Skewed Classes

In the previous video, I talked about error analysis and the importance of having error metrics, that is of having a single real number evaluation metric for your learning algorithm to tell how well it's doing.

The previous lecture covered error analysis and why error metrics matter: a single real-number evaluation metric tells you how well a learning algorithm is doing.

In the context of evaluation and of error metrics, there is one important case, where it's particularly tricky to come up with an appropriate error metric, or evaluation metric, for your learning algorithm. That case is the case of what's called skewed classes. Let me tell you what that means. Consider the problem of cancer classification, where we have features of medical patients and we want to decide whether or not they have cancer. So this is like the malignant versus benign tumor classification example that we had earlier. So let's say y equals 1 if the patient has cancer and y equals 0 if they do not. We have trained a logistic regression classifier and let's say we test our classifier on a test set and find that we get 1 percent error. So, we're making 99% correct diagnoses. Seems like a really impressive result, right. We're correct 99% of the time.

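There is one case where choosing an appropriate evaluation or error metric for a learning algorithm is especially tricky: skewed classes. An example makes the term concrete. Consider cancer classification: given a patient's features, the algorithm decides whether the patient has cancer. This is the same as the malignant-versus-benign tumor example covered earlier. Let y = 1 if the patient has cancer and y = 0 otherwise. Suppose we train a logistic regression classifier and measure 1% error on the test set. In other words, the algorithm diagnoses correctly 99% of the time, which looks very impressive.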

But now, let's say we find out that only 0.5 percent of patients in our training and test sets actually have cancer. So only half a percent of the patients that come through our screening process have cancer. In this case, the 1% error no longer looks so impressive. And in particular, here's a piece of code, here's actually a piece of non learning code that takes this input of features x and it ignores it. It just sets y equals 0 and always predicts, you know, nobody has cancer and this algorithm would actually get 0.5 percent error. So this is even better than the 1% error that we were getting just now and this is a non learning algorithm that you know, it is just predicting y equals 0 all the time. So this setting of when the ratio of positive to negative examples is very close to one of two extremes, where, in this case, the number of positive examples is much, much smaller than the number of negative examples because y equals one so rarely, this is what we call the case of skewed classes. We just have a lot more of examples from one class than from the other class. And by just predicting y equals 0 all the time, or maybe predicting y equals 1 all the time, an algorithm can do pretty well.

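Now suppose only 0.5% of the patients in the training and test sets actually have cancer. That is, only half a percent of the patients who come through screening have cancer. The 1% error no longer looks impressive. Consider a trivial piece of code that ignores the features x and always outputs y = 0. It learns nothing and always predicts that nobody has cancer, yet its error is only 0.5%, which beats the 1% error of the learning algorithm. Skewed classes are the setting where the ratio of positive to negative examples sits very close to one extreme. Here, the examples with y = 1 (cancer) are far fewer than the examples with y = 0. In short, one class has overwhelmingly more examples than the other, and an algorithm that always predicts y = 0, or always predicts y = 1, can appear to do well.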
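The non-learning predictor described above can be sketched in a few lines of Python. The test set here is hypothetical (1,000 patients, 5 of them with cancer, matching the 0.5% rate in the lecture), and the names `always_negative` and `error_rate` are illustrative, not from the course.

```python
def always_negative(features):
    """Non-learning 'classifier': ignores its input and always predicts y = 0."""
    return 0


def error_rate(labels, predictions):
    """Fraction of examples where the prediction differs from the label."""
    wrong = sum(1 for y, p in zip(labels, predictions) if y != p)
    return wrong / len(labels)


# Hypothetical skewed test set: 5 positives (cancer) out of 1000 patients.
labels = [1] * 5 + [0] * 995
predictions = [always_negative(None) for _ in labels]

baseline_error = error_rate(labels, predictions)
print(baseline_error)  # 0.005, i.e. 0.5% error without learning anything
```

Any learned classifier evaluated by error rate alone has to be compared against this kind of baseline before its error looks meaningful.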

So the problem with using classification error or classification accuracy as our evaluation metric is the following. Let's say you have one learning algorithm that's getting 99.2% accuracy. So, that's a 0.8% error. Let's say you make a change to your algorithm and you now are getting 99.5% accuracy. That is 0.5% error. So, is this an improvement to the algorithm or not? One of the nice things about having a single real number evaluation metric is this helps us to quickly decide if we just made a good change to the algorithm or not. By going from 99.2% accuracy to 99.5% accuracy, you know, did we just do something useful or did we just replace our code with something that just predicts y equals zero more often? So, if you have very skewed classes it becomes much harder to use just classification accuracy, because you can get very high classification accuracies or very low errors, and it's not always clear if doing so is really improving the quality of your classifier because predicting y equals 0 all the time doesn't seem like a particularly good classifier. But just predicting y equals 0 more often can bring your error down to, you know, maybe as low as 0.5%.

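So using classification error or classification accuracy as the evaluation metric has the following problem. Say one algorithm reaches 99.2% accuracy, which is 0.8% error. You modify it and now reach 99.5% accuracy, which is 0.5% error. Did the algorithm improve? A single real-number metric is supposed to let you judge quickly whether a change or a new idea helped. But when accuracy goes from 99.2% to 99.5%, did the change do something useful, or did it just make the code predict y = 0 more often? With very skewed classes, relying on classification accuracy alone is risky, because very high accuracy or very low error does not necessarily mean a better classifier. Always predicting y = 0 is clearly not a good classifier, yet simply predicting y = 0 more often can push the error down to as low as 0.5%.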

When we're faced with such skewed classes therefore we would want to come up with a different error metric or a different evaluation metric. One such evaluation metric is what's called precision/recall. Let me explain what that is. Let's say we are evaluating a classifier on the test set. For the examples in the test set the actual class of that example in the test set is going to be either one or zero, right, if there is a binary classification problem. And what our learning algorithm will do is it will, you know, predict some value for the class and our learning algorithm will predict the value for each example in my test set and the predicted value will also be either one or zero. So let me draw a two by two table as follows, and fill in these entries depending on what was the actual class and what was the predicted class. If we have an example where the actual class is one and the predicted class is one then that's called an example that's a true positive, meaning our algorithm predicted that it's positive and in reality the example is positive. If our learning algorithm predicted that something is negative, class zero, and the actual class is also class zero then that's what's called a true negative. We predicted zero and it actually is zero. To find the other two boxes, if our learning algorithm predicts that the class is one but the actual class is zero, then that's called a false positive. So that means our algorithm predicted that the patient has cancer, but in reality the patient does not. And finally, the last box is a zero, one. That's called a false negative because our algorithm predicted zero, but the actual class was one.

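Evaluating skewed classes calls for an error or evaluation metric other than accuracy. One such metric is the pair precision and recall. Suppose the classifier performs binary classification, so each example's actual class is 0 or 1, and the classifier outputs a predicted value of 0 or 1 for every test example. Draw a 2 x 2 table indexed by the actual class and the predicted class.

A test example whose actual class is 1 and predicted class is 1 is a True Positive. If the actual class is 0 and the predicted class is 0, it is a True Negative. "True" means the actual and predicted classes agree; "Positive" means the predicted class is 1 and "Negative" means the predicted class is 0. If the actual class is 0 and the predicted class is 1, it is a False Positive. If the actual class is 1 and the predicted class is 0, it is a False Negative. "False" means the actual and predicted classes disagree; again, "Positive" means the predicted class is 1 and "Negative" means the predicted class is 0. For example, if the patient is actually healthy but the algorithm predicts cancer, that is a False Positive.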
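The four cells of the 2 x 2 table can be counted directly from paired lists of actual and predicted classes. The following is a minimal sketch; the function name `confusion_counts` and the eight-example data set are made up for illustration.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for y, p in zip(actual, predicted):
        if p == 1 and y == 1:
            counts["tp"] += 1  # predicted 1, actually 1
        elif p == 1 and y == 0:
            counts["fp"] += 1  # predicted 1, actually 0
        elif p == 0 and y == 1:
            counts["fn"] += 1  # predicted 0, actually 1
        else:
            counts["tn"] += 1  # predicted 0, actually 0
    return counts


# Hypothetical labels for eight test examples.
actual = [1, 1, 0, 0, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1, 0, 0]

counts = confusion_counts(actual, predicted)
print(counts)  # {'tp': 2, 'fp': 1, 'fn': 1, 'tn': 4}
```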

And so, we have this little sort of two by two table based on what was the actual class and what was the predicted class. So here's a different way of evaluating the performance of our algorithm. We're going to compute two numbers. The first is called precision - and what that says is, of all the patients where we've predicted that they have cancer, what fraction of them actually have cancer? So let me write this down, the precision of a classifier is the number of true positives divided by the number that we predicted as positive, right? So of all the patients that we went to those patients and we told them, "We think you have cancer." Of all those patients, what fraction of them actually have cancer? So that's called precision. And another way to write this would be true positives and then in the denominator is the number of predicted positives, and so that would be the sum of the, you know, entries in this first row of the table. So it would be true positives divided by true positives. I'm going to abbreviate positive as POS and then plus false positives, again abbreviating positive using POS. So that's called precision, and as you can tell high precision would be good. That means that all the patients that we went to and we said, "You know, we're very sorry. We think you have cancer, " high precision means that of that group of patients most of them we had actually made accurate predictions on them and they do have cancer.

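We now have a 2 x 2 table built from the actual class and the predicted class, and it gives a different way to evaluate the algorithm. The first number, precision, is the fraction of patients who actually have cancer among all patients we predicted to have cancer. In other words, precision = (True Positives) / (number of predicted positives).

The denominator is the sum of the two entries in the first row of the table, so precision = True Positives / (True Positives + False Positives). It measures the share of true positives among everything predicted positive. Here "Positive" is abbreviated as POS. Higher precision is better: high precision means that when we tell patients "We think you have cancer," most of them really do have cancer.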

The second number we're going to compute is called recall, and what recall says is, of all the patients in, let's say, the test set or the cross-validation set, of all the patients in the data set that actually have cancer, what fraction of them did we correctly detect as having cancer. So of all the patients that have cancer, how many of them did we actually go to and, you know, correctly tell them that we think they need treatment. So, writing this down, recall is defined as the number of true positives, meaning the number of people that have cancer and that we correctly predicted have cancer, and we take that and divide that by the number of actual positives, so this is the number of actual positives, of all the people that do have cancer. What fraction do we correctly flag and, you know, send for treatment. So, to rewrite this in a different form, the denominator would be the number of actual positives, which, as you know, is the sum of the entries in this first column over here. And so writing things out differently, this is therefore, the number of true positives, divided by the number of true positives plus the number of false negatives. And so once again, having a high recall would be a good thing. So by computing precision and recall this will usually give us a better sense of how well our classifier is doing.

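The second number is recall. Recall is the fraction of patients we correctly predicted to have cancer among all patients who actually have cancer. In other words, recall = (True Positives) / (number of actual positives).

The denominator is the sum of the two entries in the first column of the table, so recall = True Positives / (True Positives + False Negatives). Higher recall is better. Together, precision and recall give a better sense of how well the classifier is doing.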
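The two definitions above translate into short Python functions. This is a sketch with made-up counts; returning 0.0 when the denominator is zero is a convention assumed here for convenience, not something stated in the lecture.

```python
def precision(tp, fp):
    """True positives divided by all predicted positives (tp + fp)."""
    predicted_positives = tp + fp
    # Assumed convention: report 0.0 when nothing was predicted positive.
    return tp / predicted_positives if predicted_positives else 0.0


def recall(tp, fn):
    """True positives divided by all actual positives (tp + fn)."""
    actual_positives = tp + fn
    # Assumed convention: report 0.0 when there are no actual positives.
    return tp / actual_positives if actual_positives else 0.0


# Hypothetical counts: 8 true positives, 2 false positives, 12 false negatives.
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=12)
print(p, r)  # 0.8 0.4
```

In this example 80% of the patients flagged as having cancer really have it, but only 40% of the actual cancer cases were caught.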

And in particular if we have a learning algorithm that predicts y equals zero all the time, if it predicts no one has cancer, then this classifier will have a recall equal to zero, because there won't be any true positives and so that's a quick way for us to recognize that, you know, a classifier that predicts y equals 0 all the time, just isn't a very good classifier. And more generally, even for settings where we have very skewed classes, it's not possible for an algorithm to sort of "cheat" and somehow get a very high precision and a very high recall by doing some simple thing like predicting y equals 0 all the time or predicting y equals 1 all the time. And so we're much more sure that a classifier of a high precision or high recall actually is a good classifier, and this gives us a more useful evaluation metric that is a more direct way to actually understand whether, you know, our algorithm may be doing well.

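In particular, a learning algorithm that always predicts y = 0, that is, predicts nobody has cancer, has a low error rate on skewed data but a recall of 0, because it has no True Positives. Recall therefore exposes quickly that a classifier which always predicts y = 0 is not a good one. More generally, precision and recall catch the problem that skewed classes create: a classifier that blindly predicts y = 0 or y = 1 cannot achieve both very high precision and very high recall. Precision and recall are thus a more direct evaluation metric for understanding whether the algorithm is really doing well.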
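To see why the "cheating" predictors fail, the sketch below scores an always-0 and an always-1 predictor on a hypothetical test set with 0.5% positives. The helper `evaluate` is illustrative; it reports `None` where a ratio is undefined (zero denominator), which is a choice made for this example only.

```python
def evaluate(actual, predicted):
    """Return accuracy, precision, and recall for binary labels (0 or 1)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    correct = sum(1 for y, p in pairs if y == p)
    return {
        "accuracy": correct / len(pairs),
        # None marks an undefined ratio (no predicted or no actual positives).
        "precision": tp / (tp + fp) if tp + fp else None,
        "recall": tp / (tp + fn) if tp + fn else None,
    }


# Hypothetical skewed test set: 5 positives out of 1000 examples.
actual = [1] * 5 + [0] * 995

always_zero = evaluate(actual, [0] * len(actual))
always_one = evaluate(actual, [1] * len(actual))

print(always_zero)  # accuracy 0.995, precision undefined, recall 0.0
print(always_one)   # accuracy 0.005, precision 0.005, recall 1.0
```

Always predicting 0 gets 99.5% accuracy but zero recall; always predicting 1 gets perfect recall but 0.5% precision. Neither trivial strategy scores well on both numbers.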

So one final note in the definition of precision and recall, that we would define precision and recall, usually we use the convention that y is equal to 1, in the presence of the more rare class. So if we are trying to detect rare conditions such as cancer, hopefully that's a rare condition, precision and recall are defined setting y equals 1, rather than y equals 0, to be sort of that the presence of that rare class that we're trying to detect. And by using precision and recall, we find, what happens is that even if we have very skewed classes, it's not possible for an algorithm to you know, "cheat" and predict y equals 1 all the time, or predict y equals 0 all the time, and get high precision and recall. And in particular, if a classifier is getting high precision and high recall, then we are actually confident that the algorithm has to be doing well, even if we have very skewed classes.

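One final note: when defining precision and recall, the convention is to set y = 1 for the rarer class. When detecting a rare condition such as cancer, precision and recall are defined with y = 1, not y = 0, marking the presence of the rare class we are trying to detect. With precision and recall, an algorithm cannot cheat on very skewed classes by always predicting y = 1 or always predicting y = 0. If a classifier achieves both high precision and high recall, we can be confident it is doing well even when the classes are very skewed.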

So for the problem of skewed classes precision recall gives us more direct insight into how the learning algorithm is doing and this is often a much better way to evaluate our learning algorithms, than looking at classification error or classification accuracy, when the classes are very skewed.

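For the problem of heavily skewed classes, precision and recall give more direct insight into how the learning algorithm is doing. When the classes are very skewed, looking at precision and recall is often a much better way to evaluate the algorithm than looking at classification error or classification accuracy.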

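Andrew Ng's Machine Learning video lecture

Summary

Skewed classification data is binary classification data in which one class has overwhelmingly fewer or more examples than the other. For example, out of 100 examples, 99 have y = 1 and 1 has y = 0. In that case, when a learning algorithm shows 1% error, you cannot tell whether it is actually working, because a non-learning algorithm that always answers y = 1 also has a 1% error rate.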