by 라인하트 Nov 08. 2020

Andrew Ng's Machine Learning (11-4): Precision and Recall

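Professor Andrew Ng, co-founder of the online learning platform Coursera, is a leading figure in the artificial intelligence field. The machine learning course he taught to beginners at Stanford University is available for free on Coursera (Coursera.org). It is an essential course for anyone starting out in machine learning, and one that people studying AI and machine learning on their own naturally come across.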

Advice for Applying Machine Learning

Handling Skewed Data

Trading Off Precision and Recall

In the last video, we talked about precision and recall as an evaluation metric for classification problems with skewed classes. For many applications, we'll want to somehow control the trade-off between precision and recall. Let me tell you how to do that and also show you some even more effective ways to use precision and recall as an evaluation metric for learning algorithms.

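In the previous lecture we used precision and recall as evaluation metrics for skewed data, where one class heavily outweighs the other. Many real machine learning applications need to trade off precision against recall. A trade-off means that gaining more of one comes at the expense of the other. This lecture explains how to use precision and recall more effectively as evaluation metrics.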

As a reminder, here are the definitions of precision and recall from the previous video. Let's continue our cancer classification example, where y equals 1 if the patient has cancer and y equals 0 otherwise. And let's say we've trained a logistic regression classifier which outputs a probability between 0 and 1. So, as usual, we're going to predict 1, y equals 1, if h(x) is greater than or equal to 0.5. And predict 0 if the hypothesis outputs a value less than 0.5.

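Here are the definitions of precision and recall. Precision = true positives / (true positives + false positives): of the patients we predicted to have cancer, the fraction who actually have it. Recall = true positives / (true positives + false negatives): of the patients who actually have cancer, the fraction we correctly detected.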

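We continue with the cancer classification example. y = 1 if the patient has cancer and y = 0 if the patient is healthy. The trained logistic regression classifier outputs a probability between 0 and 1. As usual, we predict y = 1 if hθ(x) >= 0.5 and y = 0 if hθ(x) < 0.5.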
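The thresholding rule and the two definitions can be sketched in a few lines of Python. This is only an illustration: the function names `predict` and `precision_recall` and the eight hypothesis outputs and labels are made up for the example and do not come from the lecture.

```python
def predict(probabilities, threshold=0.5):
    """Predict y=1 when the hypothesis output h(x) is at least the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def precision_recall(y_true, y_pred):
    """Return (precision, recall); a ratio with a zero denominator is reported as 0.0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical hypothesis outputs h(x) and true labels for eight patients.
h_x = [0.95, 0.80, 0.65, 0.55, 0.45, 0.35, 0.20, 0.05]
y = [1, 1, 0, 1, 0, 1, 0, 0]

y_hat = predict(h_x)  # default threshold of 0.5
p, r = precision_recall(y, y_hat)
print(f"precision={p:.2f} recall={r:.2f}")
```

With this toy data, four patients are flagged, three of them correctly, and one of the four real cancer cases is missed, so both precision and recall come out at 0.75.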

And this classifier may give us some value for precision and some value for recall. But now, suppose we want to predict that the patient has cancer only if we're very confident that they really do. Because if you go to a patient and you tell them that they have cancer, it's going to give them a huge shock. What we're giving is seriously bad news, and they may end up going through a pretty painful treatment process and so on. And so maybe we want to tell someone that we think they have cancer only if we are very confident. One way to do this would be to modify the algorithm, so that instead of setting this threshold at 0.5, we might instead say that we will predict that y is equal to 1 only if h(x) is greater than or equal to 0.7. So this is like saying, we'll tell someone they have cancer only if we think there's a greater than or equal to 70% chance that they have cancer. And, if you do this, then you're predicting someone has cancer only when you're more confident, and so you end up with a classifier that has higher precision. Because all of the patients that you're going to and saying, we think you have cancer, all of those patients are now ones that you're pretty confident actually have cancer. And so a higher fraction of the patients that you predict have cancer will actually turn out to have cancer, because we're making those predictions only if we're pretty confident. But in contrast this classifier will have lower recall, because now we're going to predict y = 1 on a smaller number of patients.

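We can compute the precision and recall of this logistic regression classifier. Now suppose we want to predict y = 1 only when we are very confident that the patient has cancer. Patients will be badly shocked to hear that cancer is suspected. It is very bad news for them, and cancer treatment is a painful process. So we want to tell patients only when we are confident they have cancer. We modify the algorithm and set the threshold to 0.7 instead of 0.5: we predict y = 1 if hθ(x) >= 0.7. A threshold of 0.7 means we predict cancer only when the estimated probability of cancer is at least 70%. Because the classifier predicts cancer only at higher probabilities, it has higher precision: a larger fraction of the patients predicted as y = 1 actually have cancer. In contrast, this classifier has lower recall, because it predicts y = 1 for fewer patients and therefore detects a smaller fraction of the patients who actually have cancer.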

Now, we can even take this further. Instead of setting the threshold at 0.7, we can set this at 0.9. Now we'll predict y=1 only if we are more than 90% certain that the patient has cancer. And so, a large fraction of those patients will turn out to have cancer. And so this would be a higher precision classifier, but it will have lower recall, because we will correctly detect fewer of the patients who have cancer.

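Let's raise the threshold further, from 0.7 to 0.9. We predict y = 1 only when the estimated probability of cancer is at least 90%. Most of the patients the algorithm flags really do have cancer. The classifier with a threshold of 0.9 has higher precision but lower recall.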

Now consider a different example. Suppose we want to avoid missing too many actual cases of cancer, so we want to avoid false negatives. In particular, if a patient actually has cancer, but we fail to tell them that they have cancer, then that can be really bad. Because if we tell a patient that they don't have cancer, then they're not going to go for treatment. And if it turns out that they have cancer, but we fail to tell them they have cancer, well, they may not get treated at all. And so that would be a really bad outcome because they die because we told them that they don't have cancer. They fail to get treated, but it turns out they actually have cancer. So, suppose that, when in doubt, we want to predict that y=1. So, when in doubt, we want to predict that they have cancer so that at least they look further into it, and they can get treated in case they do turn out to have cancer. In this case, rather than setting a higher probability threshold, we might instead take this value and set it to a lower value. So maybe 0.3 like so, right? And by doing so, we're saying that, you know what, if we think there's more than a 30% chance that they have cancer, we'd better be more conservative and tell them that they may have cancer so that they can seek treatment if necessary. And in this case what we would have is a higher recall classifier, because we're going to be correctly flagging a higher fraction of all of the patients that actually do have cancer. But we're going to end up with lower precision, because of the patients that we said have cancer, a higher fraction will turn out not to have cancer after all.

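Here is a different case. We do not want to miss patients who actually have cancer. If we fail to tell a patient who really has cancer, the patient loses the chance to be treated, which is even more dangerous. In the worst case, the patient may die because we never told them that cancer was suspected. To prevent this, the algorithm predicts y = 1 even when it is not certain, so that the patient is steered toward further tests and, if needed, treatment. We set the threshold to 0.3, lower than 0.5: if the estimated probability of cancer is at least 30%, we notify the patient and retest. This classifier has higher recall, because it flags a larger fraction of the patients who actually have cancer. Its precision drops, however, because more of the patients flagged as having cancer turn out not to have it.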
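To see the effect of the thresholds 0.3, 0.5, 0.7 and 0.9 discussed above, here is a small sketch that evaluates one classifier at each of them. The scores and labels are invented for illustration, and `precision_recall_at` is a hypothetical helper name, not something from the lecture.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting y=1 for scores >= threshold."""
    tp = fp = fn = 0
    for truth, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical hypothesis outputs h(x) and true labels for eight patients.
h_x = [0.95, 0.80, 0.65, 0.55, 0.45, 0.35, 0.20, 0.05]
y = [1, 1, 0, 1, 0, 1, 0, 0]

results = {}
for threshold in (0.3, 0.5, 0.7, 0.9):
    results[threshold] = precision_recall_at(y, h_x, threshold)
    p, r = results[threshold]
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

On this toy data, recall falls from 1.0 to 0.25 as the threshold rises from 0.3 to 0.9, while precision rises from about 0.67 to 1.0. On real data precision need not rise at every step, but the overall direction of the trade-off is usually the same.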

And by the way, just as an aside, when I talk about this to other students, I've been told before, it's pretty amazing, some of my students say, how I can tell the story both ways. Why we might want to have higher precision or higher recall, and the story actually seems to work both ways. But I hope the details of the algorithm are clear, and the more general principle is, depending on what you want, whether you want higher precision and lower recall, or higher recall and lower precision, you can end up predicting y=1 when h(x) is greater than some threshold. And so in general, for most classifiers there is going to be a trade-off between precision and recall, and as you vary the value of this threshold that we drew here, you can actually plot out some curve that trades off precision and recall.

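Some students find it surprising that the story can be argued both ways: there are good reasons to want higher precision and equally good reasons to want higher recall. In general, a classifier ends up with either higher precision and lower recall, or higher recall and lower precision, and which one you want depends on the application. In both cases we predict y = 1 when hθ(x) is greater than some threshold. Most classifiers exhibit a trade-off between precision and recall, and by varying the threshold you can plot a curve that trades one off against the other.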

Where a value up here, this would correspond to a very high value of the threshold, maybe threshold equals 0.99. So that's saying, predict y=1 only if we're more than 99% confident, at least 99% probability that y is 1. So that would be a high precision, relatively low recall. Whereas the point down here will correspond to a value of the threshold that's much lower, maybe equal to 0.01, meaning, when in doubt at all, predict y=1, and if you do that, you end up with a much lower precision, higher recall classifier. And as you vary the threshold, if you want you can actually trace out a curve for your classifier to see the range of different values you can get for precision and recall. And by the way, the precision-recall curve can look like many different shapes. Sometimes it will look like this, sometimes it will look like that. There are many different possible shapes for the precision-recall curve, depending on the details of the classifier. So, this raises another interesting question which is, is there a way to choose this threshold automatically?

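The lecture slide plots precision against recall. The upper-left part of the curve corresponds to a very high threshold such as 0.99, where we predict y = 1 only when we are at least 99% confident. This gives high precision and low recall. The lower-right part corresponds to a threshold such as 0.01, where we predict y = 1 whenever the estimated probability is at least 1%. This gives low precision and high recall. By varying the threshold we can trace out the precision-recall curve of a classifier. The curve can take many different shapes depending on the details of the classifier. So, is there a way to choose the threshold automatically?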
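Tracing the curve amounts to sweeping the threshold and recording one (precision, recall) point per value. The sketch below does that for a handful of thresholds on invented data; `pr_point` is a hypothetical helper, and the choice to skip thresholds at which nothing is predicted positive (where precision is 0/0) is an assumption of this example.

```python
def pr_point(y_true, scores, threshold):
    """Return (precision, recall) at a threshold, or None if no example is predicted positive."""
    tp = fp = fn = 0
    for truth, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 1:
            fn += 1
    if tp + fp == 0:
        return None  # precision is undefined when nothing is flagged
    return tp / (tp + fp), tp / (tp + fn)


# Hypothetical hypothesis outputs and labels; the labels contain at least one positive.
h_x = [0.95, 0.80, 0.65, 0.55, 0.45, 0.35, 0.20, 0.05]
y = [1, 1, 0, 1, 0, 1, 0, 0]

curve = []
for threshold in (0.01, 0.3, 0.5, 0.7, 0.9, 0.99):
    point = pr_point(y, h_x, threshold)
    if point is not None:
        curve.append((threshold, point[0], point[1]))

for threshold, precision, recall in curve:
    print(f"threshold={threshold:<4} precision={precision:.2f} recall={recall:.2f}")
```

The threshold of 0.01 flags every patient, so recall is 1.0 and precision equals the fraction of positives in the data. Recall can only stay the same or drop as the threshold rises, since fewer patients are flagged.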

Or more generally, if we have a few different algorithms or a few different ideas for algorithms, how do we compare different precision and recall numbers? Concretely, suppose we have three different learning algorithms. So actually, maybe these are three different learning algorithms, or maybe these are the same algorithm but just with different values for the threshold. How do we decide which of these algorithms is best? One of the things we talked about earlier is the importance of a single real number evaluation metric. And that is the idea of having a number that just tells you how well your classifier is doing. But by switching to the precision and recall metric we've actually lost that. We now have two real numbers. And so we often end up facing situations like, if we're trying to compare Algorithm 1 and Algorithm 2, we end up asking ourselves, is a precision of 0.5 and a recall of 0.4 better or worse than a precision of 0.7 and a recall of 0.1? And, if every time you try out a new algorithm you end up having to sit around and think, well, maybe 0.5/0.4 is better than 0.7/0.1, or maybe not, I don't know. If you end up having to sit around and think and make these decisions, that really slows down your decision making process for what changes are useful to incorporate into your algorithm.

Whereas in contrast, if we have a single real number evaluation metric like a number that just tells us is algorithm 1 or is algorithm 2 better, then that helps us to much more quickly decide which algorithm to go with. It helps us as well to much more quickly evaluate different changes that we may be contemplating for an algorithm. So how can we get a single real number evaluation metric? One natural thing that you might try is to look at the average precision and recall. So, using P and R to denote precision and recall, what you could do is just compute the average and look at what classifier has the highest average value.

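More generally, when we add new ideas to an algorithm or have several different algorithms, how do we compare their precision and recall? Suppose we have three pairs of precision and recall values, either from three different learning algorithms or from one algorithm with three different thresholds.

Which of the three is best? A single real-number evaluation metric shows how well a classifier is doing with one number.

Algorithm 1 has a precision of 0.5 and a recall of 0.4. Algorithm 2 has a precision of 0.7 and a recall of 0.1. Algorithm 3 has a precision of 0.02 and a recall of 1.0. Which of these is best? If we cannot tell, decisions become slower and more complicated.

With a single real-number evaluation metric, decisions become much easier. How can we build one? One option is to average precision and recall: compute the average of precision P and recall R, and pick the classifier with the highest average.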

But this turns out not to be such a good solution, because similar to the example we had earlier, it turns out that if we have a classifier that predicts y=1 all the time, then you can get a very high recall, but you end up with a very low value of precision. Conversely, if you have a classifier that predicts y equals zero almost all the time, that is, it predicts y=1 very sparingly, this corresponds to setting a very high threshold using the notation of the previous slide. Then you can actually end up with a very high precision with a very low recall. So, the two extremes of either a very high threshold or a very low threshold, neither of those will give a particularly good classifier. And the way we recognize that is by seeing that we end up with a very low precision or a very low recall. And if you just take the average (P+R)/2 from this example, the average is actually highest for Algorithm 3, even though you can get that sort of performance by predicting y=1 all the time, and that's just not a very good classifier, right? If you predict y=1 all the time, that's just not a useful classifier, because all it does is print out y=1. And so Algorithm 1 or Algorithm 2 would be more useful than Algorithm 3. But in this example, Algorithm 3 has a higher average value of precision and recall than Algorithms 1 and 2. So we usually think of this average of precision and recall as not a particularly good way to evaluate our learning algorithm.

In contrast, there's a different way of combining precision and recall. This is called the F Score and it uses that formula. And so in this example, here are the F Scores. And so we would tell from these F Scores, it looks like Algorithm 1 has the highest F Score, Algorithm 2 has the second highest, and Algorithm 3 has the lowest. And so, if we go by the F Score we would probably pick Algorithm 1 over the others. The F Score, which is also called the F1 Score, is usually written F1 Score as I have here, but often people will just say F Score; either term is used. It is a little bit like taking the average of precision and recall, but it gives the lower value of precision and recall, whichever it is, a higher weight. And so, you see in the numerator here that the F Score takes a product of precision and recall. And so if either precision is 0 or recall is equal to 0, the F Score will be equal to 0. So in that sense, it kind of combines precision and recall, but for the F Score to be large, both precision and recall have to be pretty large.

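However, the average of precision and recall is not a good solution. For example, a trivial classifier that always predicts y = 1 has very high recall and very low precision. Conversely, a classifier that predicts y = 0 almost all the time, that is, one that predicts y = 1 only with a very high threshold, has very high precision and very low recall. Neither a very low threshold nor a very high threshold gives a good classifier. In the table, Algorithm 3 has the highest average, but Algorithm 3 is a poor classifier that always predicts y = 1. Algorithm 1 and Algorithm 2 are far more useful than Algorithm 3. The average of precision and recall is therefore not a good way to evaluate a learning algorithm, and we need a different single real-number metric that combines the two. The most commonly used one is the F score: F1 score = 2PR / (P + R).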

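By F score, Algorithm 1 is highest, Algorithm 2 is second, and Algorithm 3 is lowest, so the F score selects Algorithm 1. The numerator of the F1 score multiplies precision and recall, so when either one is 0 or close to 0, the F score is also close to 0. The F1 score resembles an average, but it gives more weight to the lower of the two values and is large only when both precision and recall are large.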
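The comparison between the plain average and the F1 score can be checked numerically. The sketch below uses the three precision and recall pairs from the lecture, taking Algorithm 3 to be the always-predict-1 classifier with recall 1.0; the function names and the convention of returning 0.0 when P + R = 0 are assumptions of this example.

```python
def average(p, r):
    """Plain arithmetic mean of precision and recall."""
    return (p + r) / 2


def f1_score(p, r):
    """F1 = 2PR / (P + R); defined here as 0.0 when both P and R are 0."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# (precision, recall) for each algorithm in the lecture's example.
algorithms = {
    "Algorithm 1": (0.5, 0.4),
    "Algorithm 2": (0.7, 0.1),
    "Algorithm 3": (0.02, 1.0),
}

for name, (p, r) in algorithms.items():
    print(f"{name}: average={average(p, r):.3f} F1={f1_score(p, r):.3f}")

best_by_average = max(algorithms, key=lambda name: average(*algorithms[name]))
best_by_f1 = max(algorithms, key=lambda name: f1_score(*algorithms[name]))
print("best by average:", best_by_average)
print("best by F1:", best_by_f1)
```

The average ranks Algorithm 3 first (0.51 against 0.45 and 0.40), while F1 ranks Algorithm 1 first (about 0.444 against 0.175 and 0.039), which matches the ordering described in the lecture.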

I should say that there are many different possible formulas for combining precision and recall. This F Score formula is really just one out of a much larger number of possibilities, but historically or traditionally this is what people in Machine Learning seem to use. And the term F Score, it doesn't really mean anything, so don't worry about why it's called F Score or F1 Score. But this usually gives you the effect that you want, because if either precision is zero or recall is zero, this gives you a very low F Score, and so to have a high F Score, you kind of need both precision and recall to be high. And concretely, if P=0 or R=0, then this gives you that the F Score = 0. Whereas a perfect F Score, so if precision equals 1 and recall equals 1, that will give you an F Score that's equal to 2 times 1 times 1 over 1 plus 1, so the F Score will be equal to 1 if you have perfect precision and perfect recall. And for intermediate values between 0 and 1, this usually gives a reasonable rank ordering of different classifiers.

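There are in fact many possible formulas for combining precision and recall. The F1 score is the one that people in machine learning have traditionally used. The name "F1 score" has no particular meaning, but the metric has the desired effect. If precision or recall is 0, the score is very low, and a high score requires both high precision and high recall. Concretely, if P = 0 or R = 0, then F score = 0. If P = 1 and R = 1, the factor of 2 in the numerator gives F score = 2 · 1 · 1 / (1 + 1) = 1. Intermediate values between 0 and 1 give a reasonable ranking of classifiers.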
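Written out, the boundary cases look as follows. The harmonic-mean form on the right is a standard identity that holds for P, R > 0; it is added here only as a reading aid, since the lecture itself presents just the product form.

```latex
\[
F_1 = \frac{2PR}{P + R} = \frac{2}{\frac{1}{P} + \frac{1}{R}}
\]
\[
P = 1,\ R = 1: \quad F_1 = \frac{2 \cdot 1 \cdot 1}{1 + 1} = 1
\]
\[
P = 0,\ R > 0: \quad F_1 = \frac{2 \cdot 0 \cdot R}{0 + R} = 0
\]
\[
P = 0.02,\ R = 1: \quad F_1 = \frac{2 \cdot 0.02 \cdot 1}{0.02 + 1} \approx 0.039
\]
```

The last line shows why the always-predict-1 classifier scores poorly: the small precision dominates the result even though recall is perfect.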

So in this video, we talked about the notion of trading off between precision and recall, and how we can vary the threshold that we use to decide whether to predict y=1 or y=0. So it's the threshold that says, do we need to be at least 70% confident or 90% confident, or whatever, before we predict y=1. And by varying the threshold, you can control a trade-off between precision and recall. We also talked about the F Score, which takes precision and recall and, again, gives you a single real number evaluation metric. And of course, if your goal is to automatically set that threshold to decide what's really y=1 and y=0, one pretty reasonable way to do that would be to try a range of different values of thresholds. So you try a range of values of thresholds and evaluate these different thresholds on, say, your cross-validation set, and then pick whatever value of threshold gives you the highest F Score on your cross-validation set. And that would be a pretty reasonable way to automatically choose the threshold for your classifier as well.

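This lecture explained the trade-off between precision and recall and how to vary the threshold that decides between predicting y = 1 and y = 0. Do we need to be at least 70% or 90% confident before predicting y = 1? By varying the threshold we can control the trade-off between precision and recall. We also learned the F1 score, which combines precision and recall into a single real-number evaluation metric. To set the threshold automatically, try a range of threshold values, evaluate each one on the cross-validation set, and pick the threshold that gives the highest F1 score. This is a reasonable way to choose a classifier's threshold automatically.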
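The threshold search described above can be sketched as a simple loop over candidate values. Everything concrete here is an assumption for illustration: the cross-validation scores and labels are invented, the candidate grid of 0.1 to 0.9 is arbitrary, and `f1_at_threshold` is a hypothetical helper.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 score when predicting y=1 for scores >= threshold (0.0 if undefined)."""
    tp = fp = fn = 0
    for truth, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical hypothesis outputs and labels on a cross-validation set.
cv_scores = [0.95, 0.80, 0.65, 0.55, 0.45, 0.35, 0.20, 0.05]
cv_labels = [1, 1, 0, 1, 0, 1, 0, 0]

candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
f1_by_threshold = {t: f1_at_threshold(cv_labels, cv_scores, t) for t in candidates}

# Pick the candidate with the highest F1; ties go to the lowest threshold.
best_threshold = max(candidates, key=lambda t: f1_by_threshold[t])
best_f1 = f1_by_threshold[best_threshold]
print(f"best threshold={best_threshold} F1={best_f1:.3f}")
```

On this toy set the search settles on a threshold of 0.3. The selection is done on the cross-validation set, as the lecture suggests, so that the test set stays untouched for the final evaluation.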

Andrew Ng's machine learning video lecture

Summary

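Precision and recall are the real-number evaluation metrics used to check whether a classifier works properly on skewed classification data, where one class heavily outweighs the other.

There is a trade-off between precision and recall. Raising the threshold tends to increase precision and decrease recall, and lowering it tends to increase recall and decrease precision.

Varying the threshold traces out the precision-recall curve.

We therefore have to consider which threshold gives the best result. Averaging precision and recall is not effective, because a classifier with only very high precision or only very high recall can still get a high score. Machine learning practitioners use the F score instead: F1 score = 2PR / (P + R), where P is precision and R is recall.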