by 라인하트 Dec 04. 2020

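Andrew Ng's Machine Learning (15-4): Evaluating an Anomaly Detection System

Professor Andrew Ng, co-founder of the online learning platform Coursera, is a leading figure in artificial intelligence. The introductory machine learning course he taught at Stanford University is available for free on Coursera (Coursera.org). It is an essential course for machine learning beginners, and one that anyone studying AI and machine learning on their own will naturally come across.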

Anomaly Detection

Building an Anomaly Detection System

Developing and Evaluating an Anomaly Detection System

In the last video, we developed an anomaly detection algorithm. In this video, I'd like to talk about the process of how to go about developing a specific application of anomaly detection to a problem, and in particular this will focus on the problem of how to evaluate an anomaly detection algorithm.

In the last lecture, we developed an anomaly detection algorithm. This lecture covers how to evaluate an anomaly detection algorithm while developing an anomaly detection application.

In previous videos, we've already talked about the importance of real number evaluation, and this captures the idea that when you're trying to develop a learning algorithm for a specific application, you often need to make a lot of choices like, you know, choosing what features to use and so on. And making decisions about all of these choices is often much easier if you have a way to evaluate your learning algorithm that just gives you back a number. So if you're trying to decide, you know, I have an idea for one extra feature, do I include this feature or not? If you can run the algorithm with the feature, and run the algorithm without the feature, and just get back a number that tells you, you know, did it improve or worsen performance to add this feature, then it gives you a much better way, a much simpler way, with which to decide whether or not to include that feature. So in order to be able to develop an anomaly detection system quickly, it would be really helpful to have a way of evaluating an anomaly detection system.

An earlier lecture explained why evaluating with a single number matters. When you develop a learning algorithm, a numeric evaluation makes choices easier, such as deciding whether to add or remove a feature. For example, you run the algorithm once with the feature and once without it, then compare the resulting numbers. That makes it easy to tell whether performance improved or got worse. To develop an anomaly detection system quickly, we look at how to evaluate the algorithm with a number.

In order to do this, in order to evaluate an anomaly detection system, we're actually going to assume we have some labeled data. So far, we've been treating anomaly detection as an unsupervised learning problem, using unlabeled data. But if you have some labeled data that specifies what are some anomalous examples, and what are some non-anomalous examples, then this is what we actually think of as the standard way of evaluating an anomaly detection algorithm.

To evaluate an anomaly detection system, we assume we have some labeled data. So far we have treated anomaly detection as an unsupervised learning problem that uses unlabeled data. The standard way to evaluate an anomaly detection algorithm, however, is to use labeled data that distinguishes anomalous examples from normal ones.

So taking the aircraft engine example again. Let's say that, you know, we have some labeled data of just a few anomalous examples of some aircraft engines that were manufactured in the past that turned out to be anomalous. Turned out to be flawed or strange in some way. Let's say we also have some non-anomalous examples, so some perfectly okay examples. I'm going to use y equals 0 to denote the normal or the non-anomalous example and y equals 1 to denote the anomalous examples. The process of developing and evaluating an anomaly detection algorithm is as follows. We're going to think of it as a training set and talk about the cross validation and test sets later, but the training set we usually think of as still the unlabeled training set. And so this is our large collection of normal, non-anomalous examples. And usually we think of this as being non-anomalous, but it's actually okay even if a few anomalies slip into your unlabeled training set.

Back to the aircraft engine example. Suppose we have labeled data for a few engines manufactured in the past that turned out to be flawed or strange in some way, along with some perfectly normal examples. y = 0 denotes a normal example and y = 1 denotes an anomalous one. The process of developing and evaluating an anomaly detection algorithm is as follows. The unlabeled training set is treated as a collection of normal examples. It is fine if a few anomalous examples slip into it.

And next we are going to define a cross validation set and a test set, with which to evaluate a particular anomaly detection algorithm. So, specifically, for both the cross validation and test sets we're going to assume that, you know, we can include a few examples in the cross validation set and the test set that are known to be anomalous. So in the test set, say, we have a few examples with y equals 1 that correspond to anomalous aircraft engines.

Next, we define a cross validation set and a test set for evaluating the anomaly detection algorithm. Both sets include a few examples known to be anomalous. For instance, the test set contains a few anomalous aircraft engines with y = 1.

So here's a specific example. Let's say that, altogether, this is the data that we have. We have manufactured 10,000 examples of engines that, as far as we know, are perfectly normal, perfectly good aircraft engines. And again, it turns out to be okay even if a few flawed engines slip into the set of 10,000, but we kind of assume that the vast majority of these 10,000 examples are, you know, good and normal non-anomalous engines. And let's say that, you know, historically, however long we've been running our manufacturing plant, let's say that we end up getting 20 anomalous engines as well. And for a pretty typical application of anomaly detection, you know, the number of anomalous examples, that is with y equals 1, we may have anywhere from, you know, 2 to 50. It would be a pretty typical range of the number of examples that we have with y equals 1. And usually we will have a much larger number of good examples.

Here is a concrete example. We have manufactured 10,000 aircraft engines that are, as far as we know, perfectly good. It is fine if a few flawed engines are among those 10,000, as long as the vast majority are normal. Over the plant's history we have also collected about 20 anomalous engines. In a typical application, the number of anomalous examples with y = 1 ranges from about 2 to 50, while the number of normal examples is much larger.

So, given this data set, a fairly typical way to split it into the training set, cross validation set and test set would be as follows. Let's take 10,000 good aircraft engines and put 6,000 of them into the unlabeled training set. So, I'm calling this an unlabeled training set but all of these examples are really ones that correspond to y equals 0, as far as we know. And so, we will use this to fit p of x, right. So, we will use these 6,000 engines to fit p of x, which is the product of p of x1 parametrized by mu 1, sigma squared 1, up to p of xn parametrized by mu n, sigma squared n. And so it would be these 6,000 examples that we would use to estimate the parameters mu 1, sigma squared 1, up to mu n, sigma squared n. And so that's our training set of all, you know, good, or the vast majority of good examples.

Here is the data set. A typical approach is to split it into a training set, a cross validation set, and a test set. With 10,000 normal aircraft engines, we put 6,000 in the training set, 2,000 in the cross validation set, and 2,000 in the test set. We call the training set unlabeled, but as far as we know every example in it has y = 0. We use the 6,000 training engines to fit p(x).

These 6,000 training examples are used to estimate all of the parameters μ1, σ1^2, ..., μn, σn^2. The training set consists entirely, or almost entirely, of normal examples.
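As a rough illustration of this fitting step, the sketch below estimates one mean and one variance per feature from a list of training examples. The function name `fit_gaussian_params` and the tiny three-example training set are assumptions made for the example, and the variance uses the 1/m normalization from the earlier lectures of this course.

```python
def fit_gaussian_params(examples):
    """Estimate per-feature mean and variance (1/m normalization).

    `examples` is a list of feature vectors, all assumed to be normal (y = 0).
    """
    m = len(examples)
    n = len(examples[0])
    mu = [sum(x[j] for x in examples) / m for j in range(n)]
    sigma2 = [sum((x[j] - mu[j]) ** 2 for x in examples) / m for j in range(n)]
    return mu, sigma2


# Hypothetical training set with two features per engine.
train = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]
mu, sigma2 = fit_gaussian_params(train)
print(mu, sigma2)
```

In the lecture's setting, `train` would hold the 6,000 good engines rather than three made-up rows.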

Next we will take our good aircraft engines and put some number of them in a cross validation set plus some number of them in the test set. So 6,000 plus 2,000 plus 2,000, that's how we split up our 10,000 good aircraft engines. And then we also have 20 flawed aircraft engines, and we'll take that and maybe split it up, you know, put ten of them in the cross validation set and put ten of them in the test set. And in the next slide we will talk about how to actually use this to evaluate the anomaly detection algorithm. So what I have just described here is, you know, probably the recommended way of splitting the labeled and unlabeled examples, the good and the flawed aircraft engines. Where we use like a 60%, 20%, 20% split for the good engines and we take the flawed engines, and we put them just in the cross validation set, and just in the test set, then we'll see in the next slide why that's the case.

Again, we start with 10,000 normal aircraft engines and split them into 6,000 for the training set, 2,000 for the cross validation set, and 2,000 for the test set. We also take the 20 flawed engines and put 10 in the cross validation set and 10 in the test set. This is the recommended way to split labeled and unlabeled examples: a 60% / 20% / 20% split for the normal engines, with the flawed engines divided evenly between the cross validation set and the test set only.
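The recommended split can be sketched as follows. This is only an illustration: `split_engines` is a hypothetical helper, the feature vectors are made-up placeholders, and shuffling with a fixed seed is an assumption rather than something the lecture specifies.

```python
import random


def split_engines(good, flawed, seed=0):
    """Split good examples 60/20/20 and flawed examples half and half.

    Returns (train, cv, test). `train` is a list of unlabeled feature
    vectors; `cv` and `test` are lists of (x, y) pairs with y in {0, 1}.
    """
    rng = random.Random(seed)
    good = list(good)
    flawed = list(flawed)
    rng.shuffle(good)
    rng.shuffle(flawed)

    n_train = len(good) * 6 // 10   # 60% of the good engines
    n_cv = len(good) * 2 // 10      # 20% of the good engines
    half = len(flawed) // 2

    train = good[:n_train]
    cv = [(x, 0) for x in good[n_train:n_train + n_cv]]
    cv += [(x, 1) for x in flawed[:half]]
    test = [(x, 0) for x in good[n_train + n_cv:]]
    test += [(x, 1) for x in flawed[half:]]
    return train, cv, test


# Placeholder data: 10,000 good engines and 20 flawed ones.
good_engines = [[float(i), float(i % 7)] for i in range(10000)]
flawed_engines = [[-100.0 - i, 50.0] for i in range(20)]
train, cv, test = split_engines(good_engines, flawed_engines)
print(len(train), len(cv), len(test))
```

No flawed engine ends up in `train`, and the cross validation and test sets share no examples.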

Just as an aside, if you look at how people apply anomaly detection algorithms, sometimes you see other people split the data differently as well. So, another alternative, this is really not a recommended alternative, but some people will take your 10,000 good engines, maybe put 6,000 of them in your training set and then put the same 4,000 in the cross validation set and the test set. And so, you know, we like to think of the cross validation set and the test set as being completely different data sets to each other. But you know, in anomaly detection, sometimes you see people, sort of, use the same set of good engines in the cross validation set and the test set, and sometimes you see people use exactly the same set of anomalous engines in the cross validation set and the test set. And so, all of these are considered, you know, less good practices and definitely less recommended. Certainly using the same data in the cross validation set and the test set, that is not considered a good machine learning practice. But, sometimes you see people do this too.

Sometimes people split the data for anomaly detection differently. These alternatives are not recommended, but they do appear in practice. One is to put 6,000 of the 10,000 normal engines in the training set and then use the same remaining 4,000 for both the cross validation set and the test set. The cross validation set and the test set should normally be completely separate data sets, yet in anomaly detection people sometimes reuse the same normal engines in both, and sometimes the same anomalous engines in both. Using the same data for cross validation and testing is not good machine learning practice, even though you will see it done.

So, given the training, cross validation and test sets, here's how you evaluate or here is how you develop and evaluate an algorithm. First, we take the training set and we fit the model p of x. So, we fit, you know, all these Gaussians to my m unlabeled examples of aircraft engines, and these, I am calling them unlabeled examples, but these are really examples that we're assuming are good, normal aircraft engines. Then imagine that your anomaly detection algorithm is actually making predictions. So, on the cross validation or the test set, given, say, a test example x, think of the algorithm as predicting that y is equal to 1 if p of x is less than epsilon, and predicting y equals 0 if p of x is greater than or equal to epsilon. So, given x, it's trying to predict, what is the label, is it y equals 1 corresponding to an anomaly or is it y equals 0 corresponding to a normal example?

So given the training, cross validation, and test sets, how do you develop an algorithm? And more specifically, how do you evaluate an anomaly detection algorithm? Well, the first step is to take the unlabeled training set and to fit the model p of x to the training data. So you take this, you know, I'm calling it an unlabeled training set, but really, these are examples that we are assuming, the vast majority of which, are normal aircraft engines, not anomalies, and we fit the model p of x. We fit all those parameters for all the Gaussians on this data.

Let's look at how to develop and evaluate the algorithm with the training, cross validation, and test sets.

First, fit the model p(x) to the m training examples of aircraft engines by fitting the Gaussians to them. The unlabeled examples are assumed to be normal.

Second, the anomaly detection algorithm makes predictions on the cross validation set and the test set. If p(x) is less than epsilon, the example is anomalous (y = 1); if p(x) is greater than or equal to epsilon, it is normal (y = 0).

Given the training, cross validation, and test sets, how do we develop and evaluate the algorithm? The first step is to fit the model p(x) to the unlabeled training set x^(1), x^(2), ..., x^(m). We assume that the vast majority of the training examples are normal, and we fit all of the Gaussian parameters of p(x) on this data.

Next, on the cross validation or the test set, we're going to think of the anomaly detection algorithm as trying to predict the value of y. So in each of, like, say, the test examples, we have these x(i) test, y(i) test, where y is going to be equal to 1 or 0 depending on whether this was an anomalous example. So given input x in my test set, think of my anomaly detection algorithm as predicting y as 1 if p of x is less than epsilon. So predicting that it is an anomaly if its probability is very low. And we think of the algorithm as predicting that y is equal to 0 if p of x is greater than or equal to epsilon. So predicting a normal example if p of x is reasonably large.

The second step is for the anomaly detection algorithm to make predictions on the cross validation set and the test set. Each test example is a pair (xtest^(i), ytest^(i)), where ytest^(i) is 1 if the example is anomalous and 0 if it is normal. Given a test input xtest^(i), the algorithm predicts y = 1 if p(x) < ε and y = 0 if p(x) ≥ ε.

In other words, it predicts a normal example when p(x) is reasonably large.
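The prediction rule can be sketched in a few lines. This assumes, as in the earlier lectures, that p(x) is the product of one univariate Gaussian density per feature; the parameter values, the threshold `epsilon = 0.01`, and the two test points are made up for illustration.

```python
import math


def gaussian(x_j, mu_j, sigma2_j):
    """Univariate Gaussian density for a single feature."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma2_j)
    return coeff * math.exp(-((x_j - mu_j) ** 2) / (2.0 * sigma2_j))


def p(x, mu, sigma2):
    """p(x) as the product of the per-feature Gaussian densities."""
    prob = 1.0
    for x_j, mu_j, s2_j in zip(x, mu, sigma2):
        prob *= gaussian(x_j, mu_j, s2_j)
    return prob


def predict(x, mu, sigma2, epsilon):
    """Return 1 (anomaly) if p(x) < epsilon, otherwise 0 (normal)."""
    return 1 if p(x, mu, sigma2) < epsilon else 0


# Hypothetical fitted parameters and threshold.
mu = [0.0, 0.0]
sigma2 = [1.0, 1.0]
epsilon = 0.01

print(predict([0.1, -0.2], mu, sigma2, epsilon))  # close to the mean
print(predict([5.0, 5.0], mu, sigma2, epsilon))   # far from the mean
```

The point near the mean has a large p(x) and is predicted normal, while the distant point falls below the threshold and is flagged.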

And so we can now think of the anomaly detection algorithm as making predictions for what are the values of these y labels in the test set or on the cross validation set. And this puts us somewhat more similar to the supervised learning setting, right? Where we have a labeled test set and our algorithm is making predictions on these labels, and so we can evaluate it, you know, by seeing how often it gets these labels right. Of course these labels will be very skewed because y equals 0, that is normal examples, will usually be much more common than y equals 1, the anomalous examples. But, you know, this is much closer to the sorts of evaluation metrics we can use in supervised learning.

We can think of the anomaly detection algorithm as predicting the value of the label y on the cross validation set and the test set. This is somewhat similar to supervised learning: we have a labeled test set, the algorithm predicts the labels, and we evaluate it by how often its predictions match. The labels are very skewed, because normal examples with y = 0 are far more common than anomalous examples with y = 1. Even so, we can use the kinds of evaluation metrics used in supervised learning.

So what's a good evaluation metric to use? Well, because the data is very skewed, because y equals 0 is much more common, classification accuracy would not be a good evaluation metric. So, we talked about this in the earlier video. So, if you have a very skewed data set, then predicting y equals 0 all the time will have very high classification accuracy. Instead, we should use evaluation metrics like computing the fraction of true positives, false positives, false negatives, true negatives, or compute the precision and recall of this algorithm, or do things like compute the F1 score, right, which is a single real number way of summarizing the precision and the recall numbers. And so these would be ways to evaluate an anomaly detection algorithm on your cross validation set or on your test set.

What is a good evaluation metric? Because the data is heavily skewed toward y = 0, classification accuracy is not a good one. As an earlier lecture covered, on a very skewed data set, always predicting y = 0 yields very high classification accuracy. Instead, we count true positives, false positives, false negatives, and true negatives, or compute the algorithm's precision and recall, or compute the F1 score. These are ways to evaluate an anomaly detection algorithm on the cross validation set or the test set.
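A small sketch makes the accuracy problem concrete. The labels and predictions below are invented to mimic a skewed cross validation set (2,000 normal, 10 anomalous), and returning 0 for an undefined precision or recall is a convention assumed here, not something the lecture states.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, with y = 1 as positive."""
    tp = fp = fn = tn = 0
    for y, pred in zip(y_true, y_pred):
        if y == 1 and pred == 1:
            tp += 1
        elif y == 0 and pred == 1:
            fp += 1
        elif y == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def evaluate(y_true, y_pred):
    """Return (accuracy, precision, recall, f1)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# Skewed labels: 2,000 normal examples followed by 10 anomalies.
y_true = [0] * 2000 + [1] * 10

# A "detector" that always predicts normal.
always_normal = [0] * 2010

# A detector with 10 false alarms that catches 8 of the 10 anomalies.
detector = [0] * 1990 + [1] * 10 + [1] * 8 + [0] * 2

print(evaluate(y_true, always_normal))
print(evaluate(y_true, detector))
```

Always predicting y = 0 scores slightly higher on accuracy than the working detector, yet its F1 score is 0, which is why accuracy is a poor guide here.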

Finally, earlier in the anomaly detection algorithm, we also had this parameter epsilon, right? So, epsilon is this threshold that we would use to decide when to flag something as an anomaly. And so, if you have a cross validation set, another way to choose this parameter epsilon would be to try many different values of epsilon, and then pick the value of epsilon that maximizes the F1 score, or that otherwise does well on your cross validation set. And more generally, the way to use the training, testing, and cross validation sets is that when we are trying to make decisions, like what features to include, or trying to, you know, tune the parameter epsilon, we would then continually evaluate the algorithm on the cross validation set and make all those decisions, like what features to use, you know, how to set epsilon, by evaluating the algorithm on the cross validation set, and then when we've picked the set of features, when we've found the value of epsilon that we're happy with, we can then take the final model and evaluate it, you know, do the final evaluation of the algorithm on the test set.

Finally, consider the parameter epsilon (ε) of the anomaly detection algorithm. ε is the threshold that decides whether to flag something as an anomaly. With a cross validation set, one way to choose ε is to try many different values and pick the one that maximizes the F1 score, or otherwise does well, on the cross validation set. More generally, whenever we decide which features to include or tune ε, we evaluate the algorithm repeatedly on the cross validation set and make those decisions there. Once we have settled on the features and a value of ε we are happy with, we do the final evaluation of the model on the test set.
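One possible way to carry out this search is sketched below. It assumes p(x) has already been computed for every cross validation example; the function names, the evenly spaced grid of 1,000 candidate thresholds, and the six made-up probabilities are all illustrative choices, not the lecture's prescription.

```python
def f1_at(p_values, labels, epsilon):
    """F1 score on labeled data when flagging p(x) < epsilon as anomalous."""
    tp = fp = fn = 0
    for p_x, y in zip(p_values, labels):
        pred = 1 if p_x < epsilon else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    if tp == 0:
        return 0.0  # precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_epsilon(p_values, labels, num_steps=1000):
    """Try evenly spaced thresholds and keep the one with the best F1."""
    lo, hi = min(p_values), max(p_values)
    best_eps, best_f1 = lo, 0.0
    for k in range(1, num_steps + 1):
        eps = lo + (hi - lo) * k / num_steps
        f1 = f1_at(p_values, labels, eps)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1


# Hypothetical p(x) values on a tiny cross validation set.
cv_p = [0.20, 0.18, 0.15, 0.12, 0.002, 0.001]
cv_y = [0, 0, 0, 0, 1, 1]

best_eps, best_f1 = select_epsilon(cv_p, cv_y)
print(best_eps, best_f1)
```

The threshold chosen this way is tuned on the cross validation set only; the test set is kept for the final evaluation.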

In this video, we started to use a bit of labeled data in order to evaluate the anomaly detection algorithm, and this takes us a little bit closer to a supervised learning setting. So, in this video, we talked about the process of how to evaluate an anomaly detection algorithm, and again, being able to evaluate an algorithm, you know, with a single real number evaluation, with a number like an F1 score, often allows you to make much more efficient use of your time when you are trying to develop an anomaly detection system.

This lecture covered how to use labeled data to evaluate an anomaly detection algorithm, which is similar to evaluation in supervised learning. A single-number metric such as the F1 score lets you use your time more efficiently when developing an anomaly detection system.

And when we try to make these sorts of decisions, like how to choose epsilon, what features to include, and so on. In the next video, I'm going to say a bit more about that. And in particular we'll talk about when you should be using an anomaly detection algorithm and when you should be thinking about using supervised learning instead, and what are the differences between these two formalisms.

The next lecture says more about decisions such as how to choose epsilon (ε) and which features to include. In particular, it covers when to use an anomaly detection algorithm, when to consider supervised learning instead, and how the two formalisms differ.

Andrew Ng's machine learning video lecture

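Summary

An anomaly detection algorithm detects whether a behavior or an object is normal or abnormal. Mathematically, it models the probability p(x) from a data set. If p(xtest) for a new example is less than epsilon (ε), the example is flagged as anomalous; if p(xtest) is greater than or equal to ε, it is flagged as normal.

The steps for developing an anomaly detection algorithm are as follows.

1) Choose features xi that can distinguish abnormal or anomalous examples.

Define features that can pick out, for example, unusual users who might be committing fraud on a system, or abnormal aircraft engines. Mathematically, an anomalous example will take an unusually large or unusually small value for some feature xi.

2) Estimate the parameters μ1, μ2, ..., μn and σ1^2, σ2^2, ..., σn^2.

Both sets of parameters are estimated from the training set.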
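For reference, the estimates from the earlier lectures of this course can be written as below. This is a reconstruction under the assumption of a training set x^(1), ..., x^(m) with n features, using the 1/m normalization rather than 1/(m-1).

```latex
\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}, \qquad
\sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} \left( x_j^{(i)} - \mu_j \right)^2, \qquad
j = 1, \dots, n
```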

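3) Compute the probability p(x) over all features for a new example.

Given a new example, that is, a new aircraft engine, we determine whether it is anomalous.

4) Decide whether the new example is anomalous or normal.

If p(xtest) is less than epsilon (ε), flag the example as anomalous; if it is greater than or equal to ε, flag it as normal. In a two-dimensional plot, points outside the ellipse are flagged as anomalous and points inside it as normal. In a three-dimensional plot, points whose height p(x) is below ε are flagged as anomalous and points whose height is above ε as normal.

Evaluating an Anomaly Detection Algorithm

Having a numeric way to evaluate an anomaly detection algorithm during development makes every choice easier. For example, if you run the algorithm once with a particular feature and once without it and compare the resulting numbers, it is easy to tell which version improves performance and which one hurts it.

As in supervised learning evaluation, when developing an anomaly detection algorithm we split the data set as follows. The normal engine data is split 60% / 20% / 20% into the training, cross validation, and test sets, and the flawed engine data is divided evenly between the cross validation set and the test set.

After estimating the parameters μj and σj^2 on the training set, compute p(x) on the cross validation set and the test set, and check whether normal examples are flagged as normal and anomalous examples as anomalous. On a heavily skewed data set, always predicting y = 0 yields very high classification accuracy. So instead, count true positives, false positives, false negatives, and true negatives, or compute the algorithm's precision and recall, or compute the F1 score.

This is how to evaluate an anomaly detection algorithm on the cross validation set and the test set.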

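Quiz

When evaluating the algorithm on the cross validation set and the test set, the algorithm makes predictions as follows. Is classification accuracy a good way to measure the algorithm's performance?

The answer is option 3.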