by 라인하트 Dec 06. 2020

Andrew Ng's Machine Learning (15-6): Choosing Features for Anomaly Detection

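Professor Andrew Ng, co-founder of the online learning platform Coursera, is a leading figure in artificial intelligence. The introductory machine learning course he taught at Stanford University is available for free on Coursera (Coursera.org). It is an essential course for machine learning beginners, and one you naturally come across when studying AI and machine learning on your own.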

Anomaly Detection

Building an Anomaly Detection System

Choosing What Features to Use

By now you've seen the anomaly detection algorithm and we've also talked about how to evaluate an anomaly detection algorithm. It turns out that when you're applying anomaly detection, one of the things that has a huge effect on how well it does is what features you use, and what features you choose, to give the anomaly detection algorithm. So in this video, what I'd like to do is say a few words, give some suggestions and guidelines for how to go about designing or selecting features to give to an anomaly detection algorithm.

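So far we have covered how the anomaly detection algorithm works and how to evaluate it. One of the factors with the greatest effect on its performance is the choice of features. This lecture covers how to design or select features for an anomaly detection algorithm.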

In our anomaly detection algorithm, one of the things we did was model the features using this sort of Gaussian distribution, with x_i ~ N(μ_i, σ_i^2), let's say. And so one thing that I often do would be to plot the data or the histogram of the data, to make sure that the data looks vaguely Gaussian before feeding it to my anomaly detection algorithm. And, it'll usually work okay, even if your data isn't Gaussian, but this is sort of a nice sanity check to run. And by the way, in case your data looks non-Gaussian, the algorithms will often work just fine. But, concretely if I plot the data like this, and if it looks like a histogram like this, and the way to plot a histogram is to use the hist command in Octave. But it looks like this, this looks vaguely Gaussian, so if my features look like this, I would be pretty happy feeding into my algorithm.

The anomaly detection algorithm models each feature with a Gaussian distribution. Feature x_i is modeled as follows.

x_i ~ N(μ_i, σ_i^2)

Before applying the anomaly detection algorithm, plot the data, or a histogram of it, to check whether it looks roughly Gaussian. A histogram is a bar chart showing how many values of a variable fall into each interval. The algorithm usually works even when the data is not Gaussian, so this is a sanity check rather than a requirement. In Octave, the hist() function plots a histogram. If the histogram looks roughly Gaussian (bell-shaped), the feature can be fed to the algorithm as is.
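The per-feature Gaussian model described above can be sketched in a few lines. This is a minimal Python illustration, not code from the lecture: the function names `fit_gaussian` and `gaussian_pdf` and the sample values are assumptions, and the variance is estimated by dividing by m.

```python
import math

def fit_gaussian(values):
    """Return (mu, sigma2) for one feature, dividing by m (maximum likelihood)."""
    m = len(values)
    mu = sum(values) / m
    sigma2 = sum((v - mu) ** 2 for v in values) / m
    return mu, sigma2

def gaussian_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

# Hypothetical measurements of a single feature x_i.
feature = [4.0, 5.0, 6.0, 5.0, 5.0]
mu, sigma2 = fit_gaussian(feature)
print(mu, sigma2)

# A value near the mean gets a higher density than a value far from it.
print(gaussian_pdf(5.0, mu, sigma2), gaussian_pdf(8.0, mu, sigma2))
```

If the histogram of a feature is far from bell-shaped, the fitted μ and σ^2 describe the data poorly, which is why the histogram check comes first.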

But if I were to plot a histogram of my data, and it were to look like this. Well, this doesn't look at all like a bell shaped curve, this is a very asymmetric distribution, it has a peak way off to one side. If this is what my data looks like, what I'll often do is play with different transformations of the data in order to make it look more Gaussian. And again the algorithm will usually work okay, even if you don't. But if you use these transformations to make your data more Gaussian, it might work a bit better. So given the data set that looks like this, what I might do is take a log transformation of the data and if I do that and re-plot the histogram, what I end up with in this particular example, is a histogram that looks like this. And this looks much more Gaussian, right? This looks much more like the classic bell shaped curve, that we can fit with some mean and variance parameters.

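However, the histogram may instead look strongly asymmetric, with a peak far off to one side rather than a bell shape. For data like this, a common approach is to try different transformations to make it look more Gaussian. The algorithm usually still works on skewed data, but it may work somewhat better on data that is closer to Gaussian. For example, apply a log transformation to the skewed data and plot the histogram again. In this example, the result looks much more like the classic bell-shaped curve, which can be fit with a mean μ and a variance σ^2.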

So what I mean by taking a log transform, is really that if I have some feature x1 and then the histogram of x1 looks like this then I might take my feature x1 and replace it with log of x1 and this is my new x1 that I'll plot to the histogram over on the right, and this looks much more Gaussian. Rather than just a log transform, there are some other things you can do.

If the histogram of feature x1 looks skewed like this, replace x1 with log(x1). The histogram of the new x1 then looks much more Gaussian.

Let's say I have a different feature x2, maybe I'll replace that with log of x2 plus 1, or more generally with log of x2 plus some constant c, and this constant could be something that I play with, to try to make it look as Gaussian as possible. Or for a different feature x3, maybe I'll replace it with the square root of x3. The square root is just x3 to the power of one half, right? And this one half is another example of a parameter I can play with. So, I might have x4 and maybe I might instead replace that with x4 to the power of something else, maybe to the power of 1/3. And these, all of these, this one, this exponent parameter, or the c parameter, all of these are examples of parameters that you can play with in order to make your data look a little bit more Gaussian.

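Suppose there is another feature x2. It can be replaced with log(x2 + 1) or, more generally, log(x2 + c), where c is a constant you can tune. Another feature x3 can be replaced with its square root, √(x3), which is the same as x3^(1/2). A feature x4 can be replaced with x4^(1/3), its cube root. The exponents (1/2, 1/3) and the constant c are all parameters you can adjust to make the data look more Gaussian.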
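The transformations above are simple element-wise functions with one tunable parameter each. The Python sketch below illustrates them; the helper names `log_shift` and `power` and the sample values are hypothetical and not part of the lecture.

```python
import math

def log_shift(values, c=1.0):
    """Replace each x with log(x + c); requires x + c > 0 for every value."""
    return [math.log(v + c) for v in values]

def power(values, p=0.5):
    """Replace each x with x ** p; p = 0.5 is the square root, p = 1/3 the cube root."""
    return [v ** p for v in values]

# Hypothetical non-negative feature values.
x = [0.0, 1.0, 8.0, 64.0]

print(log_shift(x, c=1.0))      # log(x + 1), defined even where x is 0
print(power(x, p=0.5))          # square root
print(power(x, p=1.0 / 3.0))    # cube root
```

The constant c matters when a feature contains zeros, because log(0) is undefined while log(0 + c) is not.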

So, let me show you a live demo of how I actually go about playing with my data to make it look more Gaussian. So, I have already loaded into Octave here a set of features x. I have a thousand examples loaded over there. So let's pull up the histogram of my data. Use the hist(x) command.

This part shows how to make data look more Gaussian, with a live demo in Octave. A feature x with a thousand examples has already been loaded into Octave.

size(x) % check the size of the data

hist(x) % plot a histogram of the data

So there's my histogram. By default, I think this uses 10 bins, but I want to see a more fine-grained histogram. So we do hist(x, 50), so this plots it in 50 different bins. Okay, that looks better. Now, this doesn't look very Gaussian, does it?

By default, hist uses 10 bins. To see a finer-grained histogram, use 50 bins.

hist(x, 50) % plot a histogram of the data with 50 bins
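To make the role of the bin count concrete, here is a small Python sketch of what a histogram computes: counts of values in equal-width intervals between the minimum and the maximum. It is an illustration only, not Octave's implementation; `histogram_counts`, `print_histogram`, and the sample data are hypothetical.

```python
def histogram_counts(values, bins=10):
    """Count values in `bins` equal-width intervals between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        if width == 0:
            idx = 0  # all values identical: put everything in the first bin
        else:
            idx = min(int((v - lo) / width), bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    return counts

def print_histogram(counts):
    """Print one row of '#' characters per bin."""
    for i, c in enumerate(counts):
        print(f"bin {i:2d} | " + "#" * c)

# Hypothetical values.
data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
counts = histogram_counts(data, bins=5)
print_histogram(counts)
```

More bins give a finer view of the shape, at the cost of noisier counts per bin when there are few examples.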

That looks better, but it does not look very Gaussian.

So, let's start playing around with the data. Let's try a hist of x to the 0.5. So we take the square root of the data, and plot that histogram. And, okay, it looks a little bit more Gaussian, but not quite there, so let's play with the 0.5 parameter.

So we transform the data. Take the square root of x and plot the histogram.

hist(x.^0.5, 50) % histogram of the element-wise square root of x, 50 bins

It looks a little more Gaussian, but not quite there.

Let's see. Set this to 0.2. Looks a little bit more Gaussian.

Raise x to the power 0.2 and plot the histogram.

hist(x.^0.2, 50) % histogram of x raised element-wise to the power 0.2, 50 bins

It looks a little more Gaussian.

Let's reduce it a little bit more, to 0.1. Yeah, that looks pretty good. I could actually just use 0.1.

Raise x to the power 0.1 and plot the histogram.

hist(x.^0.1, 50) % histogram of x raised element-wise to the power 0.1, 50 bins

That looks pretty good. This value could actually be used.

Well, let's reduce it to 0.05. And, you know? Okay, this looks pretty Gaussian, so I can define a new feature which is xNew equals x to the 0.05, and now my new feature xNew looks more Gaussian than my previous one and then I might instead use this new feature to feed into my anomaly detection algorithm.

Raise x to the power 0.05 and plot the histogram.

hist(x.^0.05, 50) % histogram of x raised element-wise to the power 0.05, 50 bins

xNew = x.^0.05; % store the transformed feature in xNew

This looks pretty Gaussian. Define a new feature xNew as x raised to the power 0.05. The new feature xNew looks more Gaussian than the original x, so feed xNew to the anomaly detection algorithm instead of x.

And of course, there is more than one way to do this. You could also have hist of log of x, that's another example of a transformation you can use. And, you know, that also looks pretty Gaussian. So, I can also define xNew equals log of x, and that would be another pretty good choice of a feature to use.

Of course, there is more than one way to do this. Take the log of x and plot the histogram.

hist(log(x), 50) % histogram of log(x), 50 bins

xNew = log(x); % store the log-transformed feature in xNew

This is another example of a transformation. It also looks pretty Gaussian, so xNew = log(x) would be another good choice of feature.
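The demo judges each transformation by eye. As a rough numeric stand-in, the Python sketch below compares the sample skewness of several candidates; a value near zero suggests a more symmetric, more Gaussian-looking histogram. Skewness is a proxy chosen here for illustration, not the lecture's method, and the lognormal data and the `skewness` helper are assumptions.

```python
import math
import random

def skewness(values):
    """Standardized third central moment; 0 for a symmetric distribution."""
    m = len(values)
    mu = sum(values) / m
    var = sum((v - mu) ** 2 for v in values) / m
    third = sum((v - mu) ** 3 for v in values) / m
    return third / var ** 1.5

random.seed(0)
# Hypothetical right-skewed, strictly positive feature with 1000 examples.
x = [random.lognormvariate(0.0, 1.0) for _ in range(1000)]

# Same candidates as the Octave session: x^0.5, x^0.2, x^0.1, x^0.05, log(x).
candidates = {"x": x}
for p in (0.5, 0.2, 0.1, 0.05):
    candidates[f"x^{p}"] = [v ** p for v in x]
candidates["log(x)"] = [math.log(v) for v in x]

skews = {name: skewness(vals) for name, vals in candidates.items()}
for name, s in skews.items():
    print(f"{name:8s} skewness = {s:+.3f}")
```

A number like this only supplements the histogram; it is still worth plotting the transformed feature before using it.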

So to summarize, if you plot a histogram of the data, and find that it looks pretty non-Gaussian, it's worth playing around a little bit with different transformations like these, to see if you can make your data look a little bit more Gaussian, before you feed it to your learning algorithm, although even if you don't, it might work okay. But I usually do take this step.

To summarize, plot a histogram of the data to check whether it looks Gaussian. If it does not, try different transformations to make it look more Gaussian before feeding it to the learning algorithm. Even without this adjustment, the algorithm may work fine.

Now, the second thing I want to talk about is, how do you come up with features for an anomaly detection algorithm. And the way I often do so, is via an error analysis procedure. So what I mean by that, is that this is really similar to the error analysis procedure that we have for supervised learning, where we would train a complete algorithm, and run the algorithm on a cross validation set, and look at the examples it gets wrong, and see if we can come up with extra features to help the algorithm do better on the examples that it got wrong in the cross-validation set. So let's try to reason through an example of this process.

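Second, we look at how to come up with features for an anomaly detection algorithm. The approach is an error analysis procedure, much like the one used for supervised learning. Train the complete algorithm, run it on the cross-validation set, look at the examples it gets wrong, and see whether you can come up with extra features that help it do better on those examples. Let's walk through this process.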

In anomaly detection, we are hoping that p of x will be large for the normal examples and it will be small for the anomalous examples. And so a pretty common problem would be if p of x is comparable, maybe both are large for both the normal and the anomalous examples. Let's look at a specific example of that. Let's say that this is my unlabeled data. So, here I have just one feature, x1 and so I'm gonna fit a Gaussian to this. And maybe my Gaussian that I fit to my data looks like that. And now let's say I have an anomalous example, and let's say that my anomalous example takes on an x value of 2.5. So I plot my anomalous example there. And you know, it's kind of buried in the middle of a bunch of normal examples, and so, just this anomalous example that I've drawn in green, it gets a pretty high probability, where it's the height of the blue curve, and the algorithm fails to flag this as an anomalous example. Now, if this were maybe aircraft engine manufacturing or something, what I would do is, I would actually look at my training examples and look at what went wrong with that particular aircraft engine, and see, if looking at that example can inspire me to come up with a new feature x2, that helps to distinguish between this bad example, compared to the rest of my red examples, compared to all of my normal aircraft engines.

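In anomaly detection, we want p(x) to be large for normal examples and small for anomalous examples. A common problem is that p(x) is comparable, for instance large, for both normal and anomalous examples. Suppose the unlabeled aircraft engine data has a single feature x1, and we fit a Gaussian to it. Suppose an anomalous example, drawn in green, has x1 = 2.5. It sits in the middle of the normal examples, so it gets a fairly high probability and the algorithm fails to flag it as an anomaly. For error analysis, look at that example and work out what went wrong with that particular engine. The goal is a new feature x2 that distinguishes the anomalous example from the normal engines drawn in red.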

And if I managed to do so, the hope would be then, that, if I can create a new feature, x2, so that when I re-plot my data, if I take all my normal examples of my training set, hopefully I find that all my training examples are these red crosses here. And hopefully, I find that for my anomalous example, the feature x2 takes on an unusual value. So for my green example here, this anomaly, right, my x1 value is still 2.5. Then maybe my x2 value, hopefully it takes on a very large value like 3.5 over there, or a very small value. But now, if I model my data, I'll find that my anomaly detection algorithm gives high probability to data in the central regions, slightly lower probability to that, slightly lower probability to that. An example that's all the way out there, my algorithm will now give very low probability to. And so, the process of this is, really look at the mistakes that it is making. Look at the anomaly that the algorithm is failing to flag, and see if that inspires you to create some new feature. So find something unusual about that aircraft engine and use that to create a new feature, so that with this new feature it becomes easier to distinguish the anomalies from your good examples. And so that's the process of error analysis and using that to create new features for anomaly detection.

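Design the new feature x2 and plot the training set in the two-dimensional (x1, x2) plane. The green example still has x1 = 2.5, but its x2 value is, say, 3.5, far outside the normal range (it could equally be a very small value). The green example is now normal in x1 but anomalous in x2. When the algorithm models p(x), data in the central region gets high probability, the surrounding regions get progressively lower probability, and an example far out gets very low probability. This is the process of finding new features: look at the anomalies the algorithm fails to flag and see whether they suggest a new feature. Find something unusual about the faulty aircraft engine and turn it into a feature, so that anomalies become easier to distinguish from normal examples. That is error analysis applied to creating new features for anomaly detection.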
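The effect of adding x2 can be checked numerically. The Python sketch below fits independent Gaussians to a training set and evaluates p(x) for the example (x1, x2) = (2.5, 3.5), first with x1 alone and then with both features. The training data, its spread, and the threshold `epsilon` are hypothetical; only the values 2.5 and 3.5 come from the lecture.

```python
import math
import random

def fit_gaussian(values):
    """Return (mu, sigma2) for one feature, dividing by m."""
    m = len(values)
    mu = sum(values) / m
    sigma2 = sum((v - mu) ** 2 for v in values) / m
    return mu, sigma2

def gaussian_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def p(example, params):
    """p(x) as the product of independent per-feature Gaussian densities."""
    prob = 1.0
    for value, (mu, sigma2) in zip(example, params):
        prob *= gaussian_pdf(value, mu, sigma2)
    return prob

random.seed(1)
# Hypothetical normal engines: x1 centered at 2.5, new feature x2 centered at 1.0.
train = [(random.gauss(2.5, 1.0), random.gauss(1.0, 0.5)) for _ in range(500)]
params = [fit_gaussian([row[j] for row in train]) for j in range(2)]

anomaly = (2.5, 3.5)  # x1 and x2 values from the lecture example
epsilon = 0.01        # hypothetical threshold: flag an anomaly when p(x) < epsilon

p_x1_only = p(anomaly[:1], params[:1])  # model that only sees x1
p_x1_x2 = p(anomaly, params)            # model that also sees x2

print("x1 only:   p(x) =", p_x1_only, "flagged:", p_x1_only < epsilon)
print("x1 and x2: p(x) =", p_x1_x2, "flagged:", p_x1_x2 < epsilon)
```

With x1 alone the example sits at the peak of the fitted Gaussian and is missed; once x2 is included, its unusual x2 value drives p(x) below the threshold.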

Finally, let me share with you my thinking on how I usually go about choosing features for anomaly detection. So, usually, the way I think about choosing features is I want to choose features that will take on either very, very large values, or very, very small values, for examples that I think might turn out to be anomalies. So let's use our example again of monitoring the computers in a data center. And so you have lots of machines, maybe thousands, or tens of thousands of machines in a data center. And we want to know if one of the machines, one of our computers is acting up, so doing something strange. So here are examples of features you may choose, maybe memory used, number of disk accesses, CPU load, network traffic.

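Finally, here is how to choose features for anomaly detection. In general, choose features that take on very large or very small values for examples that might turn out to be anomalies. Return to the example of monitoring computers in a data center, which may hold thousands or tens of thousands of machines. To detect a machine that is behaving strangely, we use several features: x1 is memory used, x2 is the number of disk accesses, x3 is CPU load, and x4 is network traffic.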

But now, let's say that I suspect one of the failure cases, let's say that in my data set I think that CPU load and network traffic tend to grow linearly with each other. Maybe I'm running a bunch of web servers, and so, here if one of my servers is serving a lot of users, I have a very high CPU load, and have a very high network traffic. But let's say, I think, let's say I have a suspicion, that one of the failure cases is if one of my computers has a job that gets stuck in some infinite loop. So if I think one of the failure cases, is one of my machines, one of my web servers--server code-- gets stuck in some infinite loop, and so the CPU load grows, but the network traffic doesn't because it's just spinning its wheels and doing a lot of CPU work, you know, stuck in some infinite loop. In that case, to detect that type of anomaly, I might create a new feature, x5, which might be CPU load divided by network traffic. And so here x5 will take on an unusually large value if one of the machines has a very large CPU load but not that much network traffic and so this will be a feature that will help your anomaly detection capture a certain type of anomaly.

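Now consider one failure case. For web servers, CPU load and network traffic tend to grow linearly with each other: a server serving many users has both high CPU load and high network traffic. One failure case is a server whose job gets stuck in an infinite loop. Its CPU load grows but its network traffic does not, because it is just spinning its wheels. To detect this type of anomaly, design a new feature x5, defined as CPU load divided by network traffic. x5 takes on an unusually large value when a server has very high CPU load but little network traffic, so it helps capture servers stuck in an infinite loop.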

And you can also get creative and come up with other features as well. Like maybe I have a feature x6 that's CPU load squared divided by network traffic. And this would be another variant of a feature like x5 to try to capture anomalies where one of your machines has a very high CPU load, that maybe doesn't have a commensurately large network traffic. And by creating features like these, you can start to capture anomalies that correspond to unusual combinations of values of the features.

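You can also get creative and design other features. For example, x6 could be CPU load squared divided by network traffic. Like x5, x6 is meant to capture machines with a very high CPU load that is not matched by commensurately large network traffic. Features like these capture anomalies that correspond to unusual combinations of feature values.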
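The ratio features x5 and x6 are easy to compute from the raw readings. The Python sketch below uses made-up server names and readings, and assumes network traffic is strictly positive so the division is defined.

```python
# Hypothetical readings: (server name, CPU load, network traffic), both scaled to (0, 1].
servers = [
    ("web-1", 0.20, 0.20),
    ("web-2", 0.50, 0.55),
    ("web-3", 0.90, 0.85),
    ("web-4", 0.95, 0.05),  # stuck in a loop: high CPU load, little traffic
]

def ratio_features(cpu, net):
    """Return (x5, x6) = (cpu / net, cpu**2 / net); assumes net > 0."""
    return cpu / net, cpu ** 2 / net

features = {name: ratio_features(cpu, net) for name, cpu, net in servers}
for name, (x5, x6) in features.items():
    print(f"{name}: x5 = {x5:.2f}, x6 = {x6:.2f}")

# The server whose x5 is largest is the one with CPU load out of proportion to traffic.
most_unusual = max(features, key=lambda name: features[name][0])
print("largest x5:", most_unusual)
```

For the healthy servers x5 stays close to 1 because CPU load and traffic move together, while the stuck server stands out by an order of magnitude. Neither CPU load nor network traffic alone would separate it from a busy but healthy server such as web-3.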

So in this video we talked about how to take a feature, and maybe transform it a little bit, so that it becomes a bit more Gaussian, before feeding into an anomaly detection algorithm. And also the error analysis in this process of creating features to try to capture different types of anomalies. And with these sorts of guidelines hopefully that will help you to choose good features, to give to your anomaly detection algorithm, to help it capture all sorts of anomalies.

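This lecture covered how to transform a feature so that it looks more Gaussian before feeding it to an anomaly detection algorithm. It also covered how to use error analysis to design features that capture different types of anomalies. With good features, an anomaly detection algorithm can capture a wider range of anomalies.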

Andrew Ng's Machine Learning video lecture

Summary

First, plot a histogram of the data to check whether it looks Gaussian. If it does not, try different transformations to make it look more Gaussian before feeding it to the learning algorithm. The algorithm usually works even without this adjustment, but it may work somewhat better with it.

Several transformations can make data look more Gaussian. In Octave, the hist() command plots a histogram.

hist(x.^0.5, 50)

hist(x.^0.05, 50)

hist(log(x), 50)

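Second, we covered how to come up with features for an anomaly detection algorithm, mainly through error analysis, which closely resembles the error analysis procedure used for supervised learning. Train the algorithm with the current features and run it on the cross-validation set. Look at the anomalies the algorithm fails to flag and see whether they suggest a new feature. With the new feature, anomalies become easier to distinguish from normal examples.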

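Third, we covered another way to create new features: combine existing features so that a suspected failure mode stands out. For example, a web server in a data center may get stuck in an infinite loop, so its CPU load grows while its network traffic does not. To detect this type of anomaly, create a new feature equal to CPU load divided by network traffic.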

Quiz

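An anomaly detection algorithm is performing poorly. On the cross-validation set, it outputs a very large p(x) for many anomalous examples as well as many normal examples. How should you change the algorithm?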

The answer is option 2.
