by 라인하트 Dec 01. 2020

Andrew Ng's Machine Learning (15-1): Anomaly Detection Problem Overview

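Professor Andrew Ng, co-founder of the online course platform Coursera, is a leading figure in the artificial intelligence field. The introductory machine learning course he taught at Stanford University is available for free as an online course on Coursera (Coursera.org). The course is essential for machine learning beginners, and people studying AI and machine learning on their own naturally come across it.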

Anomaly Detection

Density Estimation

Problem Motivation

In this next set of videos, I'd like to tell you about a problem called Anomaly Detection. This is a reasonably commonly used type of machine learning. And one of the interesting aspects is that it's mainly an unsupervised problem, but there are some aspects of it that are also very similar to the supervised learning problem.

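Starting with this lecture, we cover the anomaly detection problem. Anomaly detection is a fairly common application of machine learning. It is mainly an unsupervised learning problem, but some aspects of it resemble supervised learning.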

So, what is anomaly detection? To explain it, let me use a motivating example: imagine that you're a manufacturer of aircraft engines, and let's say that as your aircraft engines roll off the assembly line, you're doing, you know, QA or quality assurance testing, and as part of that testing you measure features of your aircraft engine, like maybe the heat generated, things like the vibrations and so on. I had some friends that worked on this problem a long time ago, and these were actually the sorts of features that they were collecting off actual aircraft engines. So you now have a dataset of x^(1) through x^(m), if you have manufactured m aircraft engines, and if you plot your data, maybe it looks like this. So, each point here, each cross here, is one of your unlabeled examples. So, the anomaly detection problem is the following.

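What is anomaly detection? Let's take an example. Suppose you work for an aircraft engine manufacturer and run quality assurance (QA) tests on engines as they come off the assembly line. During testing you measure features of each engine, such as the heat it generates and its vibration. If you have manufactured m engines, the measurements collected from the actual engines form a dataset x^(1), x^(2), ..., x^(m), which can be plotted as in the figure. Each point is an unlabeled training example.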

Let's say that on, you know, the next day, you have a new aircraft engine that rolls off the assembly line and your new aircraft engine has some set of features x-test. What the anomaly detection problem is, we want to know if this aircraft engine is anomalous in any way. In other words, we want to know if, maybe, this engine should undergo further testing, or if it looks like an okay engine, and so it's okay to just ship it to a customer without further testing. So, if your new aircraft engine looks like a point over there, well, you know, that looks a lot like the aircraft engines we've seen before, and so maybe we'll say that it looks okay. Whereas, if your new aircraft engine, if x-test, you know, were a point that were out here, where x1 and x2 are the features of this new example, if x-test were all the way out there, then we would call that an anomaly and maybe send that aircraft engine for further testing before we ship it to a customer, since it looks very different from the rest of the aircraft engines we've seen before.

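Now a new aircraft engine has come off the assembly line, and its QA measurements are x_test. The anomaly detection problem is to decide whether this engine is normal or anomalous, that is, whether it needs further testing. An engine that looks fine is shipped to the customer without further testing, while one that looks anomalous is tested further before shipping. If x_test falls where the existing engines are, the engine looks fine. If x_test falls far away from them, the engine is an anomaly. In short, we evaluate whether the new measurements resemble those of the engines we have seen before.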

More formally, in the anomaly detection problem, we're given some dataset, x^(1) through x^(m), of examples, and we usually assume that these m examples are normal or non-anomalous examples, and we want an algorithm to tell us if some new example x-test is anomalous. The approach that we're going to take is that given this training set, given the unlabeled training set, we're going to build a model for p of x. In other words, we're going to build a model for the probability of x, where x are these features of, say, aircraft engines.

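Let's state the anomaly detection problem more formally. We are given an unlabeled dataset x^(1), x^(2), ..., x^(m), and we usually assume that these training examples are normal (non-anomalous). To let the algorithm decide whether a new example x_test is anomalous, we build a model p(x) from the training set. In other words, we build a model of the probability of the features x of, say, aircraft engines.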

And so, having built a model of the probability of x, we're then going to say that for the new aircraft engine, if p of x-test is less than some epsilon then we flag this as an anomaly. So if we see a new engine that, you know, has very low probability under a model p of x that we estimate from the data, then we flag this as an anomaly, whereas if p of x-test is, say, greater than or equal to some small threshold, then we say that, you know, okay, it looks okay. And so, given the training set, like that plotted here, if you build a model, hopefully the model p of x will say that points that lie, you know, somewhere in the middle have pretty high probability, whereas points a little bit further out have lower probability. Points that are even further out have somewhat lower probability, and the point that's way out here, the point that's way out there, would be an anomaly. Whereas the point that's way in there, right in the middle, this would be okay because p of x right in the middle of that would be very high, because we've seen a lot of points in that region.

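If the probability p(x_test) of a new engine's measurements x_test is very low, the engine is an anomaly.

If p(x_test) is less than epsilon (ε), we flag an anomaly; if p(x_test) is greater than or equal to ε, the example is considered normal. A model built from the existing engine data should assign high probability to the region where those points cluster. When x lies in the middle of the existing data, p(x) is very high. When x lies a little further out, p(x) is lower, and p(x) keeps falling as x moves further from the center. This is because most of the normal engines seen so far lie in the middle region.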
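The decision rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lecture has not yet said how p(x) is modeled, so the sketch assumes an independent Gaussian per feature (the distribution announced for the next lecture), and the training data, the function names, and the value of EPSILON are all made up for the example.

```python
import math
from statistics import mean, pvariance


def fit_gaussian(train):
    """Estimate a mean and a (population) variance for each feature column."""
    columns = list(zip(*train))
    mus = [mean(col) for col in columns]
    variances = [pvariance(col) for col in columns]
    return mus, variances


def density(x, mus, variances):
    """p(x) as a product of independent one-dimensional Gaussian densities (an assumption)."""
    p = 1.0
    for xi, mu, var in zip(x, mus, variances):
        p *= math.exp(-((xi - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)
    return p


def is_anomaly(x, mus, variances, epsilon):
    """Flag x when its estimated density falls below the threshold epsilon."""
    return density(x, mus, variances) < epsilon


# Hypothetical (heat, vibration) measurements for engines assumed to be normal.
train = [(1.0, 2.0), (1.2, 2.1), (0.9, 1.9), (1.1, 2.2), (0.8, 1.8), (1.0, 2.0)]
mus, variances = fit_gaussian(train)
EPSILON = 1e-3  # illustrative threshold, not a recommended value

near = (1.0, 2.0)  # sits in the middle of the training points
far = (3.0, 0.5)   # sits far away from every training point

print(is_anomaly(near, mus, variances, EPSILON))
print(is_anomaly(far, mus, variances, EPSILON))
```

The point in the middle of the training data gets a high density and is not flagged, while the distant point gets a density far below the threshold and is flagged.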

Here are some examples of applications of anomaly detection. Perhaps the most common application of anomaly detection is actually for fraud detection. If you have many users, and if each of your users takes different actions, you know, maybe on your website or in a physical plant or something, you can compute features of the different users' activities. And what you can do is build a model to say, you know, what is the probability of different users behaving in different ways. What is the probability of a particular vector of features of a user's behavior? So, you know, examples of features of a user's activity on the website may be things like: maybe x1 is how often this user logs in, x2, you know, maybe the number of web pages visited, or the number of transactions, maybe x3 is, you know, the number of posts of the user on the forum, and feature x4 could be the typing speed of the user, and some websites can actually track that, what is the typing speed of this user in characters per second. And so you can model p of x based on this sort of data.

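Here are some applications of anomaly detection. The most common one is catching unusual behavior among the many users of a website, a plant, or a campus. You build a model of the probability of users' activity: what is the probability of a particular feature vector of a user's behavior? Take a website as an example. x1 could be how often the user logs in, x2 the number of pages visited or the number of transactions, x3 the number of posts the user has made on the forum, and x4 the user's typing speed. Some websites do in fact track typing speed in characters per second. You then model p(x) from this kind of data.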

And finally, having your model p of x, you can try to identify users that are behaving very strangely on your website by checking which ones have p of x less than epsilon, and maybe send the profiles of those users for further review. Or demand additional identification from those users, or some such, to guard against, you know, strange behavior or fraudulent behavior on your website. This sort of technique will tend to flag the users that are behaving unusually, not just users that may be behaving fraudulently. So not just accounts that have been stolen or users that are trying to do funny things, but just unusual users. But this is actually the technique that is used by many online websites that sell things to try to identify users behaving strangely, which might be indicative of either fraudulent behavior or of computer accounts that have been stolen.

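To identify users who behave very strangely on the website, you check which users have p(x) less than epsilon (ε). You can send those users' profiles for further review, demand additional identification, or take similar measures to guard against strange or fraudulent behavior on the site. This technique tends to flag users who behave unusually, not only those who behave fraudulently: it finds unusual users in general, not just stolen accounts or users trying to do something suspicious. Many online shopping sites use this technique to spot strange behavior that may indicate fraud or a stolen account.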
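Before p(x) can be modeled, each user's raw activity has to be turned into a feature vector such as (x1, x2, x3, x4). The sketch below shows one way this might be done. The event format (dictionaries with "user", "type", "chars", and "seconds" keys) and the function name user_features are assumptions made for this example; a real site would have its own log schema.

```python
from collections import defaultdict


def user_features(events):
    """Return {user: [x1, x2, x3, x4]} built from a list of activity events.

    x1 = number of logins, x2 = number of page views,
    x3 = number of forum posts, x4 = typing speed in characters per second.
    """
    totals = defaultdict(
        lambda: {"logins": 0, "pages": 0, "posts": 0, "chars": 0, "seconds": 0.0}
    )
    for event in events:
        t = totals[event["user"]]
        if event["type"] == "login":
            t["logins"] += 1
        elif event["type"] == "page_view":
            t["pages"] += 1
        elif event["type"] == "post":
            t["posts"] += 1
        elif event["type"] == "typing":
            t["chars"] += event["chars"]
            t["seconds"] += event["seconds"]

    features = {}
    for user, t in totals.items():
        # Users with no typing data get a speed of 0.0 (a choice made for this sketch).
        speed = t["chars"] / t["seconds"] if t["seconds"] > 0 else 0.0
        features[user] = [t["logins"], t["pages"], t["posts"], speed]
    return features


# Hypothetical activity log.
events = [
    {"user": "alice", "type": "login"},
    {"user": "alice", "type": "page_view"},
    {"user": "alice", "type": "page_view"},
    {"user": "alice", "type": "page_view"},
    {"user": "alice", "type": "post"},
    {"user": "alice", "type": "typing", "chars": 120, "seconds": 30.0},
    {"user": "alice", "type": "login"},
    {"user": "bob", "type": "login"},
    {"user": "bob", "type": "page_view"},
]

features = user_features(events)
print(features["alice"])
print(features["bob"])
```

The resulting vectors are what a density model p(x) would be fitted to; users whose vector has p(x) below ε would then be sent for review.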

Another example of anomaly detection is manufacturing. So, we already talked about the aircraft engine thing, where you can find unusual, say, aircraft engines and send those for further review.

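The second application is manufacturing, where anomaly detection catches unusual products during quality assurance testing. We already covered the case of finding unusual aircraft engines and sending them for further review.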

A third application would be monitoring computers in a data center. I actually have some friends who work on this too. So if you have a lot of machines in a computer cluster or in a data center, we can do things like compute features at each machine. So maybe some features capturing, you know, how much memory is used, number of disk accesses, CPU load. As well as more complex features like: what is the CPU load on this machine divided by the amount of network traffic on this machine? Then given the dataset of how your computers in your data center usually behave, you can model the probability of x, so you can model the probability of these machines having different amounts of memory use, or the probability of these machines having different numbers of disk accesses or different CPU loads and so on. And if you ever have a machine whose probability of x, p of x, is very small, then you know that machine is behaving unusually and maybe that machine is about to go down, and you can flag that for review by a system administrator. And this is actually being used today by various data centers to watch out for unusual things happening on their machines.

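The third application is monitoring computers in a data center. To detect anomalies among the servers in a computer cluster or data center, you compute features for each server. Typical features are memory usage, the number of disk accesses, and CPU load. You can also create more complex features, such as the CPU load divided by the server's network traffic. From data on how the servers usually behave, you model the probability p(x) of these features. If p(x) for a particular server is less than epsilon, the server is judged to be behaving unusually and is flagged for review by a system administrator. Data centers use anomaly detection today to watch for unusual things happening on their servers.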
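The derived feature mentioned above, CPU load divided by network traffic, can be sketched as follows. The function name, the units, the sample numbers, and the min_traffic floor that avoids division by zero are all assumptions made for this illustration.

```python
def machine_features(memory_mb, disk_accesses, cpu_load, network_traffic, min_traffic=1.0):
    """Return [x1, x2, x3, x4] for one server.

    x4 is the ratio CPU load / network traffic. Traffic below min_traffic is
    clamped to min_traffic so an idle link does not cause a division by zero
    (a choice made for this sketch).
    """
    ratio = cpu_load / max(network_traffic, min_traffic)
    return [memory_mb, disk_accesses, cpu_load, ratio]


# Hypothetical readings: a busy server doing useful work, and a server that
# burns CPU while serving almost no traffic.
healthy = machine_features(memory_mb=4096, disk_accesses=250, cpu_load=0.6, network_traffic=300.0)
stuck = machine_features(memory_mb=4100, disk_accesses=240, cpu_load=0.95, network_traffic=2.0)

print(healthy[3])
print(stuck[3])
```

Taken one at a time, the CPU load of the second server is high but not unheard of; the ratio makes the mismatch between load and traffic stand out, which is why a combined feature like this can help a density model separate the two cases.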

So, that's anomaly detection. In the next video, I'll talk a bit about the Gaussian distribution and review properties of the Gaussian probability distribution, and in videos after that, we will apply it to develop an anomaly detection algorithm.

That covers the anomaly detection problem. The next lecture reviews the Gaussian distribution, and later lectures apply it to develop an anomaly detection algorithm.

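Summary

The anomaly detection problem is to detect whether some behavior or object is normal or anomalous. Anomaly detection algorithms use unlabeled training examples, as in unsupervised learning. After building a probability model p(x) from the dataset, a new example x_test is flagged as an anomaly if p(x_test) is less than epsilon (ε), and treated as normal if p(x_test) is greater than or equal to ε.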

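There are several applications of anomaly detection.

The first application is detecting unusual user behavior. Among the users of a website or a plant, you look for activity that differs from the norm. First, you build a model of the probability of users' activity. Typical features are x1, how often the user logs in; x2, the number of pages visited or the number of transactions; x3, the number of posts the user has made on the forum; and x4, the user's typing speed. The algorithm evaluates p(x) and refers users with unusual behavior for further review.

The second application is quality assurance testing in manufacturing, for example checking whether a newly assembled aircraft engine is anomalous. Typical features are x1, the heat the engine generates, and x2, its vibration. When p(x) is low, the algorithm judges the engine to be anomalous and requests further review.

The last application is monitoring computers in a data center. When a computer cluster or data center has many servers, you can compute features for each one. Typical features are x1, memory usage; x2, the number of disk accesses; and x3, CPU load. A more complex feature x4 could be the CPU load divided by the server's network traffic. When p(x) is low, the algorithm judges that the server may be about to go down and flags it for review by a system administrator.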

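Quiz

The anomaly detection system flags an anomaly when p(x) is less than epsilon (ε). Suppose the system flags too many examples as anomalous, which is similar to false positives in supervised learning. How can you fix this?

The answer is option 2.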
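As a rough illustration of the quiz setting, the sketch below counts how many examples are flagged for several values of ε. Because an example is flagged only when p(x) < ε, lowering ε can only shrink the set of flagged examples. The density values and the function name count_flags are invented for this example and do not come from the lecture.

```python
def count_flags(densities, epsilon):
    """Count the examples whose density p(x) is below the threshold epsilon."""
    return sum(1 for p in densities if p < epsilon)


# Hypothetical p(x) values for seven examples.
densities = [0.30, 0.12, 0.05, 0.04, 0.02, 0.008, 0.001]

for epsilon in (0.1, 0.03, 0.005):
    print(epsilon, count_flags(densities, epsilon))
```

With these made-up numbers, the count of flagged examples drops as ε decreases, at the cost of possibly missing true anomalies whose density lies just above the new threshold.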
