by 라인하트 Dec 02. 2020

Andrew Ng's Machine Learning (15-2): Gaussian Distribution

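Professor Andrew Ng, co-founder of the online learning platform Coursera, is a leading figure in the AI field. The machine learning course he taught to beginners at Stanford University is available for free on Coursera (Coursera.org). It is an essential course for machine learning beginners, and one that anyone studying AI and machine learning on their own naturally comes across.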

Anomaly Detection

Density Estimation

Gaussian Distribution

In this video, I'd like to talk about the Gaussian distribution which is also called the normal distribution. In case you're already intimately familiar with the Gaussian distribution, it's probably okay to skip this video, but if you're not sure or if it has been a while since you've worked with the Gaussian distribution or normal distribution then please do watch this video all the way to the end. And in the video after this we'll start applying the Gaussian distribution to developing an anomaly detection algorithm.

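This lecture covers the Gaussian distribution, also called the normal distribution. If you are already familiar with the Gaussian distribution, you can skip this lecture; if not, it will be very useful. The Gaussian distribution is introduced here because the next lecture uses it to develop an anomaly detection algorithm.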

Let's say x is a real-valued random variable, so x is a real number. If the probability distribution of x is Gaussian with mean mu and variance sigma squared, then we'll write this as x, the random variable, tilde (this little tilde means "is distributed as"), and then to denote a Gaussian distribution, sometimes I'm going to write script N, parentheses, mu comma sigma squared. So this script N stands for normal, since Gaussian and normal mean the same thing; they are synonyms. And the Gaussian distribution is parameterized by two parameters: a mean parameter, which we denote mu, and a variance parameter, which we denote sigma squared.

If we plot the Gaussian distribution, or Gaussian probability density, it'll look like the bell-shaped curve which you may have seen before. And so this bell-shaped curve is parameterized by those two parameters, mu and sigma. The location of the center of this bell-shaped curve is the mean mu. And the width of this bell-shaped curve, roughly that, is this parameter sigma, which is also called one standard deviation. And so this specifies the probability of x taking on different values. So x taking on values here in the middle is pretty likely, since the Gaussian density here is pretty high, whereas x taking on values further and further away will be diminishing in probability.

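The Gaussian distribution of x is written as follows.

x ~ N(μ, σ^2)

Gaussian distribution and normal distribution are synonyms. The Gaussian distribution uses a parameter μ for the mean and a parameter σ^2 for the variance. The variance is the square of the standard deviation σ.

The Gaussian distribution, or Gaussian probability density, is drawn as a bell-shaped curve. The curve is determined by the two parameters μ and σ: μ sets the location of the center of the curve and σ sets its width. The area under the curve over a range of values is the probability that x falls in that range. The density is highest at the mean μ, so values of x near μ are the most probable, and the probability decreases as x moves away from μ.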
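To make the link between area and probability concrete, here is a minimal sketch that computes the probability of x falling in an interval. It assumes the standard identity relating the Gaussian cumulative distribution to the error function; the helper names `gaussian_cdf` and `interval_probability` are made up for this illustration and do not come from the lecture.

```python
import math

def gaussian_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def interval_probability(a, b, mu=0.0, sigma=1.0):
    """Area under the Gaussian density between a and b."""
    return gaussian_cdf(b, mu, sigma) - gaussian_cdf(a, mu, sigma)

# Two intervals of the same width: one around the mean, one further out.
near = interval_probability(-1.0, 1.0)  # within one sigma of mu = 0
far = interval_probability(2.0, 4.0)    # two to four sigmas away

print(f"P(-1 <= x <= 1) = {near:.4f}")
print(f"P( 2 <= x <= 4) = {far:.4f}")
```

The interval centered on the mean carries far more probability than an equally wide interval out in the tail, which is the behavior the bell shape describes.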

Finally, just for completeness, let me write out the formula for the Gaussian distribution. So the probability of x, and I'll sometimes write this as p(x) and sometimes as p(x; mu, sigma squared), and so this denotes that the probability of x is parameterized by the two parameters mu and sigma squared. And the formula for the Gaussian density is this: 1 over (root 2 pi times sigma), times e to the power of minus (x minus mu) squared over 2 sigma squared. So there's no need to memorize this formula. This is just the formula for the bell-shaped curve over here on the left. There's no need to memorize it, and if you ever need to use this, you can always look this up. And so that figure on the left, that is what you get if you take a fixed value of mu and take a fixed value of sigma, and you plot p(x), so this curve here. This is really p(x) plotted as a function of x for a fixed value of mu and of sigma squared. And by the way, sometimes it's easier to think in terms of sigma squared, that's called the variance. And sometimes it's easier to think in terms of sigma. So sigma is called the standard deviation, and so it specifies the width of this Gaussian probability density, whereas the square of sigma, or sigma squared, is called the variance.

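The formula for the Gaussian distribution of a single variable x is as follows.

p(x; μ, σ^2) = 1 / (√(2π) σ) · exp(−(x − μ)^2 / (2σ^2))

There is no need to memorize this formula; you can look it up here whenever you need it. The formula describes the bell-shaped curve.

The graph on the left plots p(x) as a function of x for fixed values of μ and σ^2. Once μ and σ^2 are known, the Gaussian distribution can be plotted. μ is the center of the Gaussian probability density and σ determines its width. σ is called the standard deviation and its square, σ^2, is called the variance; sometimes it is easier to think in terms of one and sometimes the other.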
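The density formula from the lecture translates almost directly into code. The following is a minimal sketch using only the Python standard library; the function name `gaussian_pdf` and the choice to pass the variance (rather than the standard deviation) as the third argument are assumptions made for this illustration.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Gaussian density p(x; mu, sigma^2) for a single real-valued x."""
    sigma = math.sqrt(sigma2)                        # standard deviation
    normalizer = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    exponent = -((x - mu) ** 2) / (2.0 * sigma2)
    return normalizer * math.exp(exponent)

# Evaluate the density at the mean and one standard deviation to each side.
for x in (-1.0, 0.0, 1.0):
    print(f"p({x:+.1f}; mu=0, sigma^2=1) = {gaussian_pdf(x, 0.0, 1.0):.4f}")
```

Evaluating the function on a grid of x values for fixed μ and σ^2 gives the points of the bell-shaped curve.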

Let's look at some examples of what the Gaussian distribution looks like. If mu is equal to zero and sigma equals one, then that corresponds to a Gaussian distribution that is centered at zero, since mu is zero, and the width of this Gaussian, so that's one standard deviation, is controlled by the parameter sigma.

Here's another example. With the same mu equal to 0 and sigma equal to 0.5, the standard deviation is 0.5 and the variance sigma squared would therefore be the square of 0.5, which is 0.25. In that case the Gaussian distribution, the Gaussian probability density, looks like this. It is also centered at zero. But now the width of this is much smaller, because with the smaller standard deviation the width of this Gaussian density is roughly half as wide. But because this is a probability distribution, the area under the curve, that's the shaded area there, must integrate to one; this is a property of probability distributions. So this is a much taller Gaussian density: because it is half as wide, with half the standard deviation, it is twice as tall.

Another example: if sigma is equal to 2, then you get a much fatter, much wider Gaussian density, and so here the sigma parameter controls that the Gaussian distribution has a wider width. And once again, the area under the curve, that is the shaded area, will always integrate to one; that's the property of probability distributions. And because it's twice as wide, it's also half as tall in order to still integrate to the same thing.

And finally one last example would be if we now change the mu parameters as well. Then instead of being centered at 0 we now have a Gaussian distribution that's centered at 3 because this shifts over the entire Gaussian distribution.

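Let's look at a few examples of the shape of the Gaussian distribution.

The top-left graph is the Gaussian probability density with μ = 0 and σ = 1. It is a Gaussian distribution centered at 0. μ is the value of x at the highest point of the graph, and σ sets the width of the graph, indicating how far the values spread from μ. Since σ = 1 in the top-left graph, σ^2 = 1.

The top-right graph is the Gaussian probability density with μ = 0 and σ = 0.5. Since the standard deviation σ is 0.5, σ^2 is 0.25. The density is centered at x = 0 and is much narrower, roughly half the width of the top-left graph. The total area under a probability density is always 1; this is a property of probability distributions. So with half the standard deviation, the curve is twice as tall as the top-left graph.

The bottom-left graph is the Gaussian probability density with μ = 0 and σ = 2. The graph is spread much more widely and is correspondingly lower, because the area under the curve must still equal 1.

The bottom-right graph is the Gaussian probability density with μ = 3 and σ = 0.5. It has the same shape as the top-right graph, but its center moves from 0 to 3, because the mean μ is 3 rather than 0.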
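The four examples can be checked numerically. The sketch below computes the peak height and the area under the curve for each (μ, σ) pair; the midpoint-rule integrator `area_under_curve`, its integration range of ten standard deviations either side of the mean, and the step count are all assumptions chosen for this illustration. Note that `gaussian_pdf` here takes the standard deviation σ rather than the variance.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (
        math.sqrt(2.0 * math.pi) * sigma
    )

def area_under_curve(mu, sigma, half_width=10.0, steps=20000):
    """Midpoint-rule integral of the density over mu +/- half_width * sigma."""
    lo = mu - half_width * sigma
    hi = mu + half_width * sigma
    dx = (hi - lo) / steps
    return sum(gaussian_pdf(lo + (i + 0.5) * dx, mu, sigma) for i in range(steps)) * dx

# (mu, sigma) for the four graphs described in the lecture.
panels = [(0.0, 1.0), (0.0, 0.5), (0.0, 2.0), (3.0, 0.5)]
results = {}
for mu, sigma in panels:
    peak = gaussian_pdf(mu, mu, sigma)   # the density is highest at x = mu
    area = area_under_curve(mu, sigma)
    results[(mu, sigma)] = (peak, area)
    print(f"mu={mu}, sigma={sigma}: peak={peak:.4f}, area={area:.6f}")
```

Halving σ doubles the peak and doubling σ halves it, while the area stays at 1 in every case; changing μ moves the curve without changing its height.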

Next, let's talk about the parameter estimation problem. So what's the parameter estimation problem? Let's say we have a dataset of m examples, so x(1) through x(m), and let's say each of these examples is a real number. Here in the figure I've plotted an example of the dataset, so the horizontal axis is the x axis and I have a range of examples of x, and I've just plotted them on this figure here. And the parameter estimation problem is, let's say I suspect that these examples came from a Gaussian distribution. So let's say I suspect that each of my examples, x(i), was distributed, that's what this tilde thing means, according to a normal distribution, or Gaussian distribution, with some parameter mu and some parameter sigma squared. But I don't know what the values of these parameters are. The problem of parameter estimation is, given my dataset, I want to estimate what the values of mu and sigma squared are.

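Next comes the parameter estimation problem. What is the parameter estimation problem? Suppose we have a dataset of m examples, x^(1), x^(2), ..., x^(m), each of which is a real number. The figure plots the examples in the dataset along the horizontal x axis. Suppose we suspect that each example x^(i) was drawn from a Gaussian distribution, x^(i) ~ N(μ, σ^2), but we do not know the values of the parameters μ and σ^2. The parameter estimation problem is to estimate μ and σ^2 from the dataset.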

So if you're given a dataset like this, it looks like maybe if I estimate what Gaussian distribution the data came from, maybe that might be roughly the Gaussian distribution it came from, with mu being the center of the distribution and sigma, the standard deviation, controlling the width of this Gaussian distribution. Seems like a reasonable fit to the data. Because, you know, it looks like the data has a very high probability of being in the central region, a low probability of being further out, an even lower probability of being further out still, and so on. So maybe this is a reasonable estimate of mu and sigma squared. That is, if it corresponds to a Gaussian distribution function that looks like this. So what I'm going to do is just write out the standard formulas for estimating the parameters mu and sigma squared. The way we're going to estimate mu is going to be just the average of my examples. So mu is the mean parameter. Just take my training set, take my m examples and average them. And that just gives the center of this distribution. How about sigma squared? Well, the variance, I'll just write out the standard formula again, I'm going to estimate as one over m times the sum over i equals one through m of (x(i) minus mu) squared. And so this mu here is actually the mu that I compute over here using this formula. And one interpretation of the variance is that if you look at this term, that's the squared difference between the value I got in my example and the mean, the center of the distribution. And so the variance I'm going to estimate as just the average of the squared differences between my examples and the mean.

And as a side comment, only for those of you that are experts in statistics. If you're an expert in statistics, and if you've heard of maximum likelihood estimation, then these estimates are actually the maximum likelihood estimates of the parameters mu and sigma squared. But if you haven't heard of that before, don't worry about it; all you need to know is that these are the two standard formulas for how to figure out what mu and sigma squared are, given the dataset.

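Given a dataset, we estimate the parameters of the Gaussian distribution it came from. The mean μ is the center of the Gaussian distribution and the standard deviation σ determines its width. The curve in the figure looks like a reasonable fit to the data: the data is very likely to fall in the central region and becomes less likely further from the center. So this is a reasonable estimate of the mean μ and the standard deviation σ. The formulas for estimating the parameters of the Gaussian distribution are as follows.

μ = (1/m) Σ_{i=1..m} x^(i)

σ^2 = (1/m) Σ_{i=1..m} (x^(i) − μ)^2

μ is the average of the m examples in the training set and the center of the Gaussian distribution. σ^2 is the average of the squared differences between each example and the mean.

For those who are statistics experts or know maximum likelihood estimation: these formulas are in fact the maximum likelihood estimates of μ and σ^2. If you have not heard of maximum likelihood estimation, there is no need to worry. Given a dataset, applying these two formulas gives the values of the parameters μ and σ^2.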
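The two estimation formulas from the lecture (the average of the examples, and the average squared difference from that average) can be sketched as follows. The function name `estimate_gaussian`, the synthetic dataset, its true parameters (μ = 3, σ = 0.5), and the random seed are all assumptions made for this illustration.

```python
import random

def estimate_gaussian(xs):
    """Estimate (mu, sigma^2) from a list of real numbers using the 1/m formulas."""
    m = len(xs)
    mu = sum(xs) / m                                  # average of the examples
    sigma2 = sum((x - mu) ** 2 for x in xs) / m       # average squared difference
    return mu, sigma2

# Synthetic data drawn from a known Gaussian, so the estimates can be compared
# against the parameters that generated it.
rng = random.Random(0)
data = [rng.gauss(3.0, 0.5) for _ in range(10000)]

mu_hat, sigma2_hat = estimate_gaussian(data)
print(f"estimated mu      = {mu_hat:.3f}  (data generated with mu = 3.0)")
print(f"estimated sigma^2 = {sigma2_hat:.3f}  (data generated with sigma^2 = 0.25)")
```

With a dataset this large, both estimates should land close to the parameters used to generate the data.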

Finally, one last side comment, again only for those of you that have maybe taken a statistics class before. If you've taken a statistics class before, some of you may have seen the formula here where this is m-1 instead of m, so this first term becomes 1/(m-1) instead of 1/m. In machine learning people tend to use the 1/m formula, but in practice whether it is 1/m or 1/(m-1) makes essentially no difference, assuming m is reasonably large, a reasonably large training set size. So just in case you've seen this other version before: either version works just about equally well, but in machine learning most people tend to use 1/m in this formula. And the two versions have slightly different theoretical properties, slightly different mathematical properties, but in practice it really makes very little difference, if any.

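Finally, one last comment for those who have taken a statistics class. Elsewhere, the formula for σ^2 is sometimes written with 1/(m-1) in place of 1/m. Machine learning practitioners tend to use 1/m. In practice, as long as m is reasonably large, it makes essentially no difference whether 1/m or 1/(m-1) is used. The two versions have slightly different mathematical properties, but both work about equally well.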
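To see how little the choice matters once m is large, the sketch below compares the two versions on samples of increasing size. It relies on the standard library's `statistics.pvariance` (which divides by m) and `statistics.variance` (which divides by m-1); the helper name `variance_estimates`, the sample sizes, and the random seed are assumptions made for this illustration.

```python
import random
import statistics

def variance_estimates(xs):
    """Return (1/m estimate, 1/(m-1) estimate) of the variance of xs."""
    return statistics.pvariance(xs), statistics.variance(xs)

rng = random.Random(1)  # fixed seed, an arbitrary choice for repeatability
for m in (5, 50, 5000):
    xs = [rng.gauss(0.0, 1.0) for _ in range(m)]
    var_m, var_m1 = variance_estimates(xs)
    print(f"m={m:5d}  1/m: {var_m:.4f}  1/(m-1): {var_m1:.4f}  "
          f"ratio: {var_m / var_m1:.4f}")
```

The ratio between the two estimates is always (m-1)/m, so it approaches 1 as the training set grows.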

So, hopefully you now have a good sense of what the Gaussian distribution looks like, as well as how to estimate the parameters mu and sigma squared of Gaussian distribution if you're given a training set, that is if you're given a set of data that you suspect comes from a Gaussian distribution with unknown parameters, mu and sigma squared. In the next video, we'll start to take this and apply it to develop an anomaly detection algorithm.

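This lecture covered the Gaussian distribution: the shape of the Gaussian probability density function and how to estimate the parameters μ and σ^2 of a Gaussian distribution from a training set. The next lecture uses the Gaussian distribution to develop an anomaly detection algorithm.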

Andrew Ng's Machine Learning video lecture

Summary - Normal Distribution

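The normal distribution is also known as the Gaussian distribution. It is described by a probability density function, defined as follows.

p(x; μ, σ^2) = 1 / (√(2π) σ) · exp(−(x − μ)^2 / (2σ^2))

The Gaussian distribution has three characteristics. First, it is symmetric about the mean μ of the data. Second, σ determines the width of the graph and therefore its shape; for example, two distributions with different μ but the same σ differ only in position, not in shape. Third, the area under the curve of any Gaussian distribution is 1.

Therefore, to plot a Gaussian distribution x ~ N(μ, σ^2) in machine learning, we need to determine the values of μ and σ^2. Given a dataset of m examples, they are estimated as follows.

μ = (1/m) Σ_{i=1..m} x^(i)

σ^2 = (1/m) Σ_{i=1..m} (x^(i) − μ)^2

These formulas are easier to understand if you know maximum likelihood estimation. Maximum likelihood estimation starts from the idea that "an event that occurred is an event that occurs often." For example, if 10 draws from a jar of black and white marbles yield 4 black and 6 white marbles, we assume this happened because the jar holds more white marbles than black ones, not because white marbles happened to come up more often by chance. Likelihood measures how probable the observed result is under a given hypothesis, which makes it a yardstick for comparing candidate hypotheses. The maximum likelihood is the largest of the likelihood values measured across the hypotheses. In other words, maximum likelihood estimation finds the parameter values that make the probability p(x) of the observed data as large as possible.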
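The marble example can be worked through as a small search over hypotheses. The sketch below assumes each draw is independent with the same chance of white (as if each marble were returned to the jar before the next draw), so the likelihood of 6 white marbles in 10 draws is binomial; the function name `likelihood` and the grid of candidate proportions are assumptions made for this illustration.

```python
from math import comb

def likelihood(p_white, whites=6, draws=10):
    """Probability of seeing `whites` white marbles in `draws` independent draws,
    under the hypothesis that a fraction p_white of the jar is white."""
    blacks = draws - whites
    return comb(draws, whites) * p_white ** whites * (1.0 - p_white) ** blacks

# Candidate hypotheses: the white fraction is 0.01, 0.02, ..., 0.99.
candidates = [i / 100 for i in range(1, 100)]
best = max(candidates, key=likelihood)

print(f"most likely white fraction: {best:.2f}")
print(f"likelihood under that hypothesis: {likelihood(best):.4f}")
print(f"likelihood if the jar were half white: {likelihood(0.5):.4f}")
```

The hypothesis with the highest likelihood matches the observed fraction of white marbles, just as the Gaussian estimates match the sample average and the average squared difference from it.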

Quiz

This question is about the formula for the Gaussian density. Which Gaussian distribution formula matches the graph below?

The answer is option 3.
