by 라인하트 Oct 29. 2020

Andrew Ng's Machine Learning (9-5): Gradient Checking

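Professor Andrew Ng, co-founder of the online learning platform Coursera, is a leading figure in the artificial intelligence field. The machine learning course he taught to beginners at Stanford University is available for free on Coursera (Coursera.org). It is an essential course for newcomers to machine learning, and one that anyone studying artificial intelligence and machine learning on their own will naturally come across.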

Neural Networks: Learning

Backpropagation in Practice

Gradient Checking

In the last few videos we talked about how to do forward propagation and back propagation in a neural network in order to compute derivatives. But back prop as an algorithm has a lot of details and can be a little bit tricky to implement. And one unfortunate property is that there are many ways to have subtle bugs in back prop.

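In the previous lectures we studied how to run forward propagation and backpropagation in a neural network to compute the derivatives of the cost function. Backpropagation has many details, is somewhat tricky to implement, and leaves plenty of room for subtle bugs.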

So if you run it with gradient descent or some other optimization algorithm, it could actually look like it's working. Your cost function, J of theta, may end up decreasing on every iteration of gradient descent, and this can be true even though there is a bug in your implementation of back prop. So it looks like J of theta is decreasing, but you might just wind up with a neural network that has a higher level of error than you would with a bug-free implementation. And you might just not know that there was this subtle bug giving you worse performance. So, what can we do about this? There's an idea called gradient checking that eliminates almost all of these problems. Today, every time I implement back propagation or a similar gradient descent algorithm on a neural network or any other reasonably complex model, I always implement gradient checking. If you do this, it will help you make sure, and gain high confidence, that your implementation of forward prop and back prop or whatever is 100% correct. And from what I've seen, this pretty much eliminates all the problems associated with a buggy implementation of back prop.

In the previous videos I asked you to take on faith that the formulas I gave for computing the deltas and the D's and so on actually do compute the gradients of the cost function. But once you implement numerical gradient checking, which is the topic of this video, you'll be able to absolutely verify for yourself that the code you're writing is indeed computing the derivative of the cost function J.

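When you run gradient descent or an advanced optimization algorithm, the code can look as if it is working: the cost function J(Θ) may decrease on every iteration even though the backpropagation implementation contains a bug. In that case the network can end up with a noticeably higher error than a bug-free implementation would give, and you may not even know that a bug is hurting performance. Gradient checking removes almost all of these problems. Whenever I implement backpropagation or a similar gradient descent algorithm on a neural network or another reasonably complex model, I always run gradient checking. It gives high confidence that forward propagation and backpropagation are implemented correctly, and in my experience it eliminates nearly all problems caused by a buggy implementation.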

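In the previous lectures I asked you to take on faith that the formulas for δ and the derivative terms really compute the gradient of the cost function. Once you implement gradient checking, the topic of this lecture, you can verify for yourself that the code is in fact computing the derivative of the cost function J(θ).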

So here's the idea. Consider the following example. Suppose that I have the function J of theta and some value theta, and for this example I'm going to assume that theta is just a real number. Let's say that I want to estimate the derivative of this function at this point, and so the derivative is equal to the slope of that tangent line.

Here's a procedure for numerically approximating the derivative. I'm going to compute theta plus epsilon, a value a little bit to the right, and theta minus epsilon, a little bit to the left. I'm going to look at those two points, connect them by a straight line, and use the slope of that little red line as my approximation to the derivative. The true derivative is the slope of that blue line over there. So it seems like it would be a pretty good approximation.

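Here is an example. We have the graph of the cost function J(θ) and a value θ, where θ is a real number. The derivative of the function at this point is the slope of the tangent line. There is a way to approximate this derivative numerically: mark the point θ + ε (epsilon) to the right and the point θ - ε to the left, then connect the two points on the curve, J(θ + ε) and J(θ - ε), with a straight line. The true derivative is the slope of the blue line, but the slope of the red line is a fairly good approximation.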

Mathematically, the slope of this red line is this vertical height divided by this horizontal width. This point on top is J of theta plus epsilon, and this point here is J of theta minus epsilon, so the vertical difference is J of theta plus epsilon minus J of theta minus epsilon, and the horizontal distance is just 2 epsilon. So my approximation is that the derivative with respect to theta of J of theta, at this value of theta, is approximately J of theta plus epsilon minus J of theta minus epsilon, over 2 epsilon. Usually I use a pretty small value for epsilon, maybe on the order of 10 to the minus 4. There's usually a large range of different values for epsilon that work just fine. In fact, if you let epsilon become really small, then mathematically this term becomes the derivative; it becomes exactly the slope of the function at this point. It's just that we don't want to use an epsilon that's too small, because then you might run into numerical problems. So I usually use epsilon around 10 to the minus 4.

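Mathematically, the slope of the red line is its vertical height divided by its horizontal width. The height is J(θ + ε) - J(θ - ε) and the width is 2ε, so the approximation to the derivative of J(θ) with respect to θ is:

d/dθ J(θ) ≈ (J(θ + ε) - J(θ - ε)) / (2ε)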
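
As a minimal sketch of the two-sided formula for a scalar θ, here is a Python version (the course itself uses Octave). The helper name `two_sided_derivative` and the cost function J(θ) = θ³ are assumptions chosen only for illustration; θ³ is convenient because its exact derivative, 3θ², is known.

```python
def two_sided_derivative(J, theta, epsilon=1e-4):
    """Approximate dJ/dtheta at theta using the two-sided difference."""
    return (J(theta + epsilon) - J(theta - epsilon)) / (2 * epsilon)


def J(theta):
    # Hypothetical cost function for illustration; its exact derivative is 3 * theta**2.
    return theta ** 3


theta = 1.0
grad_approx = two_sided_derivative(J, theta)
true_grad = 3 * theta ** 2

# The two values should agree to several decimal places.
print("numerical:", grad_approx, "exact:", true_grad)
```

Any function of a single real number can be passed in place of `J`.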

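Here ε is a small value, around ε = 10^(-4). A wide range of values of ε usually works fine. As ε becomes very small, the expression mathematically becomes the derivative itself, that is, the slope of the function at that point. However, we avoid making ε too small, because that can cause numerical problems. A value of about 10^(-4) is typical.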
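
The numerical problems mentioned above come from floating-point round-off. The following Python sketch, which assumes sin as a stand-in cost function because its exact derivative is cos, measures the approximation error for a large, a moderate, and an extremely small ε. The specific exponents tried are illustrative assumptions, not values from the lecture.

```python
import math


def two_sided(J, theta, epsilon):
    """Two-sided difference approximation of dJ/dtheta."""
    return (J(theta + epsilon) - J(theta - epsilon)) / (2 * epsilon)


theta = 1.0
true_grad = math.cos(theta)  # exact derivative of sin at theta

errors = {}
for exponent in (1, 4, 15):
    epsilon = 10.0 ** -exponent
    errors[exponent] = abs(two_sided(math.sin, theta, epsilon) - true_grad)
    print(f"epsilon = 1e-{exponent}: absolute error = {errors[exponent]:.3e}")
```

With double-precision floats, the moderate value gives the smallest error of the three: a large ε makes the straight line a poor match for the tangent, while a tiny ε leaves too few significant digits in the difference J(θ + ε) - J(θ - ε).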

And by the way, some of you may have seen an alternative formula for estimating the derivative, which is this formula. This one on the right is called a one-sided difference, whereas the formula on the left is called a two-sided difference. The two-sided difference gives us a slightly more accurate estimate, so I usually use that rather than the one-sided difference estimate.

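There is also an alternative formula for estimating the derivative. The formula on the right is called the one-sided difference and the formula on the left is called the two-sided difference:

Two-sided difference: (J(θ + ε) - J(θ - ε)) / (2ε)
One-sided difference: (J(θ + ε) - J(θ)) / ε

The two-sided difference gives a slightly more accurate estimate, so it is preferred over the one-sided difference.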
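
To see the accuracy gap concretely, the Python sketch below applies both formulas to the same function with the same ε. The choice of exp as the test function is an assumption for illustration (its derivative is exp itself). As a general property of these formulas, the one-sided error shrinks roughly in proportion to ε, while the two-sided error shrinks roughly in proportion to ε².

```python
import math


def one_sided(J, theta, epsilon):
    """One-sided difference: (J(theta + eps) - J(theta)) / eps."""
    return (J(theta + epsilon) - J(theta)) / epsilon


def two_sided(J, theta, epsilon):
    """Two-sided difference: (J(theta + eps) - J(theta - eps)) / (2 * eps)."""
    return (J(theta + epsilon) - J(theta - epsilon)) / (2 * epsilon)


theta = 1.0
epsilon = 1e-4
true_grad = math.exp(theta)  # exp is its own derivative

one_sided_error = abs(one_sided(math.exp, theta, epsilon) - true_grad)
two_sided_error = abs(two_sided(math.exp, theta, epsilon) - true_grad)

print(f"one-sided error: {one_sided_error:.3e}")
print(f"two-sided error: {two_sided_error:.3e}")
```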

So, concretely, when you implement this in Octave, you implement the following call to compute gradApprox, which is going to be our approximation to the derivative, using just this formula: J of theta plus epsilon minus J of theta minus epsilon, divided by 2 times epsilon. This will give you a numerical estimate of the gradient at that point. And in this example it seems like it's a pretty good estimate.

Concretely, in Octave you compute gradApprox, the approximate derivative. This gives a numerical estimate of the gradient at that point, and in this example it is a fairly good estimate.

gradApprox = (J(theta + EPSILON) - J(theta - EPSILON))/(2 * EPSILON)

Now, on the previous slide we considered the case where theta was a real number. Let's look at the more general case where theta is a vector parameter, so let's say theta is in R^n, and it might be an unrolled version of the parameters of our neural network. So theta is a vector that has n elements, theta 1 up to theta n. We can then use a similar idea to approximate all the partial derivative terms. Concretely, the partial derivative of the cost function with respect to the first parameter, theta 1, can be obtained by taking J and increasing theta 1, so you have J of theta 1 plus epsilon and so on, minus J of theta 1 minus epsilon and so on, divided by 2 epsilon. The partial derivative with respect to the second parameter, theta 2, is again this thing, except that here you're increasing theta 2 by epsilon and here you're decreasing theta 2 by epsilon. And so on down to the derivative with respect to theta n, where you increase and decrease theta n by epsilon. So these equations give you a way to numerically approximate the partial derivative of J with respect to any one of your parameters theta i. Concretely, what you implement is therefore the following.

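Now consider the more general case in which θ is a parameter vector. θ is in R^(n), a vector with n elements θ1, θ2, θ3, ..., θn, and it may be the unrolled version of the network's parameter matrices Θ^(1), Θ^(2), Θ^(3). The same idea can be used to approximate every partial derivative term.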

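The partial derivative of the cost function J(θ) with respect to θ1 is approximated by increasing and decreasing only θ1 by ε. The partial derivative with respect to θ2 is approximated by increasing and decreasing only θ2 by ε, and so on down to θn, where only θn is increased and decreased by ε. These equations give a way to numerically approximate the partial derivative of J with respect to any one parameter θi, and this is exactly what gets implemented.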
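
Written out, the approximations described above take the following form. This is a reconstruction from the lecture narration rather than a copy of the slide, so the exact layout is an assumption.

```latex
\begin{align*}
\frac{\partial}{\partial \theta_1} J(\theta) &\approx
  \frac{J(\theta_1 + \epsilon, \theta_2, \dots, \theta_n) - J(\theta_1 - \epsilon, \theta_2, \dots, \theta_n)}{2\epsilon} \\
\frac{\partial}{\partial \theta_2} J(\theta) &\approx
  \frac{J(\theta_1, \theta_2 + \epsilon, \dots, \theta_n) - J(\theta_1, \theta_2 - \epsilon, \dots, \theta_n)}{2\epsilon} \\
&\;\;\vdots \\
\frac{\partial}{\partial \theta_n} J(\theta) &\approx
  \frac{J(\theta_1, \theta_2, \dots, \theta_n + \epsilon) - J(\theta_1, \theta_2, \dots, \theta_n - \epsilon)}{2\epsilon}
\end{align*}
```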

We implement the following in Octave to numerically compute the derivatives. We say for i = 1:n, where n is the dimension of our parameter vector theta. I usually do this with the unrolled version of the parameters, so theta is just a long list of all of the parameters in my neural network, say. I'm going to set thetaPlus = theta, then increase thetaPlus(i) by epsilon. So thetaPlus is equal to theta except for thetaPlus(i), which is now incremented by epsilon: thetaPlus is theta 1, theta 2 and so on, then theta i with epsilon added to it, and then on down to theta n. Similarly, these two lines set thetaMinus to the same thing, except that instead of theta i plus epsilon it now has theta i minus epsilon. And then finally you compute gradApprox(i), which gives you your approximation to the partial derivative with respect to theta i of J of theta.

To compute the derivatives numerically, implement the following in Octave.

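EPSILON = 0.0001;
n = numel(theta);                 % number of parameters in the unrolled vector
gradApprox = zeros(size(theta));  % preallocate the result
for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) = thetaPlus(i) + EPSILON;    % increase only the i-th parameter
  thetaMinus = theta;
  thetaMinus(i) = thetaMinus(i) - EPSILON;  % decrease only the i-th parameter
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
end;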

Finally, gradApprox(i) is the approximation to the partial derivative ∂J(θ)/∂θi.
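
For readers who prefer Python, the same loop can be sketched as follows. The function name `numerical_gradient` and the three-parameter cost function are assumptions made up for this example; the cost function was picked only because its exact gradient is easy to write down.

```python
def numerical_gradient(J, theta, epsilon=1e-4):
    """Approximate every partial derivative of J at theta, one parameter at a time."""
    grad_approx = [0.0] * len(theta)
    for i in range(len(theta)):
        theta_plus = list(theta)
        theta_plus[i] += epsilon      # increase only the i-th parameter
        theta_minus = list(theta)
        theta_minus[i] -= epsilon     # decrease only the i-th parameter
        grad_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)
    return grad_approx


def J(theta):
    # Hypothetical cost function for illustration.
    # Exact gradient: [2*t0 + 3*t1, 3*t0, 3*t2**2]
    return theta[0] ** 2 + 3 * theta[0] * theta[1] + theta[2] ** 3


theta = [1.0, 2.0, -1.0]
grad_approx = numerical_gradient(J, theta)
exact_grad = [2 * theta[0] + 3 * theta[1], 3 * theta[0], 3 * theta[2] ** 2]

print("numerical:", grad_approx)
print("exact:    ", exact_grad)
```

Copying `theta` on every iteration mirrors the thetaPlus and thetaMinus assignments in the Octave code and keeps the original parameter vector unchanged.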

The way we use this in our neural network implementation is that we implement this for loop to compute the partial derivative of the cost function with respect to every parameter in the network, and we then take the gradient that we got from backprop. DVec was the derivative we got from backprop; backpropagation was a relatively efficient way to compute the partial derivatives of the cost function with respect to all of our parameters. What I usually do is take my numerically computed derivative, this gradApprox that we just had from up here, and make sure that it is equal, or approximately equal up to small values of numerical round-off, to the DVec that I got from backprop. If these two ways of computing the derivative give me the same answer, or at least very similar answers up to a few decimal places, then I'm much more confident that my implementation of backprop is correct. And when I plug these DVec vectors into gradient descent or some advanced optimization algorithm, I can be much more confident that I'm computing the derivatives correctly, and therefore that hopefully my code will run correctly and do a good job optimizing J of theta.

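Implement the for loop to compute the partial derivative of the cost function J(θ) with respect to every parameter in the network. DVec is the gradient of J(Θ) computed by backpropagation, which is a relatively efficient way to compute the partial derivatives with respect to all parameters. gradApprox is the numerical approximation computed by gradient checking. If DVec and gradApprox are equal, or agree to a few decimal places, you can be much more confident that the backpropagation implementation is correct. When DVec is then passed to gradient descent or an advanced optimization algorithm, you can be confident that the derivatives are computed correctly and that the code is likely to do a good job of optimizing J(Θ).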
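
One possible way to turn "agree to a few decimal places" into a single number is a relative difference between the two vectors. The lecture does not prescribe a particular measure, so the `relative_difference` helper, the 1e-6 tolerance, the toy cost function, and the deliberately broken gradient below are all assumptions for illustration. The analytic gradient functions stand in for the DVec that backpropagation would return.

```python
import math


def numerical_gradient(J, theta, epsilon=1e-4):
    """Two-sided numerical approximation of the gradient of J at theta."""
    grad_approx = []
    for i in range(len(theta)):
        theta_plus = list(theta)
        theta_plus[i] += epsilon
        theta_minus = list(theta)
        theta_minus[i] -= epsilon
        grad_approx.append((J(theta_plus) - J(theta_minus)) / (2 * epsilon))
    return grad_approx


def relative_difference(d_vec, grad_approx):
    """||d_vec - grad_approx|| / (||d_vec|| + ||grad_approx||), an assumed similarity measure."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(d_vec, grad_approx)))
    scale = math.sqrt(sum(a * a for a in d_vec)) + math.sqrt(sum(b * b for b in grad_approx))
    return diff / scale if scale > 0 else 0.0


def J(theta):
    # Hypothetical cost function: sum of squares.
    return sum(t * t for t in theta)


def correct_gradient(theta):
    # Stand-in for a correct backprop result (DVec).
    return [2 * t for t in theta]


def buggy_gradient(theta):
    # Stand-in for a subtly wrong backprop: the last component lost its factor of 2.
    grad = [2 * t for t in theta]
    grad[-1] = theta[-1]
    return grad


theta = [0.5, -1.5, 2.0]
grad_approx = numerical_gradient(J, theta)
TOLERANCE = 1e-6  # assumed threshold, not a value from the lecture

good = relative_difference(correct_gradient(theta), grad_approx)
bad = relative_difference(buggy_gradient(theta), grad_approx)
print("correct gradient passes:", good < TOLERANCE)
print("buggy gradient passes:  ", bad < TOLERANCE)
```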

Finally, I want to put everything together and tell you how to implement this numerical gradient checking. Here's what I usually do. The first thing I do is implement back propagation to compute DVec. This is the procedure we talked about in the earlier video to compute DVec, which may be the unrolled version of these matrices. Then I implement numerical gradient checking to compute gradApprox; this is what I described earlier in this video and on the previous slide. Then I make sure that DVec and gradApprox give similar values, let's say up to a few decimal places. And finally, and this is the important step, before you start to use your code for learning, for seriously training your network, it's important to turn off gradient checking and to no longer compute this gradApprox thing using the numerical derivative formulas that we talked about earlier in this video. The reason is that the numerical gradient checking code, the stuff we talked about in this video, is very computationally expensive; it's a very slow way to try to approximate the derivative.

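Finally, here is a summary of how to implement numerical gradient checking. This is what I usually do.

First, implement backpropagation to compute DVec.

DVec is obtained by unrolling the matrices D^(1), D^(2), D^(3), following the procedure from the previous lecture.

Second, implement numerical gradient checking to compute gradApprox.

This is the procedure described in this lecture.

Third, check that DVec and gradApprox give similar values.

They should agree to a few decimal places.

Finally, turn off gradient checking before using the code to train the network.

It is important not to compute gradApprox during training. The gradient checking code is computationally very expensive and is a very slow way to approximate the derivatives.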
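
The four steps can be sketched in Python as a training function with a switch for the check. Everything named here (`train`, the `check_gradients` flag, the tolerance, the toy cost function, and the analytic `gradient` function standing in for backpropagation) is a hypothetical illustration, not code from the course. The point is only that the numerical check runs once, outside the training loop.

```python
def numerical_gradient(J, theta, epsilon=1e-4):
    """Two-sided numerical approximation of the gradient of J at theta."""
    grad_approx = []
    for i in range(len(theta)):
        theta_plus = list(theta)
        theta_plus[i] += epsilon
        theta_minus = list(theta)
        theta_minus[i] -= epsilon
        grad_approx.append((J(theta_plus) - J(theta_minus)) / (2 * epsilon))
    return grad_approx


def train(J, gradient, theta, learning_rate=0.1, iterations=200,
          check_gradients=False, tolerance=1e-6):
    """Plain gradient descent with an optional one-off gradient check."""
    if check_gradients:
        # Steps 1-3: compare the analytic gradient (DVec) with gradApprox once.
        d_vec = gradient(theta)
        grad_approx = numerical_gradient(J, theta)
        worst = max(abs(a - b) for a, b in zip(d_vec, grad_approx))
        if worst > tolerance:
            raise ValueError(f"gradient check failed, largest difference {worst:.3e}")

    # Step 4: the training loop uses only the fast analytic gradient.
    for _ in range(iterations):
        d_vec = gradient(theta)
        theta = [t - learning_rate * g for t, g in zip(theta, d_vec)]
    return theta


def J(theta):
    # Hypothetical cost function with its minimum at theta = [1, 1].
    return sum((t - 1.0) ** 2 for t in theta)


def gradient(theta):
    # Stand-in for backpropagation: the exact gradient of J.
    return [2.0 * (t - 1.0) for t in theta]


# Debugging run: verify the gradient code first.
train(J, gradient, [0.0, 3.0], iterations=1, check_gradients=True)

# Real training run: the check is switched off.
theta_trained = train(J, gradient, [0.0, 3.0])
print(theta_trained)
```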

In contrast, the back propagation algorithm that we talked about earlier, the thing that computes D1, D2, D3 for DVec, is a much more computationally efficient way of computing the derivatives. So once you've verified that your implementation of back propagation is correct, you should turn off gradient checking and just stop using it. Just to reiterate, you should be sure to disable your gradient checking code before running your algorithm for many iterations of gradient descent, or for many iterations of the advanced optimization algorithms, in order to train your classifier. Concretely, if you were to run numerical gradient checking on every single iteration of gradient descent, or if it were in the inner loop of your costFunction, then your code would be very slow, because the numerical gradient checking code is much slower than the backpropagation algorithm, the method where, you remember, we were computing delta(4), delta(3), delta(2), and so on. That is a much faster way to compute derivatives than gradient checking. So once you've verified that the implementation of back propagation is correct, make sure you turn off or disable your gradient checking code while you train your algorithm, or else your code could run very slowly.

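In contrast, backpropagation, which computes D^(1), D^(2), D^(3) for DVec, is a much more computationally efficient way to compute the derivatives. So once you have verified that the backpropagation implementation is correct, turn off gradient checking and stop using it. To reiterate, disable the gradient checking code before running many iterations of gradient descent or an advanced optimization algorithm to train the classifier. If gradient checking runs on every iteration of gradient descent, or sits in the inner loop of costFunction, the code becomes very slow, because numerical gradient checking is much slower than backpropagation, the method that computes the δ^(l) terms. Backpropagation is a much faster way to compute derivatives than gradient checking, so after verifying it, keep gradient checking disabled during training; otherwise training will be very slow.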
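
A rough way to see why the numerical check is slow is to count cost-function evaluations. In the Python sketch below, the `CountingCost` wrapper, the toy cost function, and the choice of 500 parameters are assumptions for illustration. The two-sided loop evaluates J twice per parameter, and for a neural network each of those evaluations is a full forward pass over the training data.

```python
class CountingCost:
    """Wraps a cost function and counts how many times it is evaluated."""

    def __init__(self, J):
        self.J = J
        self.calls = 0

    def __call__(self, theta):
        self.calls += 1
        return self.J(theta)


def numerical_gradient(J, theta, epsilon=1e-4):
    """Two-sided numerical approximation of the gradient of J at theta."""
    grad_approx = []
    for i in range(len(theta)):
        theta_plus = list(theta)
        theta_plus[i] += epsilon
        theta_minus = list(theta)
        theta_minus[i] -= epsilon
        grad_approx.append((J(theta_plus) - J(theta_minus)) / (2 * epsilon))
    return grad_approx


def J(theta):
    # Hypothetical cost function for illustration.
    return sum(t * t for t in theta)


n = 500
theta = [0.01 * i for i in range(n)]
counted_J = CountingCost(J)
grad_approx = numerical_gradient(counted_J, theta)

print("parameters:", n, "cost evaluations:", counted_J.calls)
```

The count grows linearly with the number of parameters, which is why the check is run once for verification and then disabled.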

So that's how you compute gradients numerically, and that's how you can verify that your implementation of back propagation is correct. Whenever I implement back propagation or a similar gradient descent algorithm for a complicated model, I always use gradient checking, and this really helps me make sure that my code is correct.

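That covers gradient checking, a method for verifying that a backpropagation implementation is correct. Using gradient checking whenever you implement backpropagation or a similar gradient descent algorithm for a complicated model helps confirm that the code is correct.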


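Summary

The backpropagation algorithm is somewhat tricky to implement in practice and is prone to subtle bugs. Gradient checking is a way to detect such bugs. It gives high confidence that forward propagation and backpropagation are implemented correctly, and it eliminates almost all problems caused by a buggy implementation.

Gradient checking approximates the derivative as follows:

d/dθ J(θ) ≈ (J(θ + ε) - J(θ - ε)) / (2ε)

A wide range of values of ε usually works well; ε = 10^(-4) is typical. As ε becomes very small, the expression mathematically becomes the derivative. Concretely, gradApprox is computed in Octave as the approximate derivative, a numerical estimate of the gradient. gradApprox(i) is the approximation to the partial derivative ∂J(θ)/∂θi.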

gradApprox = (J(theta + EPSILON) - J(theta - EPSILON))/(2 * EPSILON)

This process is repeated for every parameter θ1, θ2, θ3, ..., θn.