by 라인하트 Oct 27. 2020

Andrew Ng's Machine Learning (9-2): Backpropagation in Neural Networks

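Professor Andrew Ng, co-founder of the online course platform Coursera, is a leading figure in the AI field. The machine learning course he taught to beginners at Stanford University is available for free online at Coursera (Coursera.org). It is an essential course for anyone new to machine learning, and one you naturally come across when studying AI and machine learning on your own.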

Neural Networks: Learning

Cost Function and Backpropagation

Backpropagation Algorithm

In the previous video, we talked about a cost function for the neural network. In this video, let's start to talk about an algorithm for trying to minimize the cost function. In particular, we'll talk about the back propagation algorithm.

The previous lecture covered the cost function for neural networks. This lecture covers back propagation, an algorithm for minimizing that cost function.

Here's the cost function that we wrote down in the previous video. What we'd like to do is try to find parameters theta to try to minimize J of theta. In order to use either gradient descent or one of the advanced optimization algorithms, what we need to do therefore is to write code that takes as input the parameters theta and computes J of theta and these partial derivative terms. Remember that the parameters in the neural network are these things, theta superscript l subscript ij, each of which is a real number, and so these are the partial derivative terms we need to compute. In order to compute the cost function J of theta, we just use this formula up here, and so what I want to do for most of this video is focus on talking about how we can compute these partial derivative terms.

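The previous lecture introduced the regularized cost function J(Θ).

To find the parameters Θ that minimize J(Θ), we use gradient descent or one of the advanced optimization algorithms. Either way, we need code that takes the parameters Θ as input and computes both J(Θ) and its partial derivatives. Each parameter Θ^(l)ij (superscript l, subscript ij) is a real number, and we need the partial derivative of J(Θ) with respect to each one. Computing J(Θ) itself was already covered, so this lecture focuses on computing the partial derivative terms.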
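For reference, the regularized cost function from the previous lecture and the derivative terms we are after can be written as below. The formula itself is not reproduced in this post, so the index conventions here are assumptions based on the course's usual notation: m training examples, K output units, L layers, and s_l units in layer l not counting the bias unit.

```latex
J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K}
  \left[ y_k^{(i)} \log \left( h_\Theta(x^{(i)}) \right)_k
       + \left( 1 - y_k^{(i)} \right) \log \left( 1 - \left( h_\Theta(x^{(i)}) \right)_k \right) \right]
  + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( \Theta_{ji}^{(l)} \right)^2

% Quantities the code must return for the optimizer:
J(\Theta) \quad \text{and} \quad \frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)
```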

Let's start by talking about the case when we have only one training example. So imagine, if you will, that our entire training set comprises only one training example, which is a pair (x, y). I'm not going to write (x1, y1); I'll just write our one training example as (x, y). Let's step through the sequence of calculations we would do with this one training example.

Suppose the entire training set consists of a single pair (x, y). We write this example as (x, y) rather than (x^(1), y^(1)) and walk through the sequence of calculations for it.

The first thing we do is apply forward propagation in order to compute what our hypothesis actually outputs given the input. Concretely, a(1) is the activation values of the first layer, which is the input layer. So I'm going to set that to x, and then we're going to compute z(2) equals theta(1) a(1), and a(2) equals g, the sigmoid activation function, applied to z(2), and this gives us our activations for the first middle layer, that is, for layer two of the network, and we also add those bias terms. Next we apply two more steps of this forward propagation to compute a(3) and a(4), which is also the output of our hypothesis h of x. So this is our vectorized implementation of forward propagation, and it allows us to compute the activation values for all of the neurons in our neural network.

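We apply forward propagation to compute what the hypothesis outputs for the input x.

First, the activations of the input layer, a^(1), are set to the training example x.

Second, the activations of the second layer are a^(2) = g(z^(2)), where g is the sigmoid function and z^(2) = Θ^(1)*a^(1).

We then add the bias unit a^(2)0, whose value is 1.

Third, the activations of the third layer are a^(3) = g(z^(3)), where z^(3) = Θ^(2)*a^(2).

We then add the bias unit a^(3)0, whose value is 1.

Fourth, the activations of the output layer are a^(4) = g(z^(4)), where z^(4) = Θ^(3)*a^(3).

The output a^(4) equals the hypothesis hΘ(x).

This is the vectorized implementation of forward propagation. It computes the activation values of every unit in the neural network.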
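The forward propagation steps above can be sketched in NumPy as follows. This is a minimal sketch, not the course's own code: the function name `forward_propagate`, the layer sizes (3, 5, 5, 4), and the random weights are all assumptions made for illustration. Each weight matrix is assumed to keep the bias weight in column 0.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, thetas):
    """Return activations a^(1)..a^(L) and weighted inputs z^(2)..z^(L).

    x is a 1-D feature vector. thetas[k] maps layer k+1 to layer k+2 and has
    shape (units_out, units_in + 1); column 0 multiplies the bias unit.
    Every layer except the output layer is stored with its bias unit prepended.
    """
    activations = [np.concatenate(([1.0], x))]  # a^(1) with bias unit a_0 = 1
    zs = []
    for index, theta in enumerate(thetas):
        z = theta @ activations[-1]             # z^(l+1) = Theta^(l) a^(l)
        a = sigmoid(z)                          # a^(l+1) = g(z^(l+1))
        if index < len(thetas) - 1:
            a = np.concatenate(([1.0], a))      # add the bias unit to hidden layers
        zs.append(z)
        activations.append(a)
    return activations, zs

# Hypothetical 4-layer network: 3 inputs, two hidden layers of 5 units, 4 outputs.
rng = np.random.default_rng(0)
layer_sizes = [3, 5, 5, 4]
thetas = [rng.normal(scale=0.5, size=(n_out, n_in + 1))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
x = np.array([0.2, -0.4, 0.7])

activations, zs = forward_propagate(x, thetas)
h = activations[-1]  # a^(4), the hypothesis output h_Theta(x)
```

Keeping every layer's activations around, rather than only the final output, is deliberate: back propagation reuses them when computing the error terms.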

Next, in order to compute the derivatives, we're going to use an algorithm called back propagation. The intuition of the back propagation algorithm is that for each node we're going to compute the term delta superscript l subscript j that's going to somehow represent the error of node j in layer l. So, recall that a superscript l subscript j is the activation of the j-th unit in layer l, and so this delta term is in some sense going to capture our error in the activation of that node, that is, how we might wish the activation of that node were slightly different. Concretely, take the example neural network that we have on the right, which has four layers, so capital L is equal to 4. For each output unit, we're going to compute this delta term. So, delta for the j-th unit in the fourth layer is equal to just the activation of that unit minus what was the actual value of y in our training example. This activation term can also be written h of x subscript j. So this delta term is just the difference between what our hypothesis outputs and what was the value of y in our training set, where y subscript j is the j-th element of the vector-valued y in our labeled training set.

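Next, we use the back propagation algorithm to compute the derivative terms. The intuition is that for each node we compute a term δ^(l)j.

δ^(l)j represents the error of node j in layer l. Recall that a^(l)j is the activation of unit j in layer l; δ^(l)j captures, in some sense, how much we would like that activation to be different.

As an example, take a neural network with four layers, so the total number of layers is L = 4. For each output unit we compute δ^(4)j = a^(4)j - yj. The output a^(4)j is the same as (hΘ(x))j.

So this δ term is the difference between the hypothesis output and yj, the j-th element of the vector y in the training set.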

And by the way, if you think of delta, a and y as vectors, then you can also take those and come up with a vectorized implementation of it, which is just delta 4 gets set as a4 minus y. Here, each of these, delta 4, a4 and y, is a vector whose dimension is equal to the number of output units in our network. So we've now computed the error term delta 4 for our network. What we do next is compute the delta terms for the earlier layers in our network. Here's the formula for computing delta 3: delta 3 is equal to theta 3 transpose times delta 4, dot times g prime of z3. And this dot times is the element-wise multiplication operation that we know from MATLAB. So theta 3 transpose delta 4, that's a vector; g prime of z3, that's also a vector; and so dot times is an element-wise multiplication between these two vectors. This term g prime of z3 formally is actually the derivative of the activation function g evaluated at the input values given by z3. If you know calculus, you can try to work it out yourself and see that you can simplify it to the same answer that I get. But I'll just tell you pragmatically what that means. What you do to compute this g prime, these derivative terms, is just a3 dot times 1 minus a3, where a3 is the vector of activation values for that layer and 1 is the vector of ones. Next you apply a similar formula to compute delta 2, only now it is a2, like so. I won't prove it here, but it's possible to prove, if you know calculus, that this expression is mathematically equal to the derivative of the activation function g, which I'm denoting by g prime.

And finally, that's it: there is no delta 1 term, because the first layer corresponds to the input layer, and that's just the features we observed in our training set, so that doesn't have any error associated with it. It's not like we really want to try to change those values. And so we have delta terms only for layers 2, 3 and 4 for this example.

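Treating δ, a, and y as vectors, δ^(4)j = a^(4)j - yj has the vectorized form

δ^(4) = a^(4) - y

Here δ^(4), a^(4), and y are vectors whose dimension equals the number of output units in the network. With the error term δ^(4) computed, we next compute the δ terms for the earlier layers:

δ^(3) = (Θ^(3))^T δ^(4) .* g'(z^(3))

δ^(2) = (Θ^(2))^T δ^(3) .* g'(z^(2))

Here ' .* ' is element-wise multiplication between two vectors, and g'(z^(3)) is the derivative of the sigmoid function g evaluated at z^(3). In practice the derivative terms are computed as g'(z^(3)) = a^(3) .* (1 - a^(3)) and g'(z^(2)) = a^(2) .* (1 - a^(2)), where a^(3) and a^(2) are the activation vectors and 1 = [1; 1; 1; ...; 1] is a vector of ones. The proof is not covered here, but these expressions can be shown to equal the derivative g' of the activation function.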
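For readers who want the calculus step the lecture skips, the identity g'(z) = g(z)(1 - g(z)) for the sigmoid function can be worked out as follows. This derivation is a supplement, not part of the lecture.

```latex
g(z) = \frac{1}{1 + e^{-z}}

g'(z) = \frac{e^{-z}}{\left( 1 + e^{-z} \right)^2}
      = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}
      = g(z) \left( 1 - g(z) \right)

% Since a^{(3)} = g(z^{(3)}), applying this element-wise gives
g'(z^{(3)}) = a^{(3)} \mathbin{.*} \left( 1 - a^{(3)} \right)
```

The second factor follows from 1 - g(z) = e^{-z} / (1 + e^{-z}).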

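There is no δ^(1) term. The first layer is the input layer, which simply holds the feature values x from the training set. It has no error associated with it, and we do not want to change those feature values. In this example, δ terms exist only for layers 2, 3, and 4.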
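The δ computations for the four-layer example might look like this in NumPy. It is a sketch under stated assumptions: the layer sizes, random weights, and the target vector `y` are made up for illustration, and dropping the first entry of each hidden-layer δ (the entry belonging to the bias unit) is one common convention rather than something spelled out in the lecture.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
# Hypothetical weights for a 3-5-5-4 network; column 0 holds the bias weights.
theta1 = rng.normal(scale=0.5, size=(5, 4))
theta2 = rng.normal(scale=0.5, size=(5, 6))
theta3 = rng.normal(scale=0.5, size=(4, 6))
x = np.array([0.2, -0.4, 0.7])
y = np.array([0.0, 1.0, 0.0, 0.0])  # hypothetical one-hot label

# Forward propagation, keeping every layer's activations.
a1 = np.concatenate(([1.0], x))
a2 = np.concatenate(([1.0], sigmoid(theta1 @ a1)))
a3 = np.concatenate(([1.0], sigmoid(theta2 @ a2)))
a4 = sigmoid(theta3 @ a3)

# Back propagation of the error terms.
delta4 = a4 - y                                   # delta^(4) = a^(4) - y
delta3 = (theta3.T @ delta4) * (a3 * (1.0 - a3))  # (Theta^(3))^T delta^(4) .* g'(z^(3))
delta3 = delta3[1:]                               # drop the bias unit's entry
delta2 = (theta2.T @ delta3) * (a2 * (1.0 - a2))  # (Theta^(2))^T delta^(3) .* g'(z^(2))
delta2 = delta2[1:]                               # drop the bias unit's entry
# There is no delta1: the input layer has no error term.
```

Because the bias unit's activation is fixed at 1, its a .* (1 - a) factor is 0, so the dropped entry carries no information.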

The name back propagation comes from the fact that we start by computing the delta term for the output layer, and then we go back a layer and compute the delta terms for the third hidden layer, and then we go back another step to compute delta 2, and so we're sort of back propagating the errors from the output layer to layer 3 to layer 2, hence the name back propagation.

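The name back propagation comes from the fact that the δ terms are computed backward, starting at the output layer, then δ^(3) for the third layer, then δ^(2). In effect, the error at the output layer is propagated back to layer 3 and then to layer 2.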

Finally, the derivation is surprisingly complicated, surprisingly involved, but if you just do these few steps of computation, it is possible to prove, via a frankly somewhat complicated mathematical proof, that if you ignore regularization then the partial derivative terms you want are exactly given by the activations and these delta terms. This is ignoring lambda, or alternatively setting the regularization term lambda equal to 0. We'll fix this detail about the regularization term later, but by performing back propagation and computing these delta terms, you can pretty quickly compute these partial derivative terms for all of your parameters.

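Finally, the derivation is surprisingly complicated, but after these few steps of computation the result can be proven mathematically. If we ignore regularization, the partial derivative terms are given exactly by the activations and the δ terms: ∂J(Θ)/∂Θ^(l)ij = a^(l)j δ^(l+1)i.

This ignores lambda (λ), or equivalently treats the regularization parameter λ as 0. The regularization term is handled later. By running back propagation to compute the δ terms, we can quickly compute the partial derivative terms for all parameters.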
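The claim that the partial derivatives are products of activations and δ terms can be checked numerically. The sketch below uses a small hypothetical 2-3-2 network (sizes, weights, and data are assumptions) with λ = 0, and compares one back propagation gradient entry against a central finite difference of the cost. The finite-difference comparison is added here only as a sanity check; it is not part of this lecture.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta1, theta2, x, y):
    """Unregularized cross-entropy cost for a single example (lambda = 0)."""
    a1 = np.concatenate(([1.0], x))
    a2 = np.concatenate(([1.0], sigmoid(theta1 @ a1)))
    h = sigmoid(theta2 @ a2)
    return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

def gradients(theta1, theta2, x, y):
    """Back propagation: dJ/dTheta^(l)_ij = a^(l)_j * delta^(l+1)_i."""
    a1 = np.concatenate(([1.0], x))
    a2 = np.concatenate(([1.0], sigmoid(theta1 @ a1)))
    a3 = sigmoid(theta2 @ a2)
    delta3 = a3 - y
    delta2 = ((theta2.T @ delta3) * a2 * (1.0 - a2))[1:]  # drop bias entry
    # The outer product places a^(l)_j * delta^(l+1)_i at row i, column j.
    return np.outer(delta2, a1), np.outer(delta3, a2)

rng = np.random.default_rng(2)
theta1 = rng.normal(scale=0.5, size=(3, 3))  # 2 inputs + bias -> 3 hidden units
theta2 = rng.normal(scale=0.5, size=(2, 4))  # 3 hidden + bias -> 2 outputs
x = np.array([0.5, -1.0])
y = np.array([1.0, 0.0])

grad1, grad2 = gradients(theta1, theta2, x, y)

# Central finite difference on one entry of Theta^(1) as a sanity check.
eps = 1e-5
bumped_up, bumped_down = theta1.copy(), theta1.copy()
bumped_up[1, 2] += eps
bumped_down[1, 2] -= eps
numeric = (cost(bumped_up, theta2, x, y) - cost(bumped_down, theta2, x, y)) / (2 * eps)
```

If the two agree to several decimal places, `grad1[1, 2]` and `numeric` are measuring the same derivative.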

So this is a lot of detail. Let's take everything and put it all together to talk about how to implement back propagation to compute derivatives with respect to your parameters. And for the case when we have a large training set, not just a training set of one example, here's what we do. Suppose we have a training set of m examples like that shown here. The first thing we're going to do is set these capital delta l subscript ij. So this triangular symbol? That's actually the capital Greek letter delta. The symbol we had on the previous slide was the lower case delta. So the triangle is capital delta. We're going to set this equal to zero for all values of l, i, j. Eventually, this capital delta l ij will be used to compute the partial derivative term, the partial derivative with respect to theta l ij of J of theta. So as we'll see in a second, these deltas are going to be used as accumulators that will slowly add things in order to compute these partial derivatives.

Next, we're going to loop through our training set. So, we'll say for i equals 1 through m, and so for the i-th iteration, we're going to be working with the training example (xi, yi). So the first thing we're going to do is set a1, which is the activations of the input layer, to be equal to xi, the inputs for our i-th training example, and then we're going to perform forward propagation to compute the activations for layer two, layer three and so on up to the final layer, layer capital L. Next, we're going to use the output label yi from this specific example we're looking at to compute the error term delta L for the output layer. So delta L is what our hypothesis outputs minus what the target label was. And then we're going to use the back propagation algorithm to compute delta L minus 1, delta L minus 2, and so on down to delta 2, and once again there is no delta 1 because we don't associate an error term with the input layer.

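We now put everything together to implement back propagation, computing the derivatives with respect to the parameter matrices Θ over an entire training set rather than a single example.

1) We have a training set of m examples.

2) Initialize Δ^(l)ij.

The triangle symbol is the Greek capital letter delta: δ is lower case delta and Δ is capital delta. Every element of each Δ matrix is initialized to 0.

3) Loop over all training examples with a for loop.

3-1) Set the input layer a^(1) to the training example x^(i).

3-2) Use forward propagation to compute a^(l) for l = 2, 3, ..., L.

3-3) Use y^(i) to compute the output layer error δ^(L) = a^(L) - y^(i), then back propagate to compute δ^(L-1), δ^(L-2), ..., δ^(2).

3-4) For every unit in every layer, accumulate the gradient: Δ^(l)ij := Δ^(l)ij + a^(l)j δ^(l+1)i.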

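The Δ terms act as accumulators: they keep adding up the values used to compute the partial derivatives as the loop repeats for i = 1 to m, working with one training example (x^(i), y^(i)) at a time. In each iteration, we first set the input layer activations a^(1) to x^(i), the input of the i-th training example. We then run forward propagation through layer two, layer three, and so on up to the final layer L. Next, we use the output label y^(i) to compute the error term δ^(L) = a^(L) - y^(i), the hypothesis output minus the target label. Back propagation then computes δ^(L-1), δ^(L-2), ..., δ^(2). Once again, there is no δ^(1), because no error term is associated with the input layer.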

And finally, we're going to use these capital delta terms to accumulate these partial derivative terms that we wrote down on the previous line. And by the way, if you look at this expression, it's possible to vectorize this too. Concretely, if you think of capital delta ij as a matrix, indexed by subscript ij, then, if capital delta l is a matrix, we can rewrite this as capital delta l gets updated as capital delta l plus lower case delta l plus one times a l transpose. So that's a vectorized implementation of this that automatically does an update for all values of i and j.

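Finally, we use the Δ^(l)ij terms to accumulate the partial derivative terms. This update can also be vectorized instead of looping over every i and j.

Thinking of Δ^(l) as a matrix whose entries are indexed by the subscripts ij, the update becomes Δ^(l) := Δ^(l) + δ^(l+1) (a^(l))^T, which automatically updates all values of i and j at once.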
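To see that the vectorized update does the same work as the element-by-element one, the following sketch accumulates a single hypothetical example both ways. The vector sizes and random values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical values for one training example at some layer l.
a_l = np.concatenate(([1.0], rng.random(3)))  # a^(l) including the bias unit
delta_next = rng.normal(size=2)               # delta^(l+1)

# Element-by-element: Delta_ij := Delta_ij + a^(l)_j * delta^(l+1)_i
Delta_loop = np.zeros((2, 4))
for i in range(2):
    for j in range(4):
        Delta_loop[i, j] += a_l[j] * delta_next[i]

# Vectorized: Delta := Delta + delta^(l+1) (a^(l))^T, an outer product.
Delta_vec = np.zeros((2, 4))
Delta_vec += np.outer(delta_next, a_l)
```

For 1-D NumPy arrays, `np.outer` plays the role of the column-times-row product δ^(l+1) (a^(l))^T.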

Finally, after executing the body of the for-loop, we then go outside the for-loop and we compute the following. We compute capital D as follows, and we have two separate cases for j equals zero and j not equals zero. The case of j equals zero corresponds to the bias term, so when j equals zero, that's why we're missing the extra regularization term. Finally, while the formal proof is pretty complicated, what you can show is that once you've computed these D terms, that is exactly the partial derivative of the cost function with respect to each of your parameters, and so you can use those in either gradient descent or in one of the advanced optimization algorithms.

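Finally, after the for loop completes, we add the regularization term and divide by the number of training examples m:

D^(l)ij := (1/m) (Δ^(l)ij + λ Θ^(l)ij) if j ≠ 0

D^(l)ij := (1/m) Δ^(l)ij if j = 0

The two cases for D separate the terms that are regularized from those that are not. j = 0 corresponds to the bias term, so it has no extra regularization term. The formal proof is fairly complicated, but once the D terms are computed, they are exactly the partial derivatives of the cost function with respect to each parameter: ∂J(Θ)/∂Θ^(l)ij = D^(l)ij.

These D terms can then be used in gradient descent or in one of the advanced optimization algorithms.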
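Putting the whole procedure together, a full pass over m examples could be sketched as below. The function name `backprop_gradients`, the network sizes, and the random data are assumptions for illustration. The regularization term is scaled by λ/m, which is assumed here so that it matches a cost function whose penalty is (λ/2m) times the sum of squared weights.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def backprop_gradients(thetas, X, Y, lam):
    """Return D^(l) for every layer: the gradient of the regularized cost."""
    m = X.shape[0]
    Deltas = [np.zeros_like(theta) for theta in thetas]  # accumulators, all zero
    for x, y in zip(X, Y):
        # Forward propagation, keeping each layer's activations.
        activations = [np.concatenate(([1.0], x))]
        for index, theta in enumerate(thetas):
            a = sigmoid(theta @ activations[-1])
            if index < len(thetas) - 1:
                a = np.concatenate(([1.0], a))  # bias unit for hidden layers
            activations.append(a)
        # Back propagation, from the output layer down to layer 2.
        delta = activations[-1] - y             # delta^(L) = a^(L) - y
        for l in range(len(thetas) - 1, -1, -1):
            Deltas[l] += np.outer(delta, activations[l])
            if l > 0:  # the input layer has no error term
                a = activations[l]
                delta = ((thetas[l].T @ delta) * a * (1.0 - a))[1:]
    grads = []
    for theta, Delta in zip(thetas, Deltas):
        D = Delta / m
        D[:, 1:] += (lam / m) * theta[:, 1:]    # skip column j = 0 (bias weights)
        grads.append(D)
    return grads

# Hypothetical data: 6 examples, 3 features, 4 one-hot classes.
rng = np.random.default_rng(4)
layer_sizes = [3, 5, 5, 4]
thetas = [rng.normal(scale=0.5, size=(n_out, n_in + 1))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
X = rng.random((6, 3))
Y = np.eye(4)[rng.integers(0, 4, size=6)]

grads = backprop_gradients(thetas, X, Y, lam=1.0)
```

Each returned matrix has the same shape as its Θ^(l), so the list can be handed to gradient descent or flattened for an off-the-shelf optimizer.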

So that's the back propagation algorithm and how you compute derivatives of your cost function for a neural network. I know this looks like a lot of details and a lot of steps strung together. But both in the programming assignment write-up and later in this video, we'll give you a summary of this so we can have all the pieces of the algorithm together, so that you know exactly what you need to implement if you want to implement back propagation to compute the derivatives of your neural network's cost function with respect to those parameters.

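That covers the back propagation algorithm and how to compute the derivatives of a neural network's cost function. This lecture packs in many details and many steps strung together. However, the programming assignment and the rest of the course summarize them, so that all the pieces of the algorithm come together and you know exactly what to implement when using back propagation to compute the derivatives of the cost function with respect to the parameters.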