by 라인하트 Oct 29. 2020

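Andrew Ng's Machine Learning (9-3): Understanding Backpropagation in Neural Networks

Professor Andrew Ng, a co-founder of the online learning platform Coursera, is a leading figure in the artificial intelligence field. The introductory machine learning course he taught at Stanford University is available for free on Coursera (Coursera.org). It is an essential course for machine learning beginners, and one that anyone studying AI and machine learning on their own naturally comes across.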

Neural Networks: Learning

Cost Function and Backpropagation

Backpropagation Intuition

In the previous video, we talked about the backpropagation algorithm. To a lot of people seeing it for the first time, their first impression is often that, wow, this is a really complicated algorithm, and there are all these different steps, and I'm not sure how they fit together. And it's kinda this black box of all these complicated steps. In case that's how you're feeling about backpropagation, that's actually okay. Backpropagation, maybe unfortunately, is a less mathematically clean, or less mathematically simple, algorithm compared to linear regression or logistic regression. And I've actually used backpropagation, you know, pretty successfully for many years. And even today I still sometimes feel like I don't have a very good sense of just what it's doing, or intuition about what backpropagation is doing. For those of you that are doing the programming exercises, those will at least mechanically step you through the different steps of how to implement back prop. So you'll be able to get it to work for yourself.

And what I want to do in this video is look a little bit more at the mechanical steps of backpropagation, and try to give you a little more intuition about what the mechanical steps of back prop are doing, to hopefully convince you that, you know, it's at least a reasonable algorithm. In case even after this video backpropagation still seems very black box, with too many complicated steps, and a little bit magical to you, that's actually okay. Even though I've used back prop for many years, sometimes this is a difficult algorithm to understand, but hopefully this video will help a little bit. In order to better understand backpropagation, let's take another closer look at what forward propagation is doing.

The previous lecture explained the backpropagation algorithm. The first impression of backpropagation is often that it is really complicated: there are many different steps, and it is not clear how they fit together. It feels like a black box made of complicated steps. If that is how you feel, that is okay. Backpropagation is, unfortunately, less mathematically clean and simple than linear regression or logistic regression. I have used backpropagation quite successfully for many years, and even today I sometimes feel I do not have a very good intuition for what it is doing. The programming exercises will at least walk you mechanically through the steps of implementing backpropagation, so you will be able to get it working yourself.

This lecture looks more closely at the mechanical steps of backpropagation and tries to give some intuition for what those steps are doing, hopefully enough to convince you that it is at least a reasonable algorithm. If backpropagation still seems too complicated, or like a magical black box, after this lecture, that is okay. Even though I have used it for years, it is a difficult algorithm to understand. Hopefully this lecture helps a little. To understand backpropagation better, let's first take another close look at what forward propagation does.

Here's a neural network with two input units that is not counting the bias unit, and two hidden units in this layer, and two hidden units in the next layer. And then, finally, one output unit. Again, these counts two, two, two, are not counting these bias units on top.

Here is a neural network.

Not counting the bias units, it has two input units in the input layer, two hidden units in the second layer, two hidden units in the third layer, and one output unit in the final output layer.

In order to illustrate forward propagation, I'm going to draw this network a little bit differently. In particular, I'm going to draw this neural network with the nodes drawn as these very fat ellipses, so that I can write text in them. When performing forward propagation, we might have some particular example, say some example (x(i), y(i)). And it'll be this x(i) that we feed into the input layer. So x(i)1 and x(i)2 are the values we set the input layer to. And when we forward propagate to the first hidden layer here, what we do is compute z(2)1 and z(2)2. So these are the weighted sums of inputs of the input units. And then we apply the sigmoid, or logistic, activation function to the z values, and that gives us the activation values a(2)1 and a(2)2. And then we forward propagate again to get z(3)1, and apply the sigmoid activation function to that to get a(3)1. And similarly, like so, until we get z(4)1. Apply the activation function. This gives us a(4)1, which is the final output value of the neural network.

To illustrate forward propagation, the units of the network are drawn a little larger. The nodes are drawn as fat ellipses so that values can be written inside them.

Take a training example (x^(i), y^(i)). The input layer has one unit, or node, for each component of x^(i), so the inputs are written x^(i)_1 and x^(i)_2. Propagating to the second layer, the first hidden layer, computes z^(2)_1 and z^(2)_2. Each z is the sum of the input units multiplied by their weights. The sigmoid function g(z) is then applied, so a^(2)_1 = g(z^(2)_1) and a^(2)_2 = g(z^(2)_2). Moving to the next layer, z^(3)_1 and z^(3)_2 are computed and passed through the sigmoid activation function: a^(3)_1 = g(z^(3)_1) and a^(3)_2 = g(z^(3)_2). The output layer repeats the process: z^(4)_1 is computed and passed through the sigmoid function to give the final output value, a^(4)_1 = g(z^(4)_1).
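The forward pass described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the weight values, the input `[1.0, 2.0]`, and the names `forward`, `theta1`, `theta2`, `theta3` are hypothetical and do not come from the lecture. Each weight row starts with the weight for the bias unit, which always has the value 1.

```python
import math

def sigmoid(z):
    """Logistic activation g(z)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, thetas):
    """Forward-propagate one example through the network.

    thetas[l][j] holds the weights feeding unit j of the next layer;
    entry 0 of each row multiplies the bias unit (always 1).
    Returns the z values and activations of every layer.
    """
    a = list(x)
    zs, activations = [], [a]
    for theta in thetas:
        a_with_bias = [1.0] + a  # prepend the bias unit
        z = [sum(w * v for w, v in zip(row, a_with_bias)) for row in theta]
        a = [sigmoid(zj) for zj in z]
        zs.append(z)
        activations.append(a)
    return zs, activations

# Hypothetical weights for the 2-2-2-1 network in the text.
theta1 = [[0.1, 0.2, -0.3], [0.0, 0.4, 0.5]]   # layer 1 -> layer 2
theta2 = [[-0.2, 0.3, 0.1], [0.2, -0.1, 0.6]]  # layer 2 -> layer 3
theta3 = [[0.05, 0.7, -0.4]]                   # layer 3 -> output

zs, activations = forward([1.0, 2.0], [theta1, theta2, theta3])
h = activations[-1][0]  # a^(4)_1, the network output
print("z^(2):", zs[0])
print("output a^(4)_1:", h)
```

Here `zs[0]`, `zs[1]`, and `zs[2]` correspond to z^(2), z^(3), and z^(4), and `activations[0]` is the input itself.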

Let's erase this arrow to give myself some more space. And if you look at what this computation really is doing, focusing on this hidden unit, let's say, we have these weights. Shown in magenta there is my weight theta(2)10 (the indexing is not important). This weight here, which I'm highlighting in red, is theta(2)11, and this weight here, which I'm drawing in cyan, is theta(2)12. So the way we compute this value z(3)1 is: z(3)1 is equal to this magenta weight times this value, so that's theta(2)10 times 1. And then plus this red weight times this value, so that's theta(2)11 times a(2)1. And finally this cyan weight times this value, which is therefore plus theta(2)12 times a(2)2.

To see how the computation actually works, focus on the hidden unit outlined in pink. Its z value is the sum of the previous layer's activation values multiplied by the weights θ. The indexing is not important, but the first weight is written θ^(2)_10, the second θ^(2)_11, and the third θ^(2)_12. Therefore, z^(3)_1 = θ^(2)_10 * 1 + θ^(2)_11 * a^(2)_1 + θ^(2)_12 * a^(2)_2.
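Written out as an equation, the weighted sum for this unit and its general form look as follows. The symbol s_l (the number of non-bias units in layer l) and the convention a^(l)_0 = 1 for the bias unit are notational assumptions added here for the general case.

```latex
z^{(3)}_1 = \Theta^{(2)}_{10}\cdot 1 + \Theta^{(2)}_{11}\,a^{(2)}_1 + \Theta^{(2)}_{12}\,a^{(2)}_2,
\qquad
z^{(l+1)}_j = \sum_{k=0}^{s_l} \Theta^{(l)}_{jk}\,a^{(l)}_k \quad\text{with } a^{(l)}_0 = 1 .
```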

And so that's forward propagation. And it turns out that, as we'll see later in this video, what backpropagation is doing is a process very similar to this. Except that instead of the computations flowing from the left to the right of this network, the computations instead flow from the right to the left of the network, using a very similar computation as this. And I'll say in two slides exactly what I mean by that.

That is forward propagation. As will become clear later in this lecture, backpropagation is a very similar process. Forward propagation computes from the left of the network to the right, and backpropagation computes from the right to the left. The computation performed at each unit is similar to the computation of z.

To better understand what backpropagation is doing, let's look at the cost function. It's just the cost function that we had for when we have only one output unit. If we have more than one output unit, we just have a summation you know over the output units indexed by k there. If you have only one output unit then this is a cost function.

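To understand backpropagation, here is the binary classification cost function J(Θ).

This is the cost function J(Θ) used when there is a single output unit. With more than one output unit, say K of them, a summation over k = 1 to K is added.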
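The slide with the formula is not reproduced in this text. As a reference, the regularized cost function normally used for this kind of network can be written as below. The symbols m (number of training examples), K (number of output units), L (number of layers), and s_l (number of non-bias units in layer l) are assumed here, since the text above does not define them.

```latex
J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}
\Big[\, y^{(i)}_k \log\big(h_\Theta(x^{(i)})\big)_k
      + \big(1 - y^{(i)}_k\big)\log\big(1 - \big(h_\Theta(x^{(i)})\big)_k\big) \Big]
+ \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}
\big(\Theta^{(l)}_{ji}\big)^2
```

With a single output unit (K = 1) the inner sum over k disappears, and with λ = 0 the last term drops out.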

And we do forward propagation and backpropagation on one example at a time. So let's just focus on the single example (x(i), y(i)), and focus on the case of having one output unit. So y(i) here is just a real number. And let's ignore regularization, so lambda equals 0, and this final term, the regularization term, goes away. Now if you look inside the summation, you find the cost term associated with the training example, that is, the cost associated with the training example (x(i), y(i)).

That's going to be given by this expression. So, the cost of example i is written as follows. And what this cost function does is it plays a role similar to the squared error. So, rather than looking at this complicated expression, if you want you can think of cost(i) as being approximately the squared difference between what the neural network outputs and what the actual value is. Just as in logistic regression, we actually prefer to use the slightly more complicated cost function using the log. But for the purpose of intuition, feel free to think of the cost function as being sort of the squared error cost function. And so this cost(i) measures how well the network is doing on correctly predicting example i. How close is the output to the actual observed label y(i)?

Assume a single training example (x^(i), y^(i)) and a single output unit. Forward propagation and backpropagation are both carried out on this one simplified example. y^(i) is a real number, and the regularization term is ignored: setting the regularization parameter lambda (λ) to 0 makes it vanish. What remains inside the summation of J(Θ) is the cost term associated with the training example (x^(i), y^(i)).

cost(i) can be thought of in a simpler way: it is approximately the squared difference between the output unit's value and the actual value of the training example. As with logistic regression, the slightly more complicated cost function using the log is preferred in practice. For intuition, though, think of the cost function as a kind of squared error function. cost(i) therefore measures how well the network predicts training example i, that is, how close the output is to the actual observed label y^(i).
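To make the comparison concrete, the sketch below evaluates the log-based cost for a single example next to the squared error, for a label y = 1 and a few made-up outputs. The function names `log_cost` and `squared_error` and the sample values are illustrative assumptions, not part of the lecture.

```python
import math

def log_cost(h, y):
    """cost(i) for one example with one output unit (no regularization)."""
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))

def squared_error(h, y):
    """The simpler quantity used in the text for intuition."""
    return (h - y) ** 2

# Both measures grow as the output h moves away from the label y = 1.
for h in (0.9, 0.6, 0.1):
    print(f"h={h}: log cost={log_cost(h, 1):.4f}, "
          f"squared error={squared_error(h, 1):.4f}")
```

The two functions are not equal, but both are small when the output is close to the label and large when it is far away, which is the sense in which the squared error serves as intuition.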

Now let's look at what backpropagation is doing. One useful intuition is that backpropagation is computing these delta superscript l subscript j terms. And we can think of these as the quote "error" of the activation value that we got for unit j in the lth layer. More formally, and this is maybe only for those of you who are familiar with calculus, what the delta terms actually are is this: they're the partial derivatives of the cost function with respect to z(l)j, that is, the weighted sum of inputs that we were computing, these z terms. So concretely, the cost function is a function of the label y and of the value h(x), the output value of the neural network. And if we could go inside the neural network and just change those z(l)j values a little bit, then that would affect the values that the neural network is outputting. And that would end up changing the cost function. And again, really, this is only for those of you who are expert in calculus. If you're comfortable with partial derivatives, what these delta terms are is they turn out to be the partial derivatives of the cost function with respect to these intermediate terms that we were computing.

And so they're a measure of how much we would like to change the neural network's weights in order to affect these intermediate values of the computation, so as to affect the final output of the neural network h(x) and therefore affect the overall cost. In case I lost some of you with this partial derivative intuition, in case that doesn't make sense, don't worry about it. The rest of this we can do without really talking about partial derivatives.

Now let's look at what backpropagation does.

δ^(l)_j is the cost error of unit j in layer l, where l is the superscript and j is the subscript. It is the error of the activation value obtained at unit j of layer l. More formally, δ^(l)_j is the partial derivative of the cost with respect to z^(l)_j, where z is the sum of the inputs multiplied by the weights Θ. The cost function is a function of the label y^(i) and the network's output hΘ(x^(i)). Changing z^(l)_j slightly inside the network affects the output and therefore changes the cost. That is, δ^(l)_j = ∂cost(i)/∂z^(l)_j.

δ^(l)_j measures how much we would like to change the network's weights Θ in order to affect these intermediate values and, through them, the final output hΘ(x) and the overall cost. If the partial derivatives make this harder to follow, there is no need to worry. The rest can be explained without them.
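The "change z a little and watch the cost" idea can be checked numerically for the output unit. The sketch below assumes the log cost with a sigmoid output; in that case the derivative of cost(i) with respect to the output unit's z works out to a - y. Some presentations, including the walkthrough that follows, write the output error as y - a, which differs only in sign. The names `cost_from_z` and `numerical_delta` and the values of `z` and `y` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost_from_z(z, y):
    """Log cost of one example as a function of the output unit's z."""
    h = sigmoid(z)
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))

def numerical_delta(z, y, eps=1e-6):
    """Central-difference estimate of d cost / d z."""
    return (cost_from_z(z + eps, y) - cost_from_z(z - eps, y)) / (2 * eps)

z, y = 0.3, 1.0                  # hypothetical output-layer z and label
analytic = sigmoid(z) - y        # a - y
numeric = numerical_delta(z, y)  # nudge z and measure the change in cost
print("analytic:", analytic)
print("numeric: ", numeric)
```

The two printed values should agree to several decimal places, which is what it means for delta to be the partial derivative of the cost with respect to z.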

But let's look in more detail at what backpropagation is doing. For the output layer, we first set this delta term, delta(4)1, as y(i) minus a(4)1, if we're doing forward propagation and backpropagation on this training example i. So this is really the error, right? It's the difference between the actual value of y and the value that was predicted, and so we're gonna compute delta(4)1 like so. Next we're gonna propagate these values backwards. I'll explain this in a second. And we end up computing the delta terms for the previous layer: we're gonna end up with delta(3)1 and delta(3)2. And then we're gonna propagate this further backward, and end up computing delta(2)1 and delta(2)2. Now the backpropagation calculation is a lot like running the forward propagation algorithm, but doing it backwards. So here's what I mean.

Here is what the backpropagation algorithm does, explained without derivatives.

The delta term for the single unit in the output layer is δ^(4)_1. First, forward propagation on training example i computes a^(4)_1, that is, hΘ(x^(i)). Backpropagation then starts by setting δ^(4)_1 = y^(i) - a^(4)_1. Delta (δ) is an error because it is the difference between the actual value y of the training example and the value the network predicted. δ^(4)_1 is then propagated to the previous layer to compute δ^(3)_1 and δ^(3)_2 in layer 3, and propagated further back to layer 2 to compute δ^(2)_1 and δ^(2)_2. The backpropagation algorithm is similar to forward propagation, but runs in the opposite direction.

Let's look at how we end up with this value of delta(2)2. So we have delta(2)2. And similar to forward propagation, let me label a couple of the weights. So this weight, which I'm going to draw in magenta, let's say that weight is theta(2)12, and this one down here, which I'll highlight in red, is going to be, let's say, theta(2)22. So if we look at how delta(2)2 is computed, how it's computed for this node, it turns out that what we're going to do is take this value and multiply it by this weight, and add it to this value multiplied by that weight. So it's really a weighted sum of these delta values, weighted by the corresponding edge strength. So concretely, let me fill this in: delta(2)2 is going to be equal to theta(2)12, that magenta weight, times delta(3)1, plus the thing I had in red, that's theta(2)22, times delta(3)2. So it's really literally this red weight times this value, plus this magenta weight times this value. And that's how we wind up with that value of delta. And just as another example, let's look at this value. How do we get that value? Well, it's a similar process. If this weight, which I'm gonna highlight in green, is equal to, say, theta(3)12, then we have that delta(3)2 is going to be equal to that green weight, theta(3)12, times delta(4)1.

Let's look at how δ^(2)_2 is computed.

Here is δ^(2)_2. As in the forward computation, label a couple of the weights: the pink weight θ^(2)_12 and the red weight θ^(2)_22. δ^(2)_2 is then computed as δ^(2)_2 = θ^(2)_12 * δ^(3)_1 + θ^(2)_22 * δ^(3)_2. It is literally the pink weight times the delta it connects to, plus the red weight times the delta it connects to. That is how the value of δ is obtained. As another example, how is the value of δ^(3)_2, marked in black, obtained? The process is similar. Marking the weight θ^(3)_12 in green, δ^(3)_2 = θ^(3)_12 * δ^(4)_1.
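The backward weighted sum can be sketched the same way as the forward one. The example below follows the simplified intuition given in the text: each delta is the sum of the next layer's deltas weighted by the connecting edges. The full backpropagation algorithm additionally multiplies each result by the derivative of the activation, a * (1 - a) for the sigmoid, which this sketch deliberately leaves out. The weight and delta values and the name `backprop_deltas` are hypothetical.

```python
def backprop_deltas(theta, delta_next):
    """Weighted sum of next-layer deltas for each non-bias unit.

    theta[j][k] is the weight from unit k of this layer (k = 0 is the
    bias unit) to unit j of the next layer. Index 0 of the returned
    list corresponds to unit 1; the bias unit is skipped.
    """
    n_units = len(theta[0]) - 1
    return [
        sum(theta[j][k] * delta_next[j] for j in range(len(theta)))
        for k in range(1, n_units + 1)
    ]

# Hypothetical layer 2 -> layer 3 weights and layer-3 deltas.
theta2 = [[-0.2, 0.3, 0.1],   # theta^(2)_10, theta^(2)_11, theta^(2)_12
          [0.2, -0.1, 0.6]]   # theta^(2)_20, theta^(2)_21, theta^(2)_22
delta3 = [0.05, -0.02]        # delta^(3)_1, delta^(3)_2

delta2 = backprop_deltas(theta2, delta3)
# delta^(2)_2 = theta^(2)_12 * delta^(3)_1 + theta^(2)_22 * delta^(3)_2
print("delta^(2)_2:", delta2[1])
```

Note that the same weights used in the forward pass are reused here, only read along the other index: forward propagation sums over a row of `theta2`, while the backward step sums down a column.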

And by the way, so far I've been writing the delta values only for the hidden units, but excluding the bias units. Depending on how you define the backpropagation algorithm, or depending on how you implement it, you know, you may end up implementing something that computes delta values for these bias units as well. The bias units always output the value of plus one, and they are just what they are, and there's no way for us to change the value. And so, depending on your implementation of back prop, the way I usually implement it. I do end up computing these delta values, but we just discard them, we don't use them. Because they don't end up being part of the calculation needed to compute a derivative.

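So far the δ values have been computed for the hidden units, excluding the bias units. Depending on how backpropagation is defined or implemented, δ values may be computed for the bias units as well. A bias unit always outputs +1, and that value cannot be changed. So even when an implementation computes δ values for the bias units, they are discarded, because they are not part of the calculation needed to compute the derivatives.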

So hopefully that gives you a little better intuition about what backpropagation is doing.

In case all of this still seems sort of magical, sort of black box, in a later video, the "putting it together" video, I'll try to give a little bit more intuition about what backpropagation is doing. But unfortunately this is a difficult algorithm to try to visualize and understand what it is really doing. But fortunately I, and I guess many people, have been using it very successfully for many years. And if you implement the algorithm, you can have a very effective learning algorithm, even though the inner workings of exactly how it works can be harder to visualize.

Hopefully this gives a better understanding of what backpropagation does. If the algorithm still looks like magic or a black box, a later lecture will try to explain it further. Backpropagation is a difficult algorithm to visualize and understand, but many people have used it very successfully for many years. Even without a full picture of its inner workings, implementing the algorithm gives a very effective learning algorithm.

Andrew Ng's Machine Learning video lecture

Summary - Cost Function and Backpropagation in Neural Networks

The backpropagation algorithm is complicated enough to feel like a black box. Compared with linear regression or logistic regression, it is less mathematically clean and simple. Still, those who do the programming exercises can follow the steps of implementing backpropagation mechanically, in order, and get a working neural network.

Forward propagation is performed first. With two features and multiple training examples, it is computed as follows.