by 라인하트 Oct 30. 2020

Andrew Ng's Machine Learning (9-7): A Complete Summary of Neural Networks

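Professor Andrew Ng, co-founder of the online learning platform Coursera, is a leading figure in the AI industry. The introductory machine learning course he taught at Stanford University is available for free as an online course on Coursera (Coursera.org). It is an essential course for newcomers to machine learning, and one that anyone studying AI and machine learning on their own will naturally come across.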

Neural Networks: Learning

Backpropagation in Practice

Putting It Together (Summary)

So, it's taken us a lot of videos to get through the neural network learning algorithm. In this video, what I'd like to do is try to put all the pieces together, to give an overall summary, or a bigger-picture view, of how all the pieces fit together and of the overall process of how to implement a neural network learning algorithm.

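So far we have covered the neural network learning algorithm. This lecture gathers the scattered pieces one by one and draws the big picture. It explains, in order, how all the pieces of a neural network fit together and how to implement the neural network learning algorithm.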

When training a neural network, the first thing you need to do is pick some network architecture, and by architecture I just mean the connectivity pattern between the neurons. So, you know, we might choose between, say, a neural network with three input units, five hidden units, and four output units; versus one with 3 input units, 5 hidden, 5 hidden, and 4 output units; versus one with 3 input units, 5 units in each of three hidden layers, and four output units. These choices of how many hidden units in each layer and how many hidden layers are architecture choices. So, how do you make these choices?

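The first thing to do when training a neural network is to choose a network architecture. Architecture means the connectivity pattern between the neurons. The left figure shows an architecture with 3 input units, 5 hidden units, and 4 output units. The middle figure has 3 input units, two hidden layers of 5 units each, and 4 output units. The right figure has 3 input units, three hidden layers of 5 units each, and 4 output units. We pick one of these three architectures. In the end, choosing an architecture means deciding the number of hidden layers and the number of hidden units in each layer. So how should we choose?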

Well, first, the number of input units is pretty well defined. Once you decide on the fixed set of features x, the number of input units will just be the dimension of your features x(i). And if you are doing multiclass classification, the number of output units will be determined by the number of classes in your classification problem. And just a reminder: if you have a multiclass classification problem where y takes on, say, values between 1 and 10, so that you have ten possible classes, then remember to write your output y as these vectors. So instead of class one, you recode it as a vector like that, or for the second class you recode it as a vector like that. So if one of these examples takes on the fifth class, you know, y equals 5, then what you're showing to your neural network is not actually a value of y equals 5; instead, at the output layer, which would have ten output units, you will feed it the vector with a one in the fifth position and a bunch of zeros elsewhere. So the choice of the number of input units and the number of output units is reasonably straightforward.

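First, decide the number of input units. It equals the number of features x, that is, the dimension of a training example x^(i). Second, decide the number of output units. In a multiclass classification problem, the number of output units is the number of classes. If y takes one of the values from 1 to 10, there are 10 classes. The output y is a vector. In logistic regression, the class label y is a single value among 1, 2, 3, ..., 10, but in a neural network it is a 10 × 1 vector. For example, class 1 is [1; 0; 0; ...; 0], class 2 is [0; 1; 0; ...; 0], class 3 is [0; 0; 1; ...; 0], and finally class 10 is [0; 0; 0; ...; 1]. In an Octave program, the values of y are recoded as vectors in this way. Deciding the number of input units and output units is straightforward.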
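The recoding of y described above can be sketched in a few lines. This is a minimal illustration in Python rather than the Octave used in the course; the helper name `one_hot` is hypothetical, and labels are assumed to run from 1 to the number of classes.

```python
def one_hot(label, num_classes):
    """Recode a class label in 1..num_classes as a 0/1 vector (a list)."""
    if not 1 <= label <= num_classes:
        raise ValueError("label must be between 1 and num_classes")
    vector = [0] * num_classes
    vector[label - 1] = 1  # labels are 1-based, list indices are 0-based
    return vector


# An example whose class is 5, in a 10-class problem.
y_fifth_class = one_hot(5, 10)
print(y_fifth_class)
```

The network is then trained against this vector, not against the scalar 5.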

And as for the number of hidden units and the number of hidden layers, a reasonable default is to use a single hidden layer, and so this type of neural network shown on the left, with just one hidden layer, is probably the most common. Or if you use more than one hidden layer, again a reasonable default is to have the same number of hidden units in every single layer. So here we have two hidden layers, and each of these hidden layers has the same number, five, of hidden units, and here we have three hidden layers, and each of them has the same number, that is, five hidden units. But this sort of network architecture on the left would be a perfectly reasonable default. And as for the number of hidden units: usually, the more hidden units the better; it's just that if you have a lot of hidden units, it can become more computationally expensive, but very often, having more hidden units is a good thing. And usually the number of hidden units in each layer will be comparable to the dimension of x, comparable to the number of features, or it could be anywhere from the same number of hidden units as input features to maybe three or four times that. So having a number of hidden units that is comparable to, or several times bigger than, the number of input features is often a useful thing to do.

So, hopefully this gives you one reasonable set of default choices for neural network architecture, and if you follow these guidelines, you will probably get something that works well. In a later set of videos, where I will talk specifically about advice for how to apply algorithms, I will say a lot more about how to choose a neural network architecture. I actually have quite a lot to say later about how to make good choices for the number of hidden units, the number of hidden layers, and so on.

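Finally, decide the number of hidden layers and the number of hidden units. The default number of hidden layers is one. If you use two or more hidden layers, keep the number of hidden units the same in every hidden layer. For example, the middle figure has two hidden layers with 5 hidden units each, and the right figure has three hidden layers with 5 hidden units each. The left figure is the most reasonable default. In general, the more hidden units the better, although the computational cost grows with the number of hidden units. The number of hidden units in each layer can be comparable to the number of features, that is, the dimension of the vector x, or it can be three or four times the number of input features. It is often useful for the number of hidden units to be larger than the number of input features.

This is the basic way to choose an architecture for a neural network. If you follow these guidelines when designing the architecture, the network will generally work well. How to choose a neural network architecture is explained in more detail later. There are quite a few factors to consider when deciding the number of hidden units and the number of hidden layers.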
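To make the cost of each architecture choice concrete, the sketch below counts the weights in the three example networks. It assumes the convention from the earlier lectures in this course that the matrix mapping layer j to layer j+1 has s_(j+1) rows and s_j + 1 columns, the extra column being for the bias unit. The function names are hypothetical.

```python
def theta_shapes(layer_sizes):
    """Return (rows, cols) of each weight matrix for the given layer sizes.

    Assumed convention: the matrix from layer j to layer j+1 has
    s_(j+1) rows and s_j + 1 columns (one extra column for the bias unit).
    """
    return [(n_out, n_in + 1)
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]


def count_parameters(layer_sizes):
    """Total number of weights across all layers."""
    return sum(rows * cols for rows, cols in theta_shapes(layer_sizes))


# The three architectures from the figure: one, two, and three hidden layers.
for sizes in ([3, 5, 4], [3, 5, 5, 4], [3, 5, 5, 5, 4]):
    print(sizes, theta_shapes(sizes), count_parameters(sizes))
```

Each additional hidden layer of 5 units adds a 5 × 6 matrix, which is one way to see why more hidden units and layers cost more computation.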

Next, here's what we need to implement in order to train a neural network. There are actually six steps that I have; I have four on this slide and two more steps on the next slide. The first step is to set up the neural network and to randomly initialize the values of the weights. And we usually initialize the weights to small values near zero. Then we implement forward propagation so that we can input any x(i) into the neural network and compute h of x, which is this output vector of the y values. We then also implement code to compute this cost function J of theta. And next we implement back-prop, or the back-propagation algorithm, to compute these partial derivative terms, the partial derivatives of J of theta with respect to the parameters.

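Next, training a neural network takes six implementation steps. Four are on this slide and two more are on the next slide.

Step 1: randomly initialize the weight parameters. Initialize the weights to small values close to 0.

Step 2: implement forward propagation, so that x^(i) can be fed into the network and the output vector hθ(x^(i)) computed.

Step 3: write the code that computes the cost function J(θ).

Step 4: implement the backpropagation algorithm to compute the partial derivatives of the cost function J(θ) with respect to the parameters.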
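Steps 1 to 3 above can be sketched as follows. This is a minimal pure-Python illustration, not the course's Octave code. It assumes sigmoid activations and the regularized cross-entropy cost from the earlier lectures; the initialization range `epsilon=0.12`, the function names, and the toy data are assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def random_weights(n_in, n_out, epsilon=0.12, rng=random):
    """Step 1: an n_out x (n_in + 1) matrix of small values in [-epsilon, epsilon]."""
    return [[rng.uniform(-epsilon, epsilon) for _ in range(n_in + 1)]
            for _ in range(n_out)]


def forward(thetas, x):
    """Step 2: propagate x through every layer and return h(x)."""
    activation = list(x)
    for theta in thetas:
        with_bias = [1.0] + activation  # prepend the bias unit
        activation = [sigmoid(sum(w * a for w, a in zip(row, with_bias)))
                      for row in theta]
    return activation


def cost(thetas, X, Y, lam=0.0):
    """Step 3: regularized cross-entropy cost J(theta)."""
    m = len(X)
    total = 0.0
    for x, y in zip(X, Y):
        h = forward(thetas, x)
        total -= sum(yk * math.log(hk) + (1 - yk) * math.log(1 - hk)
                     for yk, hk in zip(y, h))
    # Regularize every weight except the bias column (index 0).
    penalty = sum(w * w for theta in thetas for row in theta for w in row[1:])
    return total / m + lam * penalty / (2 * m)


# Hypothetical toy problem: 3 inputs, 5 hidden units, 4 classes, 2 examples.
rng = random.Random(0)
thetas = [random_weights(3, 5, rng=rng), random_weights(5, 4, rng=rng)]
X = [[0.5, -1.0, 2.0], [1.5, 0.3, -0.7]]
Y = [[1, 0, 0, 0], [0, 0, 1, 0]]  # labels already recoded as 0/1 vectors
print(forward(thetas, X[0]))
print(cost(thetas, X, Y, lam=1.0))
```

With weights this close to zero, every output sits near 0.5, so the initial cost is high and training has to bring it down.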

Concretely, to implement back prop, usually we will do that with a for loop over the training examples. Some of you may have heard of advanced, and frankly very advanced, vectorization methods where you don't have a for loop over the m training examples, but the first time you're implementing back prop there should almost certainly be a for loop in your code, where you're iterating over the examples, you know, (x1, y1), so you do forward prop and back prop on the first example, and then in the second iteration of the for loop, you do forward propagation and back propagation on the second example, and so on, until you get through the final example. So there should be a for loop in your implementation of back prop, at least the first time implementing it. And then there are frankly somewhat complicated ways to do this without a for loop, but I definitely do not recommend trying to do that much more complicated version the first time you try to implement back prop.

So concretely, we have a for loop over my m training examples, and inside the for loop we're going to perform forward prop and back prop using just this one example. And what that means is that we're going to take x(i), feed that to my input layer, perform forward prop, perform back prop, and that will give me all of these activations and all of these delta terms for all of the layers of all my units in the neural network. Then, still inside this for loop (let me draw some curly braces just to show the scope of the for loop; this isn't Octave code, of course, but more like C++ or Java code, and the for loop encompasses all this), we're going to compute those Delta terms, which is the formula that we gave earlier: Delta(l) plus delta(l+1) times a(l) transpose. And then finally, outside the for loop, having computed these Delta terms, these accumulation terms, we would then have some other code that will allow us to compute these partial derivative terms. And these partial derivative terms have to take into account the regularization term lambda as well. Those formulas were given in the earlier video. So, having done that, you now hopefully have code to compute these partial derivative terms.

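Let's look more closely at how to implement backpropagation. We use a for loop over the training examples. Some of you may know advanced vectorization methods that avoid a for loop over the m training examples. Still, the first time you implement backpropagation, your code should almost certainly contain a for loop. In the first iteration, forward propagation and backpropagation run on the first training example (x^(1), y^(1)). In the next iteration, they run on the second training example (x^(2), y^(2)), and so on through the last training example (x^(m), y^(m)). At least the first time you implement backpropagation, there should be a for loop. There are complicated ways to do it without a for loop, but they are not recommended for a first implementation.

Concretely, there is a for loop over the m training examples. Inside the loop, forward propagation and backpropagation are performed on a single training example: x^(i) is fed to the input layer, and forward propagation and backpropagation are run. This computes all the activations a^(l) and all the δ terms in the network for each training example x^(i). The body of the for loop, the scope marked by { }, holds the code that performs forward propagation and backpropagation and accumulates the Δ terms with the formula given earlier: Δ^(l) := Δ^(l) + δ^(l+1) (a^(l))^T.

After the for loop, the partial derivative terms are computed from the accumulated Δ terms, taking the regularization term lambda (λ) into account. The formulas for the partial derivative terms were given in the previous lecture.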
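The loop structure described above might look like the following sketch. It is written in pure Python rather than Octave and assumes sigmoid units, the cross-entropy cost (so the output-layer error is h − y), and no regularization of the bias column, as in the earlier lectures. The function name and the tiny network are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def backprop_gradients(thetas, X, Y, lam=0.0):
    """Return dJ/dTheta for every weight matrix, looping over the examples."""
    m = len(X)
    # Accumulators Delta^(l), one per weight matrix, initialised to zero.
    Deltas = [[[0.0] * len(row) for row in theta] for theta in thetas]

    for x, y in zip(X, Y):  # for i = 1..m
        # Forward propagation, keeping each layer's input (with bias unit).
        activations = []
        a = list(x)
        for theta in thetas:
            a = [1.0] + a
            activations.append(a)
            a = [sigmoid(sum(w * v for w, v in zip(row, a))) for row in theta]

        # Backpropagation: output-layer error first, then earlier layers.
        delta = [h - t for h, t in zip(a, y)]
        for l in range(len(thetas) - 1, -1, -1):
            a_l = activations[l]
            # Delta^(l) := Delta^(l) + delta^(l+1) * (a^(l))^T
            for j, d in enumerate(delta):
                for k, v in enumerate(a_l):
                    Deltas[l][j][k] += d * v
            if l > 0:
                # delta^(l) = (Theta^(l))^T delta^(l+1) .* a^(l) .* (1 - a^(l)),
                # skipping the bias component (index 0).
                delta = [sum(thetas[l][j][k] * delta[j] for j in range(len(delta)))
                         * a_l[k] * (1.0 - a_l[k])
                         for k in range(1, len(a_l))]

    # Outside the loop: average, then add regularization (not on the bias column).
    grads = []
    for theta, Delta in zip(thetas, Deltas):
        grads.append([[Delta[j][k] / m + (lam / m * theta[j][k] if k > 0 else 0.0)
                       for k in range(len(row))]
                      for j, row in enumerate(theta)])
    return grads


# Tiny hypothetical network (2 inputs, 2 hidden units, 1 output) and data.
thetas = [[[0.1, -0.2, 0.3], [-0.3, 0.2, 0.1]], [[0.2, -0.1, 0.3]]]
X = [[1.0, 0.5], [-0.5, 2.0]]
Y = [[1], [0]]
grads = backprop_gradients(thetas, X, Y, lam=1.0)
print(grads)
```

The explicit loops are slow but mirror the lecture's advice: get a per-example version working first, and only then consider a vectorized one.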

Next is step five: what I do is then use gradient checking to compare these partial derivative terms that were computed. So, I compare the versions computed using back propagation versus the partial derivatives computed using numerical estimates of the derivatives. So, I do gradient checking to make sure that both of these give very similar values. Having done gradient checking reassures us that our implementation of back propagation is correct, and it is then very important that we disable gradient checking, because the gradient checking code is computationally very slow.

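Step 5: use gradient checking to compare the partial derivatives computed by backpropagation with numerical estimates of the derivatives. Confirming that the two results are very similar verifies that the backpropagation implementation is correct. Once gradient checking is done, it is very important to disable it: the gradient checking code is computationally very slow and would drag down training.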
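A minimal sketch of the comparison in step 5, using a two-sided finite difference. To keep the example short, a small hypothetical function stands in for J(θ) and its hand-derived gradient stands in for backpropagation; the step size `epsilon=1e-4` is likewise an assumption.

```python
def numerical_gradient(J, theta, epsilon=1e-4):
    """Two-sided finite-difference estimate of dJ/dtheta_i for each i."""
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += epsilon
        minus[i] -= epsilon
        grad.append((J(plus) - J(minus)) / (2 * epsilon))
    return grad


# Hypothetical stand-in for the cost function J(theta).
def J(theta):
    return theta[0] ** 2 + 3 * theta[0] * theta[1] + theta[1] ** 3


# Hand-derived gradient of J, playing the role of the backpropagation output.
def analytic_gradient(theta):
    return [2 * theta[0] + 3 * theta[1], 3 * theta[0] + 3 * theta[1] ** 2]


theta = [1.0, -2.0]
approx = numerical_gradient(J, theta)
exact = analytic_gradient(theta)
max_diff = max(abs(a - e) for a, e in zip(approx, exact))
print(approx, exact, max_diff)
```

Each estimate needs two evaluations of J per parameter, which is why the check is run once for verification and then switched off before training.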

And finally, we then use an optimization algorithm such as gradient descent, or one of the advanced optimization methods such as L-BFGS or conjugate gradient, as embodied in fminunc, or other optimization methods. We use these together with back propagation, so back propagation is the thing that computes these partial derivatives for us. And so, we know how to compute the cost function, we know how to compute the partial derivatives using back propagation, so we can use one of these optimization methods to try to minimize J of theta as a function of the parameters theta.

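Step 6: to minimize the cost function J(Θ), use an optimization algorithm such as gradient descent, or use an advanced optimization algorithm through the fminunc() function. The backpropagation algorithm computes the partial derivative terms. We have learned how to compute the cost function and how to compute the derivative terms with backpropagation, so we can use one of these optimization methods to minimize J(Θ) as a function of the parameters Θ.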
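The optimizer in step 6 only needs a way to obtain the gradient. The sketch below shows plain gradient descent in Python; a simple convex function is used as a hypothetical stand-in for J(Θ), and the learning rate and iteration count are illustrative assumptions. In a real network, the `gradient` argument would be the backpropagation routine.

```python
def gradient_descent(gradient, theta, alpha=0.1, iterations=200):
    """Repeat theta := theta - alpha * gradient(theta)."""
    theta = list(theta)
    for _ in range(iterations):
        g = gradient(theta)
        theta = [t - alpha * gi for t, gi in zip(theta, g)]
    return theta


# Hypothetical stand-in cost: J(theta) = (theta0 - 3)^2 + (theta1 + 1)^2.
def J(theta):
    return (theta[0] - 3) ** 2 + (theta[1] + 1) ** 2


def gradient(theta):
    return [2 * (theta[0] - 3), 2 * (theta[1] + 1)]


theta = gradient_descent(gradient, [0.0, 0.0])
print(theta, J(theta))
```

Advanced optimizers such as the ones behind fminunc follow the same contract (a cost and its gradient in, parameters out) but choose the step sizes themselves.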

And by the way, for neural networks, this cost function J of theta is non-convex, and so it can theoretically be susceptible to local minima. In fact, algorithms like gradient descent and the advanced optimization methods can, in theory, get stuck in local optima, but it turns out that in practice this is not usually a huge problem. Even though we can't guarantee that these algorithms will find a global optimum, usually algorithms like gradient descent will do a very good job minimizing this cost function J of theta and get a very good local minimum, even if it doesn't get to the global optimum.

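By the way, for neural networks the cost function J(Θ) of the parameter matrices Θ is non-convex. In theory, gradient descent and the advanced optimization methods can get stuck in a local minimum of J(Θ), but in practice this is usually not a big problem. Even though these methods are not guaranteed to find the global optimum, gradient descent generally does a good job of minimizing J(Θ) and reaches a very good local minimum.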
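The effect of non-convexity can be seen on a one-parameter toy example. The function below is a hypothetical stand-in chosen because it has two minima; it is not a neural network cost. Gradient descent started from two different points settles in two different minima.

```python
def f(t):
    """A non-convex function of one parameter with two minima."""
    return t ** 4 - 3 * t ** 2 + t


def f_prime(t):
    return 4 * t ** 3 - 6 * t + 1


def descend(start, alpha=0.01, iterations=1000):
    """Plain gradient descent on f from the given starting point."""
    t = start
    for _ in range(iterations):
        t -= alpha * f_prime(t)
    return t


left = descend(-2.0)   # ends in the deeper minimum on the left
right = descend(2.0)   # ends in the shallower minimum on the right
print(left, f(left))
print(right, f(right))
```

Which minimum is reached depends only on the starting point, which is one reason random initialization matters; as the lecture notes, the minimum found is usually good enough in practice.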

Finally, gradient descent for a neural network might still seem a little bit magical. So, let me just show one more figure to try to give that intuition about what gradient descent for a neural network is doing. This is actually similar to the figure that I was using earlier to explain gradient descent. So, we have some cost function, and we have a number of parameters in our neural network. Here I've just written down two of the parameter values. In reality, of course, in the neural network, we can have lots of parameters. Theta one, theta two: all of these are matrices, right? So we can have very high-dimensional parameters, but because of the limitations on the sorts of plots we can draw, I'm pretending that we have only two parameters in this neural network, although obviously we have a lot more in practice.

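Finally, gradient descent for a neural network may still seem like magic. To build some intuition about what it is doing, here is one more figure. It is similar to the figure used in earlier lectures to explain gradient descent. There is a cost function J(Θ) and there are parameters Θ1 and Θ2. In reality, a neural network can have a great many parameters, and Θ1 and Θ2 are both matrices. The parameters can be very high-dimensional, but there is a limit to what can be drawn, so the figure pretends that the network has only two parameters even though it has far more.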

Now, this cost function J of theta measures how well the neural network fits the training data. So, if you take a point like this one, down here, that's a point where J of theta is pretty low, and so this corresponds to a setting of the parameters theta where, for most of the training examples, the output of my hypothesis may be pretty close to y(i), and if this is true then that's what causes my cost function to be pretty low. Whereas in contrast, if you were to take a value like that, a point like that corresponds to where, for many training examples, the output of my neural network is far from the actual value y(i) that was observed in the training set. So points like this correspond to where the hypothesis, where the neural network, is outputting values on the training set that are far from y(i). So, it's not fitting the training set well, whereas points like this with low values of the cost function correspond to where J of theta is low, and therefore to where the neural network happens to be fitting my training set well, because this is what needs to be true in order for J of theta to be small.

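The cost function J(Θ) of the parameter matrices Θ measures how well the neural network fits the training data set. The point at the very bottom is a point where J(Θ) is quite low. It corresponds to a setting of the parameters Θ for which, on most training examples, the hypothesis output hΘ(x^(i)) is very close to y^(i), so the cost function takes a low value. In contrast, a point at the top of the hill corresponds to parameters for which, on most training examples, hΘ(x^(i)) is far from y^(i). So points at the top of the hill fit the training set poorly and the lowest points fit it well. This is what has to hold for J(Θ) to be small.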

So what gradient descent does is we'll start from some random initial point like that one over there, and it will repeatedly go downhill. And so what back propagation is doing is computing the direction of the gradient, and what gradient descent is doing is it's taking little steps downhill until hopefully it gets to, in this case, a pretty good local optimum. So, when you implement back propagation and use gradient descent or one of the advanced optimization methods, this picture sort of explains what the algorithm is doing. It's trying to find a value of the parameters where the output values in the neural network closely matches the values of the y(i)'s observed in your training set. So, hopefully this gives you a better sense of how the many different pieces of neural network learning fit together.

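What gradient descent does is start from a random initial point and repeatedly go downhill. What the backpropagation algorithm does is compute the direction of the gradient. Gradient descent then takes small steps downhill until, hopefully, it reaches a pretty good local optimum. So when you implement backpropagation and use gradient descent or one of the advanced optimization methods, this figure explains what the algorithm is doing: it looks for parameter values for which the outputs of the neural network closely match the y^(i) values observed in the training set. Hopefully this gives a better sense of how the many pieces of neural network learning fit together.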

In case, even after this video, you still feel like there are a lot of different pieces and it's not entirely clear what some of them do or how all of these pieces come together, that's actually okay. Neural network learning and back propagation is a complicated algorithm. And even though I've seen the math behind back propagation for many years and I've used back propagation, I think very successfully, for many years, even today I still feel like I don't always have a great grasp of exactly what back propagation is doing, and of what the optimization process of minimizing J of theta looks like. This is a much harder algorithm, and I feel like I have a much less good handle on exactly what it is doing compared to, say, linear regression or logistic regression, which were mathematically and conceptually much simpler and much cleaner algorithms.

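Even after this lecture, you may feel that there are still many different pieces, and it may not be clear what some of them do or how they all come together. That is okay. Neural network learning and the backpropagation algorithm are complicated. Even after years of seeing the math behind backpropagation and using it, the lecturer still does not always feel he has a firm grasp of exactly what backpropagation is doing, or of what the optimization process of minimizing J(θ) looks like. It is a much harder algorithm to get a handle on than linear regression or logistic regression, which are mathematically and conceptually much simpler and cleaner.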

But in case you feel the same way, you know, that's actually perfectly okay. If you do implement back propagation, hopefully what you find is that this is one of the most powerful learning algorithms, and if you implement this algorithm, implement back propagation, implement one of these optimization methods, you'll find that back propagation will be able to fit very complex, powerful, non-linear functions to your data, and this is one of the most effective learning algorithms we have today.

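It is fine if you feel the same way. But if you implement backpropagation, you will find that it is a very powerful algorithm. Backpropagation can fit very complex, powerful non-linear functions to your data. It is one of the most effective learning algorithms we have today.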

Andrew Ng's Machine Learning video lecture

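Wrapping Up

Architecture means the connectivity pattern between the neurons. Choosing an architecture means deciding the number of hidden layers and the number of hidden units in each layer.

Number of input units: the number of features x, that is, the dimension of the feature vector x^(i)

Number of output units: the number of classes (for example, 10 classes for a neural network that distinguishes digits)

The output is expressed as a vector. For example, if the neural network selects the 5th class, y = [0; 0; 0; 0; 1; 0; 0; 0; 0; 0].

By default, use one hidden layer; if you use two or more hidden layers, keep the number of hidden units the same in every hidden layer. In general, the more hidden units the better. How to decide the number of hidden layers and hidden units is covered in more detail later.

Number of hidden layers: chosen by testing and picking what gives good results

Number of hidden units: the same as the number of features, or three or four times that number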

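The steps needed to train a neural network are as follows.

1) Design the neural network and randomly initialize the weights (to values close to 0)

2) Implement forward propagation, which feeds x^(i) into the input layer and computes the output vector hθ(x^(i))

3) Write the code that computes the cost function J(θ)

4) Implement the backpropagation algorithm to compute the partial derivatives of J(θ)

5) Use gradient checking to compare numerical estimates against the partial derivative terms from backpropagation (similar values mean the implementation is correct), then disable gradient checking

6) Use gradient descent or an advanced optimization algorithm to minimize J(θ)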

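For a neural network, the cost function J(θ) of the parameters θ is non-convex, so finding the global optimum is not guaranteed, but in practice gradient descent usually reaches a very good local minimum. As when learning gradient descent for linear regression, if we pretend there are only two parameters we can draw a 3D graph and watch the algorithm descend in repeated steps. In the end, the backpropagation algorithm computes the direction of the gradient, and gradient descent follows it downhill.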