by 라인하트 Oct 25. 2020

Andrew Ng's Machine Learning (8-4): Neural Network Model Representation II

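Professor Andrew Ng, co-founder of the online course platform Coursera, is a leading figure in the artificial intelligence field. The lectures he gave at Stanford University to newcomers to machine learning are available for free as an online course on Coursera (Coursera.org). The course is essential for machine learning beginners, and anyone studying artificial intelligence and machine learning on their own will come across it sooner or later.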

Neural Networks: Representation

Neural Network

Model Representation

In the last video, we gave a mathematical definition of how to represent or how to compute the hypotheses used by a neural network. In this video, I'd like to show you how to actually carry out that computation efficiently, that is, show you a vectorized implementation. And second, and more importantly, I want to start giving you intuition about why these neural network representations might be a good idea and how they can help us to learn complex nonlinear hypotheses.

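The previous lecture gave a mathematical definition of how a neural network represents and computes its hypothesis. This lecture shows how to carry out that computation efficiently with a vectorized implementation. It also builds intuition for why the neural network representation is a good idea and how it helps learn complex nonlinear hypotheses.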

Consider this neural network. Previously we said that the sequence of steps that we need in order to compute the output of a hypothesis is these equations given on the left, where we compute the activation values of the three hidden units and then we use those to compute the final output of our hypothesis h of x. Now, I'm going to define a few extra terms.

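Here is the neural network from the previous lecture.

We described the sequence of steps for computing the hypothesis output: three equations give the activation values of the three hidden units, and those values are then used to compute the final output hθ(x). Now let's define a few extra terms.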

So, this term that I'm underlining here, I'm going to define that to be z superscript 2 subscript 1, so that we have that a(2)1, which is this term, is equal to g of z(2)1. And by the way, these superscript 2's, what that means is that the z2 and this a2 as well, the superscript 2 in parentheses means that these are values associated with layer 2, that is, with the hidden layer in the neural network. Now this term here I'm going to similarly define as z(2)2. And finally, this last term here that I'm underlining, let me define that as z(2)3, so that similarly we have a(2)3 equals g of z(2)3. So these z values are just a linear combination, a weighted linear combination, of the input values x0, x1, x2, x3 that go into a particular neuron. Now if you look at this block of numbers, you may notice that that block of numbers corresponds suspiciously closely to the matrix-vector operation, the matrix-vector multiplication of theta 1 times the vector x. Using this observation we're going to be able to vectorize this computation of the neural network.

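The activation of each hidden unit is the sigmoid function g applied to a value z. Values belonging to the hidden layer carry the superscript (2), and the subscript identifies the unit, so each unit in layer 2, the hidden layer, is computed as a^(2)i = g(z^(2)i) for i = 1, 2, 3.

Each z is therefore a linear combination, that is, a weighted linear combination, of the inputs x0, x1, x2, x3 that enter a particular neuron. The block of terms that makes up z^(2) has the same form as a matrix-vector product, Θ^(1) times x, so the computation of the neural network can be vectorized.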
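Written out, the definitions above look as follows. This is a sketch in the course's usual notation, where the element names Θ^(1)_ij (the weight from input j to hidden unit i) are assumed from the previous lecture rather than stated in this text.

```latex
\begin{aligned}
z^{(2)}_1 &= \Theta^{(1)}_{10}x_0 + \Theta^{(1)}_{11}x_1 + \Theta^{(1)}_{12}x_2 + \Theta^{(1)}_{13}x_3, & a^{(2)}_1 &= g\big(z^{(2)}_1\big) \\
z^{(2)}_2 &= \Theta^{(1)}_{20}x_0 + \Theta^{(1)}_{21}x_1 + \Theta^{(1)}_{22}x_2 + \Theta^{(1)}_{23}x_3, & a^{(2)}_2 &= g\big(z^{(2)}_2\big) \\
z^{(2)}_3 &= \Theta^{(1)}_{30}x_0 + \Theta^{(1)}_{31}x_1 + \Theta^{(1)}_{32}x_2 + \Theta^{(1)}_{33}x_3, & a^{(2)}_3 &= g\big(z^{(2)}_3\big)
\end{aligned}
\qquad\text{with}\qquad g(z) = \frac{1}{1 + e^{-z}}
```

Each row of coefficients is one row of the 3×4 matrix Θ^(1), which is why the three sums together are the product Θ^(1) x.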

Concretely, let's define the feature vector x as usual to be the vector of x0, x1, x2, x3, where x0 as usual is always equal to 1, and let's define z2 to be the vector of these z-values, you know, of z(2)1, z(2)2, z(2)3. And notice that z2 is a three-dimensional vector. We can now vectorize the computation of a(2)1, a(2)2, a(2)3 as follows. We can just write this in two steps. We can compute z2 as theta 1 times x, and that would give us this vector z2; and then a2 is g of z2. And just to be clear, z2 here is a three-dimensional vector and a2 is also a three-dimensional vector, and thus this activation g applies the sigmoid function element-wise to each of z2's elements. And by the way, to make our notation a little more consistent with what we'll do later, in this input layer we have the inputs x, but we can also think of these as the activations of the first layer. So, if I define a1 to be equal to x, so that a1 is a vector, I can now take this x here and replace this with z2 equals theta 1 times a1, just by defining a1 to be the activations in my input layer.

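Concretely, the vectorized implementation is as follows.

The feature vector x is in R^(4×1), with x0 = 1. The vector z^(2) is in R^(3×1) and is computed as z^(2) = Θ^(1) x. The hidden-layer activations a^(2)1, a^(2)2, a^(2)3 can be collected into a vector a^(2), with a^(2) = g(z^(2)). Because z^(2) is in R^(3×1), a^(2) is also a three-dimensional vector. Here g() is the activation function: it applies the sigmoid function to each element of z^(2).

To make the notation uniform, the feature vector of layer 1, the input layer, can be written as a^(1). That is, x = a^(1), so a^(1) is a vector, and z^(2) = Θ^(1) x can be replaced by z^(2) = Θ^(1) a^(1).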
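The two vectorized steps can be tried out in a few lines of plain Python. This is only a minimal sketch: the weights in `Theta1` and the input `x` are made-up numbers, and `sigmoid` and `matvec` are hypothetical helper names that do not come from the lecture. With a numerical library the matrix-vector product would typically be a single call.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def matvec(matrix, vector):
    """Matrix-vector product: one weighted sum per row of the matrix."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

# Made-up weights for illustration: 3 hidden units x 4 inputs (including x0).
Theta1 = [
    [0.1, 0.2, -0.3, 0.4],
    [-0.5, 0.6, 0.7, -0.8],
    [0.9, -1.0, 0.2, 0.3],
]

x = [1.0, 2.0, -1.0, 0.5]  # x0 = 1 is the bias input
a1 = x                     # activations of the input layer

z2 = matvec(Theta1, a1)        # z(2) = Theta(1) * a(1), a 3-dimensional vector
a2 = [sigmoid(z) for z in z2]  # g applied element-wise

# The same value for the first hidden unit, written out term by term.
z2_1 = (Theta1[0][0] * x[0] + Theta1[0][1] * x[1]
        + Theta1[0][2] * x[2] + Theta1[0][3] * x[3])

print(z2)
print(a2)
```

The term-by-term value `z2_1` matches `z2[0]`, which is the point of the vectorization: each row of Θ^(1) produces one hidden unit's weighted sum.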

Now, with what I've written so far I've now gotten myself the values for a1, a2, a3, and really I should put the superscripts there as well. But I need one more value, which is I also want this a(2)0, and that corresponds to a bias unit in the hidden layer that goes to the output there. Of course, there was a bias unit here too that, you know, I just didn't draw in here. But to take care of this extra bias unit, what we're going to do is add an extra a0 superscript 2 that's equal to one, and after taking this step we now have that a2 is going to be a four-dimensional feature vector, because we just added this extra, you know, a0 which is equal to 1, corresponding to the bias unit in the hidden layer.

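So far we have the hidden-layer activations a^(2)1, a^(2)2, a^(2)3. One more value is needed: a^(2)0, the bias unit of the hidden layer, which feeds into the output. The input layer has a bias unit as well, even though it is not drawn in the figure. To handle the extra bias unit we add a^(2)0 and set a^(2)0 = 1. With this addition a^(2) becomes a four-dimensional vector.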

And finally, to compute the actual value output of our hypothesis, we then simply need to compute z3. So z3 is equal to this term here that I'm just underlining. This inner term there is z3. And z3 is theta 2 times a2, and finally my hypothesis output h of x, which is a3, that is the activation of my one and only unit in the output layer. So, that's just a real number. You can write it as a3 or as a(3)1, and that's g of z3. This process of computing h of x is also called forward propagation, and is called that because we start off with the activations of the input units, and then we sort of forward-propagate that to the hidden layer and compute the activations of the hidden layer, and then we sort of forward-propagate that and compute the activations of the output layer. This process of computing the activations from the input, then the hidden, then the output layer is called forward propagation, and what we just did is we just worked out a vectorized implementation of this procedure. So, if you implement it using these equations that we have on the right, these would give you an efficient way, or at least a reasonably efficient way, of computing h of x.

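Finally, to compute the actual output value of the hypothesis hθ(x), we compute z^(3) = Θ^(2) a^(2). The hypothesis output is then hθ(x) = a^(3) = g(z^(3)).

a^(3) is the activation of the one and only unit in the output layer, and it is a real number. This way of computing hθ(x) is called forward propagation, because we start with the activations of the input units, propagate them forward to the hidden layer to compute its activations, and then propagate those forward to compute the activation of the output layer. In other words, forward propagation means computing the activations in order, starting from the input layer, passing through the hidden layer, and ending at the output layer. Computing hθ(x) with the vectorized implementation is an efficient way to do this.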
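Putting the steps together, forward propagation for this three-layer network can be sketched as below. It is an illustration under stated assumptions, not the course's reference code: the function name `forward_propagate`, the helpers, and all weight values are hypothetical. The input `x` is passed without its bias entry, and the function adds the bias units a^(1)0 = 1 and a^(2)0 = 1 itself.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def matvec(matrix, vector):
    """Matrix-vector product: one weighted sum per row of the matrix."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def forward_propagate(x, Theta1, Theta2):
    """Return h(x) for a network with one hidden layer and one output unit.

    x      : input features without the bias entry, e.g. [x1, x2, x3]
    Theta1 : 3x4 weights mapping layer 1 to layer 2
    Theta2 : 1x4 weights mapping layer 2 to layer 3
    """
    a1 = [1.0] + list(x)                   # add bias unit a(1)0 = 1
    z2 = matvec(Theta1, a1)                # z(2) = Theta(1) * a(1)
    a2 = [1.0] + [sigmoid(z) for z in z2]  # g(z(2)), then add bias a(2)0 = 1
    z3 = matvec(Theta2, a2)                # z(3) = Theta(2) * a(2)
    a3 = [sigmoid(z) for z in z3]          # a(3) = g(z(3))
    return a3[0]                           # the single output unit

# Made-up weights: with Theta1 all zeros every hidden activation is g(0) = 0.5.
Theta1 = [[0.0] * 4 for _ in range(3)]
Theta2 = [[1.0, 2.0, 0.0, 0.0]]

h = forward_propagate([1.0, 2.0, 3.0], Theta1, Theta2)
print(h)
```

With these particular weights z^(3) = 1·1 + 2·0.5 = 2, so the printed value equals g(2). Real weights would come from training, which this lecture does not cover.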

This forward propagation view also helps us to understand what Neural Networks might be doing and why they might help us to learn interesting nonlinear hypotheses. Consider the following neural network.

The forward propagation view helps explain what a neural network is doing and why it helps learn nonlinear hypotheses. Consider the following neural network.

And let's say I cover up the left part of this picture for now. If you look at what's left in this picture, this looks a lot like logistic regression, where what we're doing is we're using that node, that's just the logistic regression unit, and we're using that to make a prediction h of x. And concretely, what the hypothesis is outputting is h of x is going to be equal to g, which is my sigmoid activation function, applied to theta 0 times a0, where a0 is equal to 1, plus theta 1 times a1 plus theta 2 times a2 plus theta 3 times a3, where the values a1, a2, a3 are those given by these three hidden units. Now, to be actually consistent with my earlier notation, we need to, you know, fill in these superscript 2's here everywhere, and I also have these indices 1 there because I have only one output unit. But if you focus on the blue parts of the notation, this looks awfully like the standard logistic regression model, except that I now have a capital theta instead of lowercase theta. And what this is doing is just logistic regression, but where the features fed into logistic regression are these values computed by the hidden layer. Just to say that again, what this neural network is doing is just like logistic regression, except that rather than using the original features x1, x2, x3, it is using these new features a1, a2, a3. Again, we'll put the superscripts there, you know, to be consistent with the notation.

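Suppose we cover up the layers and units on the left of the network. What remains looks like logistic regression: the remaining output node is a logistic regression unit that computes the prediction hθ(x).

Concretely, the hypothesis output is hθ(x) = g(Θ0 a0 + Θ1 a1 + Θ2 a2 + Θ3 a3), where g() is the sigmoid activation function. To be consistent with the earlier notation, every activation should carry the superscript 2, and the parameters should carry the subscript index 1 because there is only one output unit. Focusing on the parameters Θ and the activations a, the expression looks very much like the standard logistic regression model, except that it uses an uppercase Θ instead of a lowercase θ. The features fed into this logistic regression, however, are the values computed by the hidden layer. In other words, the difference from plain logistic regression is that the network uses the new features a^(2)1, a^(2)2, a^(2)3 rather than the original features x1, x2, x3.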
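With the superscripts and indices filled in, the comparison can be written side by side. This is a sketch in the course's notation; the element names Θ^(2)_1j are assumed from the earlier lectures.

```latex
\begin{aligned}
\text{output unit:}\quad h_\Theta(x) &= a^{(3)}_1 = g\big(\Theta^{(2)}_{10}a^{(2)}_0 + \Theta^{(2)}_{11}a^{(2)}_1 + \Theta^{(2)}_{12}a^{(2)}_2 + \Theta^{(2)}_{13}a^{(2)}_3\big) \\
\text{logistic regression:}\quad h_\theta(x) &= g\big(\theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3\big)
\end{aligned}
```

The two lines have the same form. Only the inputs differ: the output unit sees the hidden activations a^(2) instead of the raw features x.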

And the cool thing about this is that the features a1, a2, a3, they themselves are learned as functions of the input. Concretely, the function mapping from layer 1 to layer 2, that is determined by some other set of parameters, theta 1. So it's as if the neural network, instead of being constrained to feed the features x1, x2, x3 to logistic regression, gets to learn its own features, a1, a2, a3, to feed into the logistic regression. And as you can imagine, depending on what parameters it chooses for theta 1, you can learn some pretty interesting and complex features, and therefore you can end up with a better hypothesis than if you were constrained to use the raw features x1, x2 or x3, or if you were constrained to, say, choose the polynomial terms, you know, x1x2, x2x3, and so on.

But instead, this algorithm has the flexibility to try to learn whatever features it wants, using these a1, a2, a3 in order to feed into this last unit that's essentially a logistic regression here. I realize this example is described at a somewhat high level, and so I'm not sure if this intuition of the neural network, you know, having more complex features will quite make sense yet. But if it doesn't yet, in the next two videos I'm going to go through a specific example of how a neural network can use this hidden layer to compute more complex features to feed into this final output layer, and how that can learn more complex hypotheses. So, in case what I'm saying here doesn't quite make sense, stick with me for the next two videos, and hopefully after working through those examples this explanation will make a little bit more sense.

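The really nice part is that the features a^(2)1, a^(2)2, a^(2)3 are themselves learned as functions of the input. Concretely, the mapping from layer 1 to layer 2 is determined by a separate parameter matrix, Θ^(1). Instead of feeding the features x1, x2, x3 to logistic regression, the neural network feeds it the features a^(2)1, a^(2)2, a^(2)3 that it has learned on its own. Because the network can learn quite interesting and complex features, it can end up with a better hypothesis than one restricted to the raw features x1, x2, x3 or to polynomial terms such as x1x2 and x1x3.

The neural network has the flexibility to learn whatever features it needs, a^(2)1, a^(2)2, a^(2)3, and feed them into the last unit a^(3)1, which is essentially logistic regression. This example was described at a fairly high level, so the intuition that a neural network builds more complex features may not be clear yet. The next two lectures walk through a concrete example of how a network uses its hidden layer to compute more complex features for the output layer and how that lets it learn more complex hypotheses. If the explanation so far does not quite make sense, those two lectures should help.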

But just one more point. You can have neural networks with other types of diagrams as well, and the way that neural networks are connected, that's called the architecture. So the term architecture refers to how the different neurons are connected to each other. This is an example of a different neural network architecture, and once again you may be able to get this intuition of how the second layer, here we have three hidden units that are computing some complex function maybe of the input layer, and then the third layer can take the second layer's features and compute even more complex features in layer three, so that by the time you get to the output layer, layer four, you can have even more complex features than what you are able to compute in layer three, and so get very interesting nonlinear hypotheses. By the way, in a network like this, layer one, this is called an input layer. Layer four is still our output layer, and this network has two hidden layers. So anything that's not an input layer or an output layer is called a hidden layer.

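Here is a different type of neural network. The way a neural network is connected is called its architecture. The second layer has three units that compute some complex function of the input layer. The third layer takes the second layer's features and computes even more complex features, so by the time we reach the fourth layer, the output layer, the network can compute features more complex than those of layer three and produce very interesting nonlinear hypotheses. In a network like this, the first layer is the input layer and the fourth and last layer is the output layer. It has two hidden layers. In other words, every layer that is neither the input layer nor the output layer is a hidden layer.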
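The same forward propagation steps extend to deeper architectures by repeating them once per weight matrix. The sketch below assumes a hypothetical four-layer network with layer sizes 3, 3, 2, 1 (only the three units of layer two are stated in the text, the other sizes are guesses for illustration) and made-up all-zero weights. The names `forward_propagate_layers` and `zero_weights` are illustrative, not from the lecture.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def matvec(matrix, vector):
    """Matrix-vector product: one weighted sum per row of the matrix."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def forward_propagate_layers(x, thetas):
    """Propagate x (without bias entry) through every weight matrix in order."""
    a = list(x)
    for Theta in thetas:
        a = [1.0] + a                  # add this layer's bias unit
        z = matvec(Theta, a)           # weighted sums for the next layer
        a = [sigmoid(v) for v in z]    # activations of the next layer
    return a                           # activations of the output layer

def zero_weights(layer_sizes):
    """Build all-zero matrices of shape s_(j+1) x (s_j + 1) for each layer pair."""
    return [
        [[0.0] * (n_in + 1) for _ in range(n_out)]
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    ]

layer_sizes = [3, 3, 2, 1]           # input, two hidden layers, output (assumed)
thetas = zero_weights(layer_sizes)   # placeholder weights, not trained values

output = forward_propagate_layers([0.2, -0.4, 0.6], thetas)
print(output)
```

Changing `layer_sizes` changes the architecture without touching the propagation loop, which is one reason the layer-by-layer view is convenient.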

So, hopefully from this video you've gotten a sense of how the forward propagation step in a neural network works, where you start from the activations of the input layer and forward propagate that to the first hidden layer, then the second hidden layer, and then finally the output layer. And you also saw how we can vectorize that computation. I realize that some of the intuitions in this video of how, you know, the later layers are computing complex features of the earlier layers, some of that intuition may be still slightly abstract and kind of high level. And so what I would like to do in the next two videos is work through a detailed example of how a neural network can be used to compute nonlinear functions of the input, and I hope that will give you a good sense of the sorts of complex nonlinear hypotheses we can get out of neural networks.

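This lecture covered forward propagation in a neural network: starting from the activations of the input layer, the computation proceeds to the first hidden layer, then the second hidden layer, and finally the output layer. It also covered the vectorized implementation. The intuition about how later layers compute complex features of earlier layers may still feel abstract and high-level. The next two lectures work through a detailed example of how a neural network computes nonlinear functions of the input, which should give a good sense of the kinds of complex nonlinear hypotheses neural networks can represent.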

Andrew Ng's Machine Learning video lecture

Summary

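The following notation denotes the logistic units, that is, the neurons, in each layer.

a^(1)1 : activation of the first unit in layer 1 (layer 1 is the input layer)

a^(2)1 : activation of the first unit in layer 2

a^(3)4 : activation of the fourth unit in layer 3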

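If layer 2, the hidden layer, has three units, the activations are a^(2)i = g(z^(2)i) for i = 1, 2, 3. Here z^(2)i is a linear combination of the inputs x0, x1, x2, x3 entering that neuron, weighted by the parameters Θ.

A neural network thus defines a hypothesis function hθ(x) that takes x as input and predicts y. The hypothesis is controlled by the parameters, or weights, Θ. Computing the activations from the input, then in the hidden layer, and passing them on layer by layer until the output layer is called the forward propagation algorithm.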
