by 라인하트 Oct 29. 2020

Andrew Ng's Machine Learning (9-6): Random Initialization for Backpropagation

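Professor Andrew Ng, co-founder of the online course platform Coursera, is a leading figure in the AI field. The lectures he gave to machine learning beginners at Stanford University are available for free on Coursera (Coursera.org). The course is close to required material for newcomers to machine learning, and anyone studying AI and machine learning on their own will run into it sooner or later.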

Neural Networks: Learning

Backpropagation in Practice

Random Initialization

In the previous video, we put together almost all the pieces you need in order to implement and train a neural network. There's just one last idea I need to share with you, which is the idea of random initialization.

The previous lecture covered almost everything needed to implement and train a neural network. The last remaining piece is random initialization.

When you're running an algorithm like gradient descent, or one of the advanced optimization algorithms, you need to pick some initial value for the parameters theta. The advanced optimization algorithms assume you will pass them some initial value for the parameters theta. Now let's consider gradient descent. For that, we also need to initialize theta to something, and then we can slowly take steps downhill using gradient descent to minimize the function J(theta). So what can we set the initial value of theta to? Is it possible to set the initial value of theta to the vector of all zeros? Whereas this worked okay when we were using logistic regression, initializing all of your parameters to zero actually does not work when you are training a neural network.

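Both gradient descent and the advanced optimization algorithms need an initial value for the parameters θ. The advanced optimization algorithms assume that an initial value of θ is passed in. Gradient descent likewise starts from θ initialized to some value and then takes small steps downhill, minimizing the cost function J(θ) as it goes. So what should the initial value of θ be? Can θ be initialized to a vector of all zeros? That worked fine for logistic regression, but initializing every parameter to 0 does not work for a neural network.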
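To make the logistic regression case concrete, here is a minimal Octave sketch that runs gradient descent starting from an all-zero theta. The toy data, the learning rate alpha = 0.1, and the iteration count are assumptions chosen for this illustration and do not come from the lecture.

```octave
% Toy data (made-up values): a bias column plus one feature.
X = [1 0.5; 1 1.5; 1 3.0; 1 4.0];
y = [0; 0; 1; 1];
m = size(X, 1);

sigmoid = @(z) 1 ./ (1 + exp(-z));
% Logistic regression cost J(theta), averaged over the m examples.
cost = @(theta) mean(-y .* log(sigmoid(X * theta)) ...
                     - (1 - y) .* log(1 - sigmoid(X * theta)));

theta = zeros(2, 1);   % all-zero start: acceptable for logistic regression
alpha = 0.1;           % learning rate (assumed value)
initial_cost = cost(theta);

for iter = 1:500
  grad = X' * (sigmoid(X * theta) - y) / m;   % gradient of J(theta)
  theta = theta - alpha * grad;               % one downhill step
end

final_cost = cost(theta);
fprintf('cost at zeros: %.4f, cost after training: %.4f\n', initial_cost, final_cost);
```

At theta = 0 every prediction is 0.5, so the starting cost is log(2). Because there are no hidden units, there is no pair of units that could end up stuck in the same state, which is why the all-zero start causes no trouble here.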

Consider training the following neural network, and let's say we initialize all the parameters of the network to 0. If you do that, then at initialization this weight, colored in blue, is going to equal that weight, so they're both 0. And this weight that I'm coloring in red is equal to that weight, colored in red, and also this weight, which I'm coloring in green, is going to equal the value of that weight. What that means is that both of your hidden units, a1 and a2, are going to be computing the same function of your inputs. Thus, for every one of your training examples, you end up with a^(2)_1 = a^(2)_2. Moreover, I'm not going to show this in too much detail, but because these outgoing weights are the same, you can also show that the delta values are going to be the same. So concretely you end up with delta^(2)_1 = delta^(2)_2, and if you work through the math further, what you can show is that the partial derivatives with respect to your parameters will satisfy the following: the partial derivatives of the cost function with respect to these two blue weights in your network are going to be equal to each other.

And so what this means is that even after, say, one gradient descent update, you're going to update this first blue weight with some learning rate times this, and you're going to update the second blue weight with some learning rate times this. So even after one gradient descent update, those two blue weights will end up the same as each other. There'll be some nonzero value, but this value will equal that value. Similarly, even after one gradient descent update, the two red weights will still be some nonzero values, just equal to each other. And similarly the two green weights: they'll both change values, but they'll both end up with the same value as each other. So after each update, the parameters corresponding to the inputs going into each of the two hidden units are identical. That's just saying that the two green weights are still the same, the two red weights are still the same, and the two blue weights are still the same. What that means is that even after one iteration of, say, gradient descent, you find that your two hidden units are still computing exactly the same function of the inputs. You still have a^(2)_1 = a^(2)_2, and so you're back to this case. As you keep running gradient descent, the two blue weights will stay the same as each other, the two red weights will stay the same as each other, and the two green weights will stay the same as each other.

And what this means is that your neural network really cannot compute very interesting functions. Imagine that you had not only two hidden units, but many, many hidden units. Then what this is saying is that all of your hidden units are computing the exact same feature. All of your hidden units are computing the exact same function of the input. This is a highly redundant representation, because the final logistic regression unit really gets to see only one feature, since all of these are the same. And this prevents your neural network from doing something interesting. In order to get around this problem, the way we initialize the parameters of a neural network is therefore with random initialization.

Consider a neural network with two hidden units, a^(2)_1 and a^(2)_2.

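Suppose every weight parameter Θ of the network is initialized to 0, so all weights are equal. The hidden units a^(2)_1 and a^(2)_2 then apply the activation function to the same inputs with the same weights, so they take the same value: a^(2)_1 = a^(2)_2 for every training example. Without going into detail, because the outgoing weights are equal, the error terms are also equal, δ^(2)_1 = δ^(2)_2. As a result, the partial derivatives of the cost function with respect to the matching weights of the two hidden units are equal as well.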

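A gradient descent update subtracts the learning rate α times the partial derivative of J(Θ) with respect to Θ^(1)_10 from Θ^(1)_10, and α times the partial derivative of J(Θ) with respect to Θ^(1)_20 from Θ^(1)_20. These two partial derivatives are equal, so whether Θ^(1)_10 and Θ^(1)_20 start at 0 or at any other shared value, they are still equal after the update. The same holds for Θ^(1)_11 and Θ^(1)_21, and for Θ^(1)_12 and Θ^(1)_22: they may become nonzero, but each pair keeps a common value. The hidden units a^(2)_1 and a^(2)_2 therefore continue to output the same value.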

No matter how many gradient descent updates are run, the weights feeding the two hidden units stay identical to each other, so training does not work properly.

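A network in this state cannot compute very interesting functions. Imagine there are not two hidden units but a great many. Every hidden unit computes exactly the same feature: each one takes the same inputs and evaluates the same logistic function. It is as if many copies of a single hidden unit were stacked on top of each other, so the output unit effectively sees only one feature. To get around this problem, the parameters of a neural network must be initialized randomly.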
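The symmetry argument can be checked numerically. The Octave sketch below trains a 2-input, 2-hidden-unit, 1-output network from all-zero weights and then measures how far apart the two rows of Theta1 (the weights into each hidden unit) have drifted. The toy data set, the learning rate alpha = 0.5, and the iteration count are assumptions made for this illustration.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Toy training set (made-up values): 4 examples, 2 features each.
X = [0 0; 0 1; 1 0; 1 1];
y = [0; 1; 1; 1];
m = size(X, 1);
A1 = [ones(m, 1) X];    % inputs with the bias column, m x 3

% Every weight starts at zero.
Theta1 = zeros(2, 3);   % weights into the 2 hidden units (bias column first)
Theta2 = zeros(1, 3);   % weights into the output unit (bias column first)
alpha = 0.5;            % learning rate (assumed value)

for iter = 1:200
  % Forward propagation over all examples at once.
  A2 = sigmoid(A1 * Theta1');     % hidden activations, m x 2
  A2b = [ones(m, 1) A2];          % add the bias unit, m x 3
  A3 = sigmoid(A2b * Theta2');    % network output, m x 1

  % Backpropagation for the cross-entropy cost.
  D3 = A3 - y;                                      % output error, m x 1
  D2 = (D3 * Theta2(:, 2:end)) .* A2 .* (1 - A2);   % hidden error, m x 2
  Theta1_grad = (D2' * A1) / m;                     % 2 x 3
  Theta2_grad = (D3' * A2b) / m;                    % 1 x 3

  % Gradient descent update.
  Theta1 = Theta1 - alpha * Theta1_grad;
  Theta2 = Theta2 - alpha * Theta2_grad;
end

row_gap = max(abs(Theta1(1, :) - Theta1(2, :)));   % gap between the weight rows
act_gap = max(abs(A2(:, 1) - A2(:, 2)));           % gap between hidden activations
fprintf('largest gap between the two rows of Theta1: %g\n', row_gap);
fprintf('largest gap between the two hidden activations: %g\n', act_gap);
```

The weights do move away from zero during training, but both gaps stay at zero up to floating-point rounding: the two hidden units never stop being copies of each other.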

Concretely, the problem we saw on the previous slide is something called the problem of symmetric weights, that is, the weights all being the same. Random initialization is how we perform symmetry breaking. What we do is initialize each value of theta to a random number between minus epsilon and epsilon. So this is notation to denote numbers between minus epsilon and plus epsilon, and the weights, my parameters, are all going to be randomly initialized between minus epsilon and plus epsilon. The way I write code to do this in Octave is to set Theta1 equal to this. This rand(10,11) is how you compute a random 10 by 11 dimensional matrix. All the values are between 0 and 1, so these are going to be real numbers that take on any continuous values between 0 and 1. If you take a number between zero and one, multiply it by two times INIT_EPSILON, and then subtract INIT_EPSILON, you end up with a number that's between minus epsilon and plus epsilon. Incidentally, this epsilon here has nothing to do with the epsilon that we were using when we were doing gradient checking. In numerical gradient checking, we were adding some values of epsilon to theta. This is an unrelated value of epsilon. We just write INIT_EPSILON to distinguish it from the value of epsilon we were using in gradient checking. And similarly, if you want to initialize Theta2 to a random 1 by 11 matrix, you can do so using this piece of code here.

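This is called the problem of symmetric weights, and random initialization is how the symmetry is broken. To break the symmetry, each parameter θ is initialized to a random number between -ε and +ε, so every weight starts at a random value in that range. In Octave, the rand() function is used to write this.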

rand(10,11) % creates a 10 x 11 matrix whose entries are random numbers between 0 and 1

Every entry lies between 0 and 1, so each one is a real number that can take any continuous value in that range.

Theta1 = rand(10,11) * (2*INIT_EPSILON) - INIT_EPSILON;

Theta2 = rand(1,11) * (2*INIT_EPSILON) - INIT_EPSILON;

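Multiplying by 2*INIT_EPSILON and subtracting INIT_EPSILON maps each entry to a number between -ε and +ε. The ε here has nothing to do with the ε used in gradient checking, where a small ε was added to θ to approximate the derivative numerically. The variable is named INIT_EPSILON to distinguish it from the gradient checking ε. Theta2 is initialized the same way.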
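The same pattern can be wrapped in a small reusable helper. In the sketch below, randInit is a hypothetical name introduced only for this example, and INIT_EPSILON = 0.12 is an assumed value, since the text above does not fix a number. The helper returns an L_out x (L_in + 1) matrix, where the extra column holds the bias weights.

```octave
INIT_EPSILON = 0.12;   % assumed value, not given in the text above

% Hypothetical helper: rand() returns values in (0, 1), so scaling by
% 2*epsilon and shifting by -epsilon maps them into (-epsilon, epsilon).
randInit = @(L_in, L_out, epsilon) rand(L_out, L_in + 1) * (2 * epsilon) - epsilon;

Theta1 = randInit(10, 10, INIT_EPSILON);   % 10 x 11, same shape as rand(10,11)
Theta2 = randInit(10, 1, INIT_EPSILON);    % 1 x 11, same shape as rand(1,11)

% Sanity check: every entry must lie within [-INIT_EPSILON, INIT_EPSILON].
in_range = all(abs(Theta1(:)) <= INIT_EPSILON) && all(abs(Theta2(:)) <= INIT_EPSILON);
fprintf('Theta1 is %dx%d, Theta2 is %dx%d, all in range: %d\n', ...
        size(Theta1, 1), size(Theta1, 2), size(Theta2, 1), size(Theta2, 2), in_range);
```

Passing the layer sizes as arguments keeps the shape logic in one place, which helps when a network has more than two weight matrices.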

So to summarize, to train a neural network, what you should do is randomly initialize the weights to small values close to zero, between -epsilon and +epsilon, say. Then implement backpropagation, do gradient checking, and use either gradient descent or one of the advanced optimization algorithms to try to minimize J(theta) as a function of the parameters theta, starting from just a randomly chosen initial value for the parameters. And by doing symmetry breaking, which is this process, hopefully gradient descent or the advanced optimization algorithms will be able to find a good value of theta.

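To summarize, a neural network runs into the symmetry problem when its weight parameters θ are all 0 or all share the same value. To avoid it, initialize the weights randomly to values close to 0, between -ε and +ε. Then implement backpropagation and run gradient checking. Finally, use gradient descent or an advanced optimization algorithm to minimize the cost function J(θ), starting from the random initial values. As long as the initial weights do not create the symmetry problem, gradient descent or an advanced optimization algorithm has a chance of finding a good value of θ.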
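The whole workflow from the summary can be sketched in a few dozen lines of Octave: random initialization, backpropagation, a one-time gradient check, and then gradient descent. The toy data, the network size (2 inputs, 2 hidden units, 1 output), INIT_EPSILON = 0.12, CHECK_EPSILON = 1e-4, and alpha = 0.5 are all assumptions made for this illustration.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Toy training set (made-up values).
X = [0 0; 0 1; 1 0; 1 1];
y = [0; 1; 1; 1];
m = size(X, 1);
A1 = [ones(m, 1) X];

% Cost J as a function of the unrolled parameter vector t (Theta1 is 2x3, Theta2 is 1x3).
hypothesis = @(t) sigmoid([ones(m, 1) sigmoid(A1 * reshape(t(1:6), 2, 3)')] ...
                          * reshape(t(7:9), 1, 3)');
cost = @(t) mean(-y .* log(hypothesis(t)) - (1 - y) .* log(1 - hypothesis(t)));

% Step 1: random initialization (symmetry breaking).
INIT_EPSILON = 0.12;   % assumed value
Theta1 = rand(2, 3) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(1, 3) * (2 * INIT_EPSILON) - INIT_EPSILON;

alpha = 0.5;           % learning rate (assumed value)
for iter = 1:200
  % Step 2: forward propagation and backpropagation.
  A2 = sigmoid(A1 * Theta1');
  A2b = [ones(m, 1) A2];
  A3 = sigmoid(A2b * Theta2');
  D3 = A3 - y;
  D2 = (D3 * Theta2(:, 2:end)) .* A2 .* (1 - A2);
  Theta1_grad = (D2' * A1) / m;
  Theta2_grad = (D3' * A2b) / m;

  if iter == 1
    % Step 3: gradient checking with a two-sided difference, done once.
    t = [Theta1(:); Theta2(:)];
    backprop_grad = [Theta1_grad(:); Theta2_grad(:)];
    num_grad = zeros(size(t));
    CHECK_EPSILON = 1e-4;   % unrelated to INIT_EPSILON (assumed value)
    for i = 1:numel(t)
      e = zeros(size(t));
      e(i) = CHECK_EPSILON;
      num_grad(i) = (cost(t + e) - cost(t - e)) / (2 * CHECK_EPSILON);
    end
    check_gap = max(abs(num_grad - backprop_grad));
  end

  % Step 4: gradient descent update.
  Theta1 = Theta1 - alpha * Theta1_grad;
  Theta2 = Theta2 - alpha * Theta2_grad;
end

hidden_gap = max(abs(Theta1(1, :) - Theta1(2, :)));
fprintf('gradient check gap: %g\n', check_gap);
fprintf('gap between the two rows of Theta1 after training: %g\n', hidden_gap);
fprintf('final cost: %g\n', cost([Theta1(:); Theta2(:)]));
```

A tiny gradient check gap suggests the backpropagation code agrees with the numerical derivative, and a nonzero gap between the rows of Theta1 shows that the two hidden units are no longer copies of each other. In practice the check is switched off after it passes, because the numerical gradient is slow to compute.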