by 라인하트 Oct 29. 2020

Andrew Ng's Machine Learning (9-4): Backpropagation, Unrolling Parameters

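Professor Andrew Ng, co-founder of the online course platform Coursera, is a leading figure in the AI field. The machine learning course he taught to beginners at Stanford University is available for free on Coursera (Coursera.org). It is an essential course for newcomers to machine learning, and one that anyone studying AI and machine learning on their own will naturally come across.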

Neural Networks: Learning

Backpropagation in Practice

Implementation Note: Unrolling Parameters

In the previous video, we talked about how to use back propagation to compute the derivatives of your cost function. In this video, I want to quickly tell you about one implementational detail of unrolling your parameters from matrices into vectors, which we need in order to use the advanced optimization routines.

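The previous lecture covered how to use the backpropagation algorithm to compute the derivatives of the cost function. This lecture covers one implementation detail: unrolling the parameters from matrices into vectors so that they can be passed to the advanced optimization algorithms.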

Concretely, let's say you've implemented a cost function that takes as input the parameters theta and returns the cost function value and the derivatives. Then you can pass this to an advanced optimization algorithm like fminunc, and fminunc isn't the only one, by the way. There are also other advanced optimization algorithms. But what all of them do is take as input a pointer to the cost function and some initial value of theta. And these routines assume that theta and the initial value of theta are parameter vectors, maybe Rn or Rn plus 1. But these are vectors, and it also assumes that your cost function will return as a second return value this gradient, which is also Rn or Rn plus 1. So also a vector. This worked fine when we were using logistic regression, but now that we're using a neural network our parameters are no longer vectors. Instead they are these matrices, where for a full neural network we would have parameter matrices theta 1, theta 2, theta 3 that we might represent in Octave as the matrices Theta1, Theta2, Theta3. And similarly for these gradient terms that we're expected to return. In the previous video we showed how to compute these gradient matrices, which were capital D1, capital D2, capital D3, which we might represent in Octave as matrices D1, D2, D3.

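We implement a cost function over the parameters Θ (theta). The costFunction function returns the following two values.

jVal : the computed value of the cost function

gradient : the partial derivatives of the cost function with respect to the parameters (the gradient, not the minimum itself)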

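The fminunc() function invokes an advanced optimization algorithm. It is of course not the only routine of this kind. The arguments of the fminunc() call are summarized briefly below.

'GradObj', 'on' : enables the GradObj option

This tells fminunc that the cost function returns both the cost and the gradient

fminunc uses the gradient while minimizing the function

'MaxIter', 400 : sets the MaxIter option to 400

fminunc runs for at most 400 iterations

@(t)(costFunction(t, X, y)) : an anonymous function that calls costFunction(t, X, y)

fminunc calls costFunction through this handle.

initialTheta : the initial value of the parameters θ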

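fminunc() treats theta and initialTheta as parameter vectors in R^(n) or R^(n+1), and expects costFunction() to return gradient as a vector in R^(n) or R^(n+1) as well. This works without trouble for logistic regression. In a neural network, however, the parameters Θ are no longer vectors but matrices. The parameter matrices Θ^(1), Θ^(2), Θ^(3) are represented in Octave as Theta1, Theta2, Theta3.

Likewise, the gradient that costFunction() has to return consists of the matrices D^(1), D^(2), D^(3) computed in the previous lecture, which are represented in Octave as D1, D2, D3.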

In this video I want to quickly tell you about the idea of how to take these matrices and unroll them into vectors, so that they end up being in a format suitable for passing in as theta here, or for getting out as a gradient there.

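This lecture explains how to unroll matrices into vectors. The parameter matrices need to be in a form that can be passed in as theta and initialTheta, and the derivative matrices in a form that can be returned as gradient.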

Concretely, let's say we have a neural network with one input layer with ten units, a hidden layer with ten units, and one output layer with just one unit, so s1 is the number of units in layer one, s2 is the number of units in layer two, and s3 is the number of units in layer three. In this case, the dimensions of your matrices theta and D are going to be given by these expressions. For example, theta 1 is going to be a 10 by 11 matrix and so on. So suppose you want to convert between these matrices and vectors.

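For example, consider the following neural network: an input layer with 10 units, a hidden layer with 10 units, and an output layer with 1 output unit. s1 is the number of units in the first layer, s2 the number in the second layer, and s3 the number in the third layer. The dimensions of the matrices Θ and D are as follows: Θ^(1) is a 10 X 11 matrix, Θ^(2) is a 10 X 11 matrix, and Θ^(3) is a 1 X 11 matrix. Each derivative matrix D^(l) has the same dimensions as the corresponding Θ^(l).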

What you can do is take your theta 1, theta 2, theta 3, and write this piece of code, and this will take all the elements of your three theta matrices, all the elements of theta 1, all the elements of theta 2, all the elements of theta 3, and unroll them and put all the elements into a big long vector, which is thetaVec. And similarly the second command would take all of your D matrices and unroll them into a big long vector and call it DVec.

The idea is to write a line of code that takes Theta1, Theta2, and Theta3, pulls out all the elements of each, and lays them out in one big long vector. This is thetaVec. DVec is built the same way: it takes all the D matrices and unrolls them into one big long vector.

And finally if you want to go back from the vector representations to the matrix representations. What you do to get back to theta one say is take thetaVec and pull out the first 110 elements. So theta 1 has 110 elements because it's a 10 by 11 matrix so that pulls out the first 110 elements and then you can use the reshape command to reshape those back into theta 1. And similarly, to get back theta 2 you pull out the next 110 elements and reshape it. And for theta 3, you pull out the final eleven elements and run reshape to get back the theta 3.

Finally, the vector can be turned back into matrices. To rebuild Theta1, take the first 110 elements of thetaVec. Theta1 is a 10 X 11 matrix, so after extracting 110 elements the reshape() command turns them back into a 10 X 11 matrix. To rebuild Theta2, take the next 110 elements of thetaVec and reshape() them into a 10 X 11 matrix. For Theta3, take the final 11 elements and reshape() them into a 1 X 11 matrix.
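The slices described above (the first 110 elements, the next 110, and the final 11) follow directly from the matrix shapes. As a minimal sketch, assuming the shapes from this example, the index ranges can be computed instead of typed by hand. The variable names shapes, counts, stops, and starts are illustrative and do not come from the lecture.

```octave
% Shapes of Theta1, Theta2, Theta3 from the example (rows, columns).
shapes = [10 11; 10 11; 1 11];

% Number of elements in each matrix.
counts = prod(shapes, 2);

% Last and first index of each matrix inside the unrolled vector.
stops  = cumsum(counts);
starts = stops - counts + 1;

% Print the slice of thetaVec that belongs to each matrix.
for l = 1:size(shapes, 1)
  fprintf('Theta%d: thetaVec(%d:%d) -> %d x %d\n', ...
          l, starts(l), stops(l), shapes(l, 1), shapes(l, 2));
end
```

Deriving the ranges this way avoids off-by-one mistakes when the layer sizes change.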

Here's a quick Octave demo of that process. So for this example let's set theta 1 equal to ones of 10 by 11, so it's a matrix of all ones. And just to make this easier to see, let's set theta 2 to be 2 times ones of 10 by 11, and let's set theta 3 equal to 3 times ones of 1 by 11.

Here is an Octave session. Theta1 is set to a 10 X 11 matrix of all ones, Theta2 to a 10 X 11 matrix of all twos, and Theta3 to a 1 X 11 matrix of all threes.

Theta1 = ones(10,11)

Theta2 = 2 * ones(10,11)

Theta3 = 3 * ones(1,11)

So these are 3 separate matrices: theta 1, theta 2, theta 3. We want to put all of these into a vector. thetaVec equals Theta1(:); Theta2(:); Theta3(:). Right, that's a colon in the middle, like so, and now thetaVec is going to be a very long vector. That's 231 elements. If I display it, I find that this is a very long vector with all the elements of the first matrix, all the elements of the second matrix, then all the elements of the third matrix.

Here are the three matrices Theta1, Theta2, and Theta3. We combine them into a single vector.

thetaVec = [Theta1(:); Theta2(:); Theta3(:)]

thetaVec is a very long vector with 231 elements, because it holds all the elements of the first matrix, then all the elements of the second matrix, then all the elements of the third matrix.

And if I want to get back my original matrices, I can do reshape thetaVec. Let's pull out the first 110 elements and reshape them to a 10 by 11 matrix. This gives me back theta 1. And if I then pull out the next 110 elements. So that's indices 111 to 220. I get back all of my 2's. And if I go from 221 up to the last element, which is element 231, and reshape to 1 by 11, I get back theta 3.

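To get the original matrices back, thetaVec can be reshaped. The reshape() command rearranges elements into a matrix of the given dimensions.

reshape(thetaVec(1:110), 10, 11)

reshape(thetaVec(111:220), 10, 11)

reshape(thetaVec(221:231), 1, 11)

These commands return all the elements of Theta1, Theta2, and Theta3, respectively.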
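Because the demo matrices are constant, the output would look the same even if the elements were shuffled. A small sketch with distinct entries can confirm that unrolling with (:) followed by reshape() is an exact round trip, since both use column-major order. The test values here are made up for illustration, and T1, T2, T3, and roundTripOk are names introduced only for this check.

```octave
% Matrices with distinct entries so any reordering would be visible.
Theta1 = reshape(1:110, 10, 11);
Theta2 = reshape(111:220, 10, 11);
Theta3 = reshape(221:231, 1, 11);

% Unroll: (:) stacks each matrix column by column.
thetaVec = [Theta1(:); Theta2(:); Theta3(:)];

% Roll back up with the same shapes.
T1 = reshape(thetaVec(1:110), 10, 11);
T2 = reshape(thetaVec(111:220), 10, 11);
T3 = reshape(thetaVec(221:231), 1, 11);

% True only if every element landed back in its original position.
roundTripOk = isequal(T1, Theta1) && isequal(T2, Theta2) && isequal(T3, Theta3);
disp(roundTripOk);
```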

To make this process really concrete, here's how we use the unrolling idea to implement our learning algorithm. Let's say that you have some initial value of the parameters theta 1, theta 2, theta 3. What we're going to do is take these and unroll them into a long vector we're gonna call initial theta to pass in to fminunc as this initial setting of the parameters theta.

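This is how the unrolling idea is used to implement the neural network learning algorithm. Suppose we have initial values for the parameters Theta1, Theta2, and Theta3. We unroll them into one long vector called initialTheta, and fminunc() uses initialTheta as the initial setting of the parameters.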

The other thing we need to do is implement the cost function. Here's my implementation of the cost function. The cost function is going to get as input thetaVec, which is going to be all of my parameters in the form that's been unrolled into a vector. So the first thing I'm going to do is use thetaVec and the reshape function. So I'll pull out elements from thetaVec and use reshape to get back my original parameter matrices, theta 1, theta 2, theta 3. So these are going to be matrices that I'm going to get. So that gives me a more convenient form in which to use these matrices so that I can run forward propagation and back propagation to compute my derivatives, and to compute my cost function J of theta. And finally, I can then take my derivatives and unroll them, keeping the elements in the same ordering as I did when I unrolled my thetas. So I'm going to unroll D1, D2, D3 to get gradientVec, which is now what my cost function can return. It can return a vector of these derivatives.

Next, we implement the cost function. The input to costFunction() is thetaVec, a very long vector containing all the parameters in unrolled form. The first step is to use reshape() to recover the original parameter matrices Theta1, Theta2, and Theta3 from thetaVec. The matrix form is more convenient for running forward propagation and backpropagation to compute the derivatives and the cost function J(Θ).

Finally, the computed derivatives are unrolled, keeping the elements in the same order that was used when unrolling Θ. Unrolling D1, D2, and D3 yields gradientVec, which is what the cost function returns: a vector of derivatives.
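The pattern described here can be sketched end to end. This is a minimal sketch, not the lecture's implementation: a simple quadratic cost stands in for forward propagation and backpropagation, so that the reshape and unroll steps can be run on their own. The variable names optTheta1, optTheta2, and optTheta3 are hypothetical. The leading `1;` is an Octave idiom that lets functions be defined inside a script file.

```octave
1;  % marks this file as a script so a function can be defined inline (Octave)

% Hypothetical cost function following the unrolled-parameter pattern.
% The quadratic cost is a stand-in for forward propagation and
% backpropagation, which are not reproduced here.
function [jVal, gradientVec] = costFunction(thetaVec)
  % Recover the parameter matrices from the unrolled vector.
  Theta1 = reshape(thetaVec(1:110), 10, 11);
  Theta2 = reshape(thetaVec(111:220), 10, 11);
  Theta3 = reshape(thetaVec(221:231), 1, 11);

  % Stand-in for J(Theta): half the sum of squared parameters.
  jVal = 0.5 * (sum(Theta1(:).^2) + sum(Theta2(:).^2) + sum(Theta3(:).^2));

  % Stand-in for D1, D2, D3: the derivative of the cost above
  % with respect to each matrix is the matrix itself.
  D1 = Theta1;
  D2 = Theta2;
  D3 = Theta3;

  % Unroll the gradients in the same order as the parameters.
  gradientVec = [D1(:); D2(:); D3(:)];
end

% Initial parameter matrices, unrolled into one vector for the optimizer.
Theta1 = ones(10, 11);
Theta2 = 2 * ones(10, 11);
Theta3 = 3 * ones(1, 11);
initialTheta = [Theta1(:); Theta2(:); Theta3(:)];

options = optimset('GradObj', 'on', 'MaxIter', 100);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);

% Reshape the optimized vector back into matrices for later use.
optTheta1 = reshape(optTheta(1:110), 10, 11);
optTheta2 = reshape(optTheta(111:220), 10, 11);
optTheta3 = reshape(optTheta(221:231), 1, 11);
```

In a real network, the stand-in lines would be replaced by forward propagation to compute jVal and backpropagation to compute D1, D2, and D3. The reshape at the top and the unroll at the bottom would stay the same.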

So, hopefully, you now have a good sense of how to convert back and forth between the matrix representation of the parameters versus the vector representation of the parameters.

The advantage of the matrix representation is that when your parameters are stored as matrices it's more convenient when you're doing forward propagation and back propagation, and it's easier to take advantage of the, sort of, vectorized implementations. Whereas in contrast, the advantage of the vector representation, when you have like thetaVec or DVec, is that when you are using the advanced optimization algorithms, those algorithms tend to assume that you have all of your parameters unrolled into a big long vector. And so with what we just went through, hopefully you can now quickly convert between the two as needed.

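So far we have covered how to convert back and forth between parameter matrices and parameter vectors. The advantage of the matrix representation is that forward propagation and backpropagation are more convenient when the parameters are stored as matrices, and vectorized implementations are easier to use. The advantage of the vector representation, such as thetaVec or DVec, is that it is what the advanced optimization algorithms expect: those algorithms tend to assume that all parameters have been unrolled into one big long vector. With what we just went through, it should be possible to switch quickly between the two as needed.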

Andrew Ng's Machine Learning video lecture

Summary

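Advanced optimization algorithms do not require choosing the learning rate alpha by hand, and they often converge to a minimum much faster than gradient descent. They can be applied in practice to get better results even without knowing how they work internally. Rather than implementing these complex algorithms yourself, use a software library. Advanced optimization is used in Octave as follows.

Assuming J(θ) = (θ1 - 5)^2 + (θ2 - 5)^2, the costFunction.m function file is written as follows.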

function [jVal, gradient] = costFunction(theta)

jVal = (theta(1)-5)^2 + (theta(2)-5)^2;

gradient = zeros(2,1);

gradient(1) = 2 * (theta(1)-5);

gradient(2) = 2 * (theta(2)-5);

end

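The costFunction function returns two values.

jVal : the value of the cost function J(θ)

gradient : a 2 X 1 vector holding the partial derivatives of J(θ) with respect to θ1 and θ2

In Octave, fminunc() is called to apply an advanced optimization algorithm to the costFunction.m function file.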

options = optimset('GradObj', 'on', 'MaxIter', 100);

initialTheta = zeros(2,1); % initialize both values of the parameter vector θ to [0;0]

[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options)

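fminunc() : an Octave function that solves unconstrained minimization problems

optimset('GradObj', 'on') : indicates that the cost function supplies the gradient

optimset('MaxIter', 100) : sets the maximum number of iterations to 100

exitFlag : indicates whether the algorithm converged

Advanced optimization algorithms from sophisticated optimization libraries are harder to debug, but they often run much faster than gradient descent.

The same advanced optimization techniques can be used for neural networks.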

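The parameters are no longer vectors but matrices. In Octave, the parameter matrices Θ^(1), Θ^(2), Θ^(3) are written as Theta1, Theta2, Theta3, and the matrices D^(1), D^(2), D^(3), whose entries D^(l)ij are the partial derivatives of the cost function J(Θ), are written as D1, D2, D3. For a neural network with an input layer of 10 units, a hidden layer of 10 units, and an output layer of 1 output unit, Θ^(1) is a 10 X 11 matrix, Θ^(2) is a 10 X 11 matrix, and Θ^(3) is a 1 X 11 matrix. The derivative matrices have matching dimensions: D^(1) is 10 X 11, D^(2) is 10 X 11, and D^(3) is 1 X 11.

However, the advanced optimization algorithms work on vectors, so the matrices have to be converted into a vector. For example, when Theta1 and Theta2 are 10 X 11 matrices and Theta3 is a 1 X 11 matrix, they are converted into a 231 X 1 vector.

thetaVec = [Theta1(:); Theta2(:); Theta3(:)]

The vector is then converted back into matrices that match Theta1, Theta2, and Theta3.

reshape(thetaVec(1:110), 10, 11)

reshape(thetaVec(111:220), 10, 11)

reshape(thetaVec(221:231), 1, 11)

In the same way, the D1, D2, and D3 matrices are unrolled into the DVec vector, so that the cost function can return them as a single gradient vector.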