by 라인하트 Oct 27. 2020

Andrew Ng's Machine Learning (9-1): The Cost Function for Neural Networks

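Professor Andrew Ng, co-founder of the online learning platform Coursera, is a leading figure in the artificial intelligence field. The introductory machine learning course he taught at Stanford University is available for free on Coursera (Coursera.org). It is an essential course for machine learning beginners, and one you naturally come across when studying AI and machine learning on your own.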

Neural Networks: Learning

Cost Function and Backpropagation

Cost Function

Neural networks are one of the most powerful learning algorithms that we have today. In this and in the next few videos, I'd like to start talking about a learning algorithm for fitting the parameters of a neural network given a training set. As with the discussion of most of our learning algorithms, we're going to begin by talking about the cost function for fitting the parameters of the network.

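Neural networks are among the most powerful learning algorithms available today. Starting with this lecture, we cover a learning algorithm that finds the neural network parameters Θ that fit a training set. As with the other learning algorithms so far, we start from the cost function J(Θ) in order to find the parameter matrices Θ.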

I'm going to focus on the application of neural networks to classification problems. So suppose we have a network like that shown on the left, and suppose we have a training set of m training examples, pairs (x^(i), y^(i)). I'm going to use upper case L to denote the total number of layers in this network. So for the network shown on the left we would have capital L equals 4. I'm going to use s subscript l to denote the number of units, that is the number of neurons, not counting the bias unit, in layer l of the network. So for example, we would have s_1, which is the input layer, equal to 3 units, and s_2 in my example is 5 units. And the output layer s_4, which is also equal to s_L because capital L is equal to 4, has four units in my example.

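Here is a neural network applied to a classification problem.

Suppose there are m training examples. Capital L is the number of layers in the network; here L = 4. s_l, with subscript l, is the number of neurons (units) in layer l, not counting the bias unit. In the network in the figure, s_1 = 3, s_2 = 5, s_3 = 5, and s_4 = 4. Since the number of layers L is 4, s_L is s_4, the size of the output layer, which has 4 units.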
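The notation above can be sketched in a few lines of Python. This is only an illustration: the names `layer_sizes` and `theta_shapes` are hypothetical, and the shape rule for each weight matrix (s_(l+1) rows by s_l + 1 columns, the extra column being for the bias unit) is assumed from the course's earlier convention rather than stated in this section.

```python
# Layer sizes of the example network: s_1 = 3, s_2 = 5, s_3 = 5, s_4 = 4
# (bias units are not counted).
layer_sizes = [3, 5, 5, 4]

L = len(layer_sizes)      # total number of layers
s_L = layer_sizes[-1]     # number of units in the output layer


def theta_shapes(sizes):
    """Shape of each weight matrix Theta^(l): s_(l+1) rows, s_l + 1 columns.

    The extra column is assumed to hold the weights for the bias unit of layer l.
    """
    return [(sizes[l + 1], sizes[l] + 1) for l in range(len(sizes) - 1)]


shapes = theta_shapes(layer_sizes)
print(L, s_L, shapes)  # 4 4 [(5, 4), (5, 6), (4, 6)]
```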

We're going to consider two types of classification problems. The first is binary classification, where the labels y are either 0 or 1. In this case, we will have one output unit. The neural network on top has four output units, but if we had binary classification we would have only one output unit that computes h(x), and the output of the neural network, h(x), is going to be a real number. In this case the number of output units, s_L, where L is again the index of the final layer, because that's the number of layers we have in the network, is going to be equal to 1. To simplify notation later, I'm also going to set K = 1, so you can think of K as also denoting the number of units in the output layer.

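There are two types of classification problems. The first is binary classification, where the label is y = 0 or y = 1. Binary classification has a single output unit that computes hθ(x), and the output hθ(x) is a real number between 0 and 1. The network above, by contrast, has four output units. Because L is the index of the final layer, s_L is the number of units in the output layer, which is 1 here. To keep the notation uniform, we set K = 1 for a binary classification network, where K denotes the number of units in the output layer.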

The second type of classification problem we'll consider is the multi-class classification problem, where we may have K distinct classes. Our earlier example had this representation for y if we have 4 classes, and in this case we will have capital K output units, and our hypothesis outputs vectors that are K-dimensional. The number of output units will be equal to K. Usually we would have K greater than or equal to 3 in this case, because if we had two classes, then we don't need to use the one-versus-all method. We use the one-versus-all method only if we have K greater than or equal to 3 classes, so with only two classes we need only one output unit.

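The second type is the multi-class problem with K distinct classes. When there are K classes, there are also K output units. The label y of each training example and the output hθ(x) are K-dimensional vectors in R^K. Multi-class problems, where K is greater than or equal to 3, use the one-versus-all method. Binary classification needs only a single output unit, so it does not need one-versus-all.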
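The two label representations described above can be sketched as follows. The helper `encode_label` is hypothetical, and it assumes class indices are 0-based, which the lecture does not specify.

```python
def encode_label(class_index, K):
    """Return the target y for one training example.

    class_index is assumed to be 0-based. For K = 1 (binary classification)
    the label itself (0 or 1) is the single target. For multi-class problems
    the target is a K-dimensional vector with a 1 at the true class.
    """
    if K == 1:
        return [class_index]
    y = [0] * K
    y[class_index] = 1
    return y


print(encode_label(2, 4))  # [0, 0, 1, 0]
print(encode_label(1, 1))  # [1]
```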

Now let's define the cost function for our neural network. The cost function we use for the neural network is going to be a generalization of the one that we use for logistic regression. For logistic regression we used to minimize the cost function J(θ) that was minus 1/m of this cost function and then plus this extra regularization term here, where this was a sum from j = 1 through n, because we did not regularize the bias term θ0.

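Now we define the cost function for the neural network. The cost function used for a neural network generalizes the one used for logistic regression. Recall the hypothesis and the regularized cost function J(θ) of logistic regression.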

The regularization term runs from j = 1 to n because the bias term θ0, whose input x0 is always 1, is not regularized.
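For reference, the logistic regression hypothesis and regularized cost described above can be written out as below. This is a reconstruction from the lecture's verbal description, not a copy of the slide. The symbol λ for the regularization parameter is the course's usual notation and is assumed here, since this section does not name it.

```latex
h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2
```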

For a neural network, our cost function is going to be a generalization of this, where instead of having basically just one logistic regression output unit, we may instead have K of them. So here's our cost function. Our neural network now outputs vectors in R^K, where K might be equal to 1 if we have a binary classification problem. I'm going to use the notation (h(x))_i to denote the i-th output. That is, h(x) is a K-dimensional vector, and the subscript i just selects the i-th element of the vector that is output by my neural network. My cost function J(Θ) is now going to be the following: minus 1 over m of a sum of a term similar to what we have for logistic regression, except that we have the sum from k equals 1 through K. This summation is basically a sum over my K output units. So if I have four output units, that is, if the final layer of my neural network has four output units, then this is a sum from k equals 1 through 4 of basically the logistic regression algorithm's cost function, summing that cost function over each of my four output units in turn. You notice in particular that this applies to y_k and h_k, because we're basically taking the k-th output unit and comparing it to the value of y_k, which is the element of one of those vectors saying what class it should be.

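The neural network cost function builds on the logistic regression cost function. The learning problem may be binary classification, with one output unit, or multi-class classification, with K output units. A network for binary classification has K = 1, so R^K is R^1 and the output is a single real number between 0 and 1. The output (hθ(x))_i is the activation of the i-th output unit. hθ(x) is a K-dimensional vector, and the subscript i selects the i-th element of the vector output by the network. This leads to the neural network cost function J(Θ).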
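Written out, the cost function the lecture describes takes roughly the following form. This is a reconstruction from the verbal description in this section, not a copy of the slide. The regularization parameter λ is assumed notation, and the index ranges in the last term follow the convention of skipping the bias weights (i = 0).

```latex
J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log \left(h_\Theta(x^{(i)})\right)_k + (1 - y_k^{(i)}) \log\left(1 - \left(h_\Theta(x^{(i)})\right)_k\right) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left(\Theta_{ji}^{(l)}\right)^2
```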

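It resembles the logistic regression cost function. The second summation runs from k = 1 to K and sums over the K output units. For example, if the output layer has 4 output units, the logistic regression cost is summed for k from 1 to 4. The index k applies to both y^(i)_k and (hΘ(x^(i)))_k: the k-th output unit is compared with y^(i)_k, the k-th element of the label vector, which indicates what the class should be.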
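The double sum over examples and output units can be sketched in plain Python as below. This covers only the first (unregularized) term. The function name `unregularized_cost` and the nested-list layout of `H` and `Y` are assumptions made for illustration, and the outputs are assumed to lie strictly between 0 and 1 so that the logarithms are defined.

```python
import math


def unregularized_cost(H, Y):
    """Average cross-entropy over m examples and K output units.

    H[i][k] is the network output (h(x^(i)))_k, assumed strictly in (0, 1).
    Y[i][k] is the label y^(i)_k, either 0 or 1.
    """
    m = len(Y)
    total = 0.0
    for h_i, y_i in zip(H, Y):        # sum over the m training examples
        for h, y in zip(h_i, y_i):    # sum over the K output units
            total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total / m


# Two examples, K = 2 output units, every output equal to 0.5.
H = [[0.5, 0.5], [0.5, 0.5]]
Y = [[1, 0], [0, 1]]
cost = unregularized_cost(H, Y)
print(cost)
```

With every output at 0.5, each of the four terms contributes log(0.5), so the result should equal 2·ln 2.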

And finally, the second term here is the regularization term, similar to what we had for logistic regression. This summation term looks really complicated, but all it's doing is summing over these terms Θ_ji^(l) for all values of i, j, and l, except that we don't sum over the terms corresponding to the bias values, just as in logistic regression.

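Finally, the second term is a regularization term similar to the one in logistic regression. It looks really complicated, but it simply adds up the squared parameters (Θ^(l)_ji)^2 over all values of i, j, and l. As in logistic regression, the bias terms are left out of the sum.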

Concretely, we don't sum over the terms corresponding to where i is equal to 0. That is because when we're computing the activation of a neuron, we have terms like these: Θ_i0 plus Θ_i1 x_1 plus and so on, where I guess I could put in a superscript 2 there; this is the first hidden layer. The values with a zero there correspond to something that multiplies into an x_0 or an a_0. So this is kind of like a bias unit, and by analogy to what we were doing for logistic regression, we won't sum over those terms in our regularization term because we don't want to regularize them and shrink their values to zero. But this is just one possible convention, and even if you were to sum over i equals 0 up to s_l, it would work about the same and doesn't make a big difference. But maybe this convention of not regularizing the bias term is just slightly more common.

Concretely, the terms with i = 0 are not included in the sum. When computing a unit's activation, terms such as Θ^(2)_i0 x0 + Θ^(2)_i1 x1 + ... appear.

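The superscript 2 refers to the second layer. x0 can also be written as a0; x0 and a0 are bias units, analogous to the bias term in logistic regression. The parameters that multiply the bias unit do not need regularizing, so they are left out of the regularization sum. This is only a convention: regularizing the bias terms as well would work almost the same. Still, not regularizing the bias terms is the more common choice.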
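The bias-skipping convention can be illustrated with a small sketch of the regularization term. The function name `regularization_term`, the parameter name `lam` (standing in for the regularization parameter λ), and the layout in which column 0 of each weight matrix holds the bias weights are all assumptions for this example, not details given in the lecture.

```python
def regularization_term(thetas, lam, m):
    """(lam / 2m) times the sum of squared weights, skipping bias weights.

    thetas is a list of weight matrices, each a list of rows. Column 0 of
    every matrix is assumed to hold the weights that multiply the bias unit.
    """
    total = 0.0
    for theta in thetas:              # l = 1 .. L-1
        for row in theta:             # j = 1 .. s_(l+1)
            for weight in row[1:]:    # i = 1 .. s_l (i = 0 is skipped)
                total += weight ** 2
    return lam / (2 * m) * total


thetas = [
    [[10.0, 1.0, 2.0]],  # Theta^(1): bias weight 10.0 is ignored
    [[-7.0, 3.0]],       # Theta^(2): bias weight -7.0 is ignored
]
penalty = regularization_term(thetas, lam=1.0, m=2)
print(penalty)  # 3.5
```

Changing `row[1:]` to `row` would regularize the bias weights as well, which is the alternative convention the lecture says works about the same.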

So that's the cost function we're going to use for our neural network. In the next video we'll start to talk about an algorithm for trying to optimize the cost function.

This is the cost function used for neural networks. The next lecture covers an algorithm for optimizing the cost function.