by lawtech Apr 20. 2023

Fitting a Model to Data

Chapter 4. Introduction to Data Science

Summary

1. Model fitting (or parametric modeling)

The model is a partially specified numeric function, with some unspecified numeric parameters. We “fit” the model to the data by finding the best parameters. The meaning of “best” can be different for different applications.

2. Linear modeling

The model takes the form of a simple weighted sum of attributes. This family includes the support vector machine, linear regression, and logistic regression.

The difference between them lies in how each one answers the question “What exactly do we mean by best fitting the data?”

• Support vector machine: a line that maximizes the margin

• Linear regression: a line that minimizes the sum of the squares of errors

• Logistic regression: a line that maximizes the likelihood of the observed class labels

CONTENTS

I. Predictive Modeling

1. Nonparametric Modeling

(1) Space Partition of Classification Trees

(2) Another Way to Partition the Space

(3) Linear Classifier

2. Parametric Modeling(=Parametric Learning)

3. Classification Tree VS. Linear Classifier

4. Linear Classifier

(1) Shape of Decision Boundary

(2) Goal of General Linear Classifier

(3) Finding the "Best" Line

II. SVM(Support Vector Machine)

1. SVM vs. Logistic Regression

2. Support Vector Machine(SVM)

3. Basic Idea for the Best Line in SVM

4. Misclassification

5. SVM's Approach to Misclassification

III. Regression

1. Linear Regression

2. Least Squares Linear Regression

3. Logistic Regression

4. Using a Logistic Regression Model

IV. Objective(Cost, Loss) Function

1. Linear Classifier in Binary Classification

2. Logistic Regression in Binary Classification

I. Predictive Modeling

Predictive modeling means finding a model of the target variable in terms of the other attributes. There are two types of predictive modeling: nonparametric modeling and parametric modeling.

① Nonparametric modeling

• The structure of a model is not fixed

• The structure of a model is determined from the data

② Parametric modeling

• The structure of a model is fixed

• The structure of a model is specified by a data analyst

1. Nonparametric Modeling

We do not specify the structure of the model; it is learned from the data. For example, a classification tree, whose structure is determined automatically from the data, is a kind of nonparametric model.

(1) Space Partition of Classification Trees

If we use a classification tree, the space is divided into regions by decision boundaries. If a new, unseen instance falls into a segment, we assign it the target value of that segment.

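When a decision tree is used for classification, it creates four regions based on balance and age, as shown here, and the classification result depends on which of the four regions an instance falls into.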
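As a rough illustration of how such a partition acts as a classifier, the sketch below encodes four regions as nested if/then rules. The thresholds (50,000 for balance, 45 and 50 for age) and the class assigned to each region are hypothetical values chosen for this example, not read off the figure.

```python
def classify_with_tree(balance, age):
    """Return the class of the region that (balance, age) falls into.

    All thresholds and region labels are hypothetical.
    """
    if balance < 50_000:
        if age < 50:
            return "-"  # region 1: low balance, younger
        return "+"      # region 2: low balance, older
    if age < 45:
        return "-"      # region 3: high balance, younger
    return "+"          # region 4: high balance, older


print(classify_with_tree(30_000, 60))
print(classify_with_tree(80_000, 40))
```

Every boundary here is a test on a single attribute, which is why the regions of a tree are rectangles with edges perpendicular to the axes.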

(2) Another Way to Partition the Space

Here is another way to separate the instances: a linear classifier. Note that the line is not perpendicular to either axis.

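The same data distribution is split into regions, but in a different way: we find the equation of a straight line and treat instances above the line as the + (positive) class and instances below it as the - (negative) class. Dividing the space with a linear equation like this is called a linear classifier.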

(3) Linear Classifier

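With two attributes (Balance, Age), a single line equation is enough to split the plane in two. If an attribute x3 is added, the graph becomes a three-dimensional space and we need the equation of a plane, which is also expressed linearly. As before, an instance above the decision boundary is positive and one below it is negative.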

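In f(x)=1.5Balance+1.0Age-60=0, the coefficients 1.5 and 1.0 are parameters, and if an attribute x3 is added we need to find a matching parameter for it as well. Here too, an instance is classified as + if the function is greater than 0 and as - if it is smaller.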

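Plug values into the variables, and if the result is greater than 0 we can decide that the instance belongs to the positive class. So the aim is to find such a function and then check whether its value is above or below 0. If the value is exactly 0, the instance can be assigned to either class; it is a matter of convention.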

A linear classifier splits the space using a linear combination, that is, a weighted sum of the attributes.

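If we have found f(x)=1.5Balance+1.0Age-60=0, this is a weighted sum of the attributes: each attribute has a weight. Balance is multiplied by 1.5 and Age by 1.0, so Balance carries more weight and the result depends more on it. In this way the coefficients even tell us which attribute is more informative, provided the attributes are measured on comparable scales.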
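A minimal sketch of this decision rule, using the function f(x)=1.5Balance+1.0Age-60 from the text. The units of Balance and the sample inputs are assumptions made for illustration, and assigning the boundary case f(x)=0 to the negative class is an arbitrary convention.

```python
def f(balance, age):
    """Weighted sum of the attributes, taken from the example in the text."""
    return 1.5 * balance + 1.0 * age - 60


def classify(balance, age):
    """Return '+' above the decision boundary, '-' on or below it."""
    return "+" if f(balance, age) > 0 else "-"


# Hypothetical instances (Balance in arbitrary units).
print(f(20, 40), classify(20, 40))  # 10.0 +
print(f(10, 30), classify(10, 30))  # -15.0 -
print(f(20, 30), classify(20, 30))  # 0.0 - (exactly on the boundary)
```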

2. Parametric Modeling(=Parametric Learning)

Parametric modeling specifies the form of the model but leaves certain parameters unspecified; the task is to find the best parameter values given a training dataset. A familiar example from math class is the parameterized function "Y = aX+b", where a and b are called parameters. They are tuned so that the model fits the data best, giving, for example, "Y=1.39X+4.78".

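Finding a and b is what "parametric" refers to: we look for the optimal values. Once we have the slope and the intercept of the line, we have found the parameters we were after. To decide which values are optimal, we have to find the model that fits the data best. This is linear regression.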
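As a sketch of what "finding a and b" can look like in practice, the code below fits Y = aX + b with the closed-form least squares solution, which is one common criterion for "best" and the one discussed in the regression section. The data points are hypothetical and lie exactly on Y = 2X + 1, so the fit should recover those values.

```python
def fit_line(xs, ys):
    """Return (a, b) of Y = aX + b minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept follows from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Hypothetical training data generated from Y = 2X + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(xs, ys)
print(a, b)
```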

The form of the model is usually chosen based on domain knowledge or on other data mining techniques.

With a parametric model, our goal is to find the "optimal" values of the model parameters. So what exactly do we mean when we say a model fits the data well?

Our goal is to predict the correct class when data that was not used for training comes in. w denotes a weight, and how these weights are set determines the outcome.

Which line is the best and why?

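We want to find the best line. There can be a great many decision boundaries. Any of these lines may classify the training data 100% correctly, yet the class assigned to a new data point can differ depending on which line is used. To keep a reasonable level of accuracy on new data as well, we leave a margin.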

A model that separates the training data perfectly is not guaranteed to separate the test data perfectly, so finding such a best line is not easy.

3. Classification Tree VS. Linear Classifier

Both have the same goal: to separate the space into regions with different values of the target variable. Only the form is different.

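The goal is the same but the form differs. A decision tree descends to leaf nodes and can be written as if-then rules, whereas a linear classifier combines the attributes linearly, so it looks like the equation of a straight line.

What remains is the question of how each model obtains this formula, that is, its parameters.

Generalizing to n attributes gives the table. A vector is a point, always a point, whatever the dimension. An attribute is also called a feature.

Again we build f(x): if it is greater than 0 the instance is in the positive class, and if it is smaller, in the negative class.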

First, consider the following data. Then, to classify x, use a general classifier given by the following function.

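W is the weight, the parameter value we want to find. To find it, we define an objective function y(x) and keep training, looking for the parameter values at which the objective function reaches its maximum or its minimum.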

4. Linear Classifier

(1) Shape of Decision Boundary

f(x)=W0+W1X1+W2X2+W3X3+...=0

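This is called binary classification (discussed later). With two attributes, the linear classifier is a line. With three attributes it is the equation of a plane, so the boundary is a plane. With four attributes, the space is divided by a hyperplane of dimension 3. With ten attributes it has dimension 9, and so on.

Adding attributes extends the sum up to WnXn for n attributes. We want to find the values W1 through Wn. Once we have them, the equation f(x)=0 is the decision boundary, and instances are split by whether f(x) is positive or negative.

Finding the parameter values W1 through Wn is the result of learning.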

(2) Goal of General Linear Classifier

Given training data, find the best values of W0, W1, ..., Wn so that the model classifies the training data well and predicts new data as accurately as possible.

The weights also carry meaning. For example, the larger the magnitude of Wi, the more important Xi is for classifying the target (assuming the attributes are on comparable scales), and if Wi is near zero, Xi can usually be ignored or discarded.
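The general form f(x)=W0+W1X1+...+WnXn can be sketched for any number of attributes as follows. The weight values and instances below are hypothetical. The second weight is set to zero to show that the corresponding attribute then has no influence on the score.

```python
def score(weights, x):
    """f(x) = W0 + W1*X1 + ... + Wn*Xn, with weights = [W0, W1, ..., Wn]."""
    w0, rest = weights[0], weights[1:]
    if len(rest) != len(x):
        raise ValueError("need exactly one weight per attribute, plus W0")
    return w0 + sum(w * xi for w, xi in zip(rest, x))


def predict(weights, x):
    """Return +1 if f(x) > 0, otherwise -1."""
    return 1 if score(weights, x) > 0 else -1


# Hypothetical weights: W0 = -1.0, W1 = 2.0, W2 = 0.0, W3 = -0.5.
weights = [-1.0, 2.0, 0.0, -0.5]
score_a = score(weights, [1.0, 5.0, 1.0])
score_b = score(weights, [1.0, -40.0, 1.0])  # only X2 differs
print(score_a, score_b, predict(weights, [1.0, 5.0, 1.0]))
```

Because W2 is zero, changing X2 from 5.0 to -40.0 leaves the score unchanged, which is the sense in which such an attribute can be dropped.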

(3) Finding the "Best" Line

Unfortunately, it is not trivial to choose the "best" line to separate the classes.

Among many candidate lines, which one is the best?

Given such lines, we can see where each data point lies relative to them. A line can be called the best line only if it leaves a reasonable margin.

(4) What Weights Should We Choose?

General procedure

1) Define an objective function that represents our goal.

2) Find the optimal values of the weights that maximize (or minimize) the objective function.

The following models have the same form but use different objective functions to find the best values of W0, W1, ..., Wn: a. support vector machine (SVM), b. linear regression, and c. logistic regression.

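Here a weight is, as we saw above, the coefficient attached to an attribute. We use the training data to find the model's formula f(x). It would be nice to obtain it by direct calculation alone, but that is not easy, so we use an objective function such as a cost function or a loss function.

When the objective function is a cost function, good parameters make the cost go down. Because it is hard to find the parameters in one shot, we change them a little in some direction and look at the value of the objective function. Repeating this, the objective value eventually settles somewhere, and we take that point as the solution. Methods such as steepest descent (gradient descent) are used to learn the parameters in this way.

We can also define some measure and move between maximizing and minimizing it, for example by taking its reciprocal. If the objective is one where a decrease is good, we minimize it, and if an increase is good, we maximize it. In the end the two mean the same thing.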
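The iterative search described above can be sketched with plain gradient descent. This is a minimal example under stated assumptions: the model is y = W*x with a single weight, the cost is the sum of squared errors, the toy data lie on the line y = x, and the learning rate and number of steps are arbitrary choices.

```python
def cost(w, data):
    """Sum of squared errors of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data)


def gradient(w, data):
    """Derivative of the cost with respect to w."""
    return sum(2 * x * (w * x - y) for x, y in data)


def gradient_descent(data, w=0.0, learning_rate=0.01, steps=200):
    """Repeatedly move w a small step against the gradient of the cost."""
    history = [cost(w, data)]
    for _ in range(steps):
        w -= learning_rate * gradient(w, data)
        history.append(cost(w, data))
    return w, history


data = [(1, 1), (2, 2), (3, 3)]  # toy data on the line y = x
w, history = gradient_descent(data)
print(w, history[0], history[-1])
```

The cost starts at 14 for W = 0 and settles near 0 as W approaches 1, which is the "objective value stops moving" behavior the text describes.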

II. SVM(Support Vector Machine)

1. SVM vs. Logistic Regression

The two classification methods produce different boundaries since they're optimizing different objective functions.

Which separator looks better?

Briefly, SVMs are linear classifiers which classify based on a linear combination of the features.

What decision function do we obtain by fitting an SVM to the data? That is, which line does the SVM consider the best?

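Linear regression is not the same thing as a linear classifier, whereas logistic regression, despite its name, is a classification method. SVM, linear regression, and logistic regression all share the same linear form but use different objective functions, so they end up with different weights. The form is shared, but because the functions differ, the resulting parameter values can differ.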

3. Basic Idea for the Best Line in SVM

The best linear classifier is the line that maximizes the margin, which is the distance between the dashed lines. The center line between them is the linear classifier used by the SVM. That is, the best line is the one that is far from both classes.

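For example, take this data: it is the iris data, showing two of its three classes. Both methods find the equation of a line, and the lines differ because the objective functions differ. Which separator looks better?

This perpendicular distance, the margin, should be as large as possible, because it keeps the two classes apart. If the center line is the decision boundary, the data points that determine the maximum margin are vectors called support vectors, which is where the name SVM comes from.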
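To make the margin idea concrete, the sketch below compares two candidate lines on a small hypothetical dataset. For each line W0 + W1*X1 + W2*X2 = 0 it computes the perpendicular distance from the line to the closest training point, which is half of the margin width when the line sits midway between the classes. Both candidate lines and all points are made up for illustration; both lines separate the two classes, but one leaves far more room.

```python
import math


def distance_to_line(w0, w1, w2, point):
    """Perpendicular distance from point to the line w0 + w1*x1 + w2*x2 = 0."""
    x1, x2 = point
    return abs(w0 + w1 * x1 + w2 * x2) / math.hypot(w1, w2)


def closest_distance(line, points):
    """Distance from the line to the nearest training point."""
    return min(distance_to_line(*line, p) for p in points)


# Hypothetical data: the first two points are positive, the last two negative.
points = [(3, 3), (4, 3), (1, 1), (0, 1)]

line_a = (-4.0, 1.0, 1.0)   # x1 + x2 - 4 = 0
line_b = (-1.5, 1.0, 0.0)   # x1 - 1.5 = 0

print(closest_distance(line_a, points))
print(closest_distance(line_b, points))
```

Line A keeps every point at least about 1.41 away, while line B passes within 0.5 of a negative point, so a margin-maximizing method would prefer line A. The points that attain the smallest distance, here (3, 3) and (1, 1) for line A, play the role of support vectors.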

4. Misclassification

There are many situations in which a single line cannot perfectly separate the data into classes. There may be no such perfect separating line!

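The case we saw earlier is one where the data are cleanly separated, so classification works well with almost any method, not only an SVM. Misclassification, by contrast, is what we typically see in practice, sometimes severely. When the data cannot be separated by a single line, the space could also be split into three or four regions. How can an SVM deal with the misclassified instances?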

5. SVM's Approach to Misclassification

f(x) = cost = _____ (original function) + penalty (new objective function)

The main idea is to add a new term to the objective function: a penalty for each misclassified instance in the training data. Thus f(x) combines the original function, which measures the size of the margin of the line, with the new objective function, called the penalty.

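The new objective function includes a penalty. For this cost function a larger value is better, so the penalty term enters in the negative direction. If there is no penalty, that is, if there are no misclassified instances, the term can be ignored. But if the margin is large and several instances are misclassified, the penalty keeps growing, and the larger the penalty, the more is subtracted from the objective. Increasing ___ is good, but when there are many misclassifications the penalty subtracts for them. So what we look for is not simply the "best" value but the optimal one, a trade-off between the two terms.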

How much penalty should each misclassified point receive? The penalty is proportional to the distance from the margin boundary. This type of penalty is called the hinge loss function.

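The penalty is set differently for each data point, so that misclassified points are treated differently. We can add a hinge loss term that assigns no loss up to the margin and starts assigning loss once a point crosses that boundary. The two terms then compete with each other.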
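A minimal sketch of the hinge loss, assuming the usual convention that class labels are +1 and -1 and that the margin boundaries are where the score f(x) equals +1 and -1. The text does not give a formula, so the standard form max(0, 1 - y*f(x)) is used here as an assumption.

```python
def hinge_loss(y, score):
    """Hinge loss for a label y in {+1, -1} and a classifier score f(x).

    Zero when the point is on the correct side and outside the margin;
    otherwise it grows linearly with the distance past the margin boundary.
    """
    return max(0.0, 1.0 - y * score)


print(hinge_loss(+1, 2.0))   # 0.0: correct side, outside the margin
print(hinge_loss(+1, 0.5))   # 0.5: correct side, but inside the margin
print(hinge_loss(+1, -1.0))  # 2.0: misclassified
```

Points comfortably on the correct side contribute nothing, so only points inside the margin or on the wrong side pull against the term that rewards a wide margin.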

III. Regression

1. Linear Regression

Find the linear function that best describes the data and use it to predict the value of a target attribute.

What is the selected function used for linear regression?

When searching for the parameters, we look for the ones that predict the value well. With data like the following six points, which line should we choose?

There are many different linear regression procedures that use different objective functions.

The general procedure of linear regression is first to compute the error for each individual point in the training data, where the error is the vertical distance between the line and the point, and then to sum up the errors and find the W0, W1, ..., Wn that minimize this sum.

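There are six points, and a model fitted like this has errors. The error is the distance between the line and a point. First we compute the error for each data point, then we sum the errors. Because we need to know how well the line fits the data as a whole, we look for the parameters that minimize the summed error.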

2. Least Squares Linear Regression

The most common ("standard") linear regression procedure is least squares linear regression. Its objective is to find the W0, W1, W2, ..., Wn that minimize the sum of the squares of the errors.

Why do we use "Squares"?

If we minimized the plain sum of the errors, the objective function would not yield a unique line: positive and negative errors cancel, so many different lines give the same sum, and we want a unique line. If we instead minimized the sum of the absolute values of the errors, it would be hard to find W0, W1, ..., Wn mathematically, because the absolute value is not differentiable at zero. Squaring the errors avoids both problems.
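To see why the plain sum of errors does not pin down a unique line, the sketch below evaluates three different candidate lines on the points (1, 1), (2, 2), (3, 3). The candidate lines are chosen for illustration; each passes through the centroid (2, 2) of the data.

```python
data = [(1, 1), (2, 2), (3, 3)]

# Three candidate lines y = a*x + b, all passing through the centroid (2, 2).
lines = {
    "y = x": (1, 0),
    "y = 2": (0, 2),
    "y = -x + 4": (-1, 4),
}


def errors(a, b, data):
    """Signed errors (actual minus predicted) of the line y = a*x + b."""
    return [y - (a * x + b) for x, y in data]


sum_signed = {name: sum(errors(a, b, data)) for name, (a, b) in lines.items()}
sum_squared = {
    name: sum(e ** 2 for e in errors(a, b, data)) for name, (a, b) in lines.items()
}

print(sum_signed)   # {'y = x': 0, 'y = 2': 0, 'y = -x + 4': 0}
print(sum_squared)  # {'y = x': 0, 'y = 2': 2, 'y = -x + 4': 8}
```

All three lines have a signed error sum of 0, so that objective cannot tell them apart, while the sum of squares singles out y = x.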

Outlying data points can severely skew the resulting linear function. To overcome this, other objective functions exist.

Objective function: the sum of the squares of the errors.

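What we are looking for is the least squares solution, the one that minimizes the squares. We square the difference between each predicted value and the true value, add them all up, and divide by n. This is the same as the mean squared error (MSE).

Least squares = minimizing the Mean Squared Error

Why squares? If we simply sum the errors, we cannot get a unique answer, because many different lines give the same sum. To get around this, the natural first idea is to use absolute values, since they make every difference positive. But finding W0 through Wn is then mathematically hard, and squaring serves the same purpose of making every error positive, so we use squares, which are mathematically easy and give a unique solution.

When the squared errors are small, the absolute errors are small as well, so minimizing the squares causes no real problem. The simplicity comes with a drawback, though: if there are outliers, they bias the fitted line.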
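The effect of an outlier can be checked with a small experiment. This sketch assumes a model through the origin, y = W*x, for which the least squares weight has the closed form W = sum(x*y) / sum(x*x). The outlier point (4, 12) is hypothetical.

```python
def fit_through_origin(data):
    """Least squares weight W for the model y = W * x."""
    return sum(x * y for x, y in data) / sum(x * x for x, y in data)


clean = [(1, 1), (2, 2), (3, 3)]
with_outlier = clean + [(4, 12)]  # hypothetical outlier far above y = x

w_clean = fit_through_origin(clean)           # 14 / 14
w_outlier = fit_through_origin(with_outlier)  # 62 / 30

print(w_clean, w_outlier)
```

A single outlying point moves the slope from 1 to about 2.07, because its error is squared and therefore dominates the sum.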

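Let us work through an example. We have three data points, (1, 1), (2, 2), and (3, 3), and we ask which candidate line has the smaller objective value, that is, we look for the line with the smallest sum of the squares of the errors.

Since there are three data points, with x from 1 to 3, candidate lines for y_predict are drawn, and all three lines pass through the origin. In other words, the model is a linear function of the form y=ax. What we want to find is W, the weight parameter, so we write W instead of a, and since the line passes through the origin we can think of the model as having no intercept. Here the cost function is not a function of x or y but of W: the cost changes as W changes.

Start with W=1: the cost is 0, because this is the line y=x, which has no error. With W=0 there are errors, and squaring and summing them gives a larger error value than the well-fitted line.

If the given data are not exactly (1, 1), (2, 2), (3, 3) but, more realistically, (1, 1.2), (2, 1.8), (3, 2.7), the cost is still lowest near W=1, but the minimum is no longer 0, because some error remains.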
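The two cases can be checked numerically. The sketch below uses the sum of squared errors as the cost and scans candidate values of W from 0 to 2 in steps of 0.001. The grid and its resolution are arbitrary choices made for illustration.

```python
def cost(w, data):
    """Sum of squared errors of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data)


exact = [(1, 1), (2, 2), (3, 3)]
noisy = [(1, 1.2), (2, 1.8), (3, 2.7)]

# Candidate weights 0.000, 0.001, ..., 2.000.
candidates = [i / 1000 for i in range(2001)]
best_exact = min(candidates, key=lambda w: cost(w, exact))
best_noisy = min(candidates, key=lambda w: cost(w, noisy))

print(cost(0, exact), cost(1, exact))        # 14 0
print(best_exact, best_noisy)                # 1.0 0.921
print(cost(1, noisy), cost(best_noisy, noisy))
```

For the exact data the cost is 14 at W=0 and 0 at W=1. For the noisy data the minimum on this grid is at W=0.921, close to 1, and the cost there is small but not zero.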

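We can then find the W at which the error is smallest. One way is to differentiate, and there are other methods as well. With a single objective function like this, we find the W with the smallest error and from it the equation of the optimal line.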
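For the through-origin model y = Wx with the sum of squared errors as the cost, the differentiation step can be written out as follows. The notation C(W), x_i, y_i is introduced here for illustration, and the numbers in the last line use the data points (1, 1.2), (2, 1.8), (3, 2.7).

```latex
\begin{aligned}
C(W) &= \sum_{i} (W x_i - y_i)^2 \\
\frac{dC}{dW} &= 2 \sum_{i} x_i (W x_i - y_i) = 0
\;\Longrightarrow\;
W^{*} = \frac{\sum_{i} x_i y_i}{\sum_{i} x_i^{2}} \\
W^{*} &= \frac{1(1.2) + 2(1.8) + 3(2.7)}{1^2 + 2^2 + 3^2}
       = \frac{12.9}{14} \approx 0.921
\end{aligned}
```

Since C(W) is a parabola in W that opens upward, the point where the derivative is zero is its unique minimum.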

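You should be able to use this cost, a function of W, to derive the value of W from the given data. If only the function for y and the x values are available, plug x in to obtain y and then feed the result into the cost function.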

3. Logistic Regression

Use a linear model f(x) to estimate the probability that a new instance belongs to the class of interest.

For example, f(x)=0.85 => the probability that x belongs to this class is 0.85.

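Having found good weights W, we have a function that returns some value for any incoming data point, and we would like to use this value like a probability. That is, instead of merely saying that a new instance is positive or negative, we want to be able to say something like: a value of 0.85 means it belongs to this class with a probability of 85%.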

Can we use f(x) directly to estimate the class probability of x?

No! f(x) ranges from -infinity to +infinity, but a probability must lie between 0 and 1.

Fortunately, we can use f(x) to produce a model designed to give accurate estimates of class probability.

Instead of using f(x) directly, we use the logistic function p(x) to estimate the class probability of x.

Arithmetically, f(x) can take any value from -infinity to +infinity, whereas a probability lies in the range 0 to 1. So we transform f(x) to turn its value into one between 0 and 1.

That is, p(x) converts the value of f(x) to the value within 0 to 1.

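As f(x) grows, p(x) converges to 1, and as f(x) decreases toward -infinity, p(x) converges to 0. On the decision boundary, where f(x)=0, p(x) is 0.5. A curve of this shape is called the logistic function. We then make decisions using p(x) instead: f(x) smaller than 0 corresponds to p(x) smaller than 0.5, and f(x) greater than 0 corresponds to p(x) greater than 0.5, so the two views are equivalent. After this transformation p(x) ranges from 0 to 1 and can be treated like a probability, which is why we deliberately transform the expression.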
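A minimal sketch of this transformation, assuming the standard logistic function p(x) = 1 / (1 + e^(-f(x))). The formula itself is not spelled out in the text above, so it is stated here as an assumption. The two branches compute the same function and only avoid numerical overflow for very negative inputs.

```python
import math


def logistic(f_value):
    """Map a score f(x) in (-inf, +inf) to a value in (0, 1)."""
    if f_value >= 0:
        return 1.0 / (1.0 + math.exp(-f_value))
    # Equivalent form that avoids overflow when f_value is very negative.
    e = math.exp(f_value)
    return e / (1.0 + e)


for f_value in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f_value, logistic(f_value))
```

The output rises from near 0 to near 1 and equals 0.5 at f(x)=0, so thresholding p(x) at 0.5 gives the same classification as thresholding f(x) at 0.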