Machine Learning Modeling Algorithms

by 김범준 Jun 03. 2017

There will be a final exam this Saturday (3rd June, 13:00 - 14:30).

Some of the following topics will be asked in the exam.

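The final exam covers modeling in machine learning, and it felt like a waste not to keep what I studied, so I'm writing it down here.

This course is really broad and shallow, so I only summarized things briefly. I'm not majoring in this field, so some parts may be wrong (honestly, I only started studying two days before the exam, lol). For the details, a Google search will give you much friendlier explanations.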

*Perceptron

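- Linear score (w^T x) -> feed it into an activation function (unit step function) to check whether it is greater than 0 -> classify

- Assumes the data are linearly separable and the learning rate is small

- What matters in the algorithm is obtaining w (the linear decision boundary is determined by w)

- 1) Initialize w (to 0 or randomly)

- 2) Predict the output y, compare it with the actual y, and update w. Repeat 2).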
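Steps 1) and 2) can be sketched in plain Python as follows. This is only a minimal sketch: the bias term `b`, the toy AND dataset, and the values of `lr` and `epochs` are assumptions for illustration, not something from the course.

```python
def predict(w, b, x):
    """Unit step activation applied to the linear score w^T x + b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else 0


def train_perceptron(X, y, lr=0.1, epochs=100):
    w = [0.0] * len(X[0])  # 1) initialize w (zeros here)
    b = 0.0                # bias term (an assumption, not in the notes)
    for _ in range(epochs):
        for xi, target in zip(X, y):
            # 2) compare the prediction with the actual y and update w
            error = target - predict(w, b, xi)
            w = [wj + lr * error * xj for wj, xj in zip(w, xi)]
            b += lr * error
    return w, b


# Toy linearly separable data: logical AND
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]
w, b = train_perceptron(X, y)
predictions = [predict(w, b, xi) for xi in X]
```

When a prediction is already correct, `error` is 0 and w does not move, so only misclassified points change the boundary.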

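*Gradient Descent and Stochastic Gradient Descent

- Gradient descent: minimizes a cost function. Loosely, it is similar to finding the point where the derivative is 0. Moving the parameters a little at a time in the direction opposite to the gradient approaches a (local) minimum, so the parameters are updated repeatedly in that direction

- Stochastic gradient descent: when there are many input data points, computing the gradient over the whole dataset at every step is inefficient, so instead the gradient is computed on part of the data (a single example or a small batch) and the parameter update is repeated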
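The difference between the two updates can be sketched like this. The toy data (generated from y = 3x), the squared-error cost, the learning rates, and the step counts are all assumptions chosen for illustration.

```python
import random

# Toy data from y = 3x with no noise, so the best weight is w = 3
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x for x in xs]
data = list(zip(xs, ys))


def gradient(w, batch):
    """Gradient of the mean squared error over `batch` with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def gradient_descent(lr=0.05, steps=200):
    w = 0.0
    for _ in range(steps):
        w -= lr * gradient(w, data)  # uses the full dataset at every step
    return w


def stochastic_gradient_descent(lr=0.02, steps=2000, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        sample = [rng.choice(data)]  # one randomly chosen example per step
        w -= lr * gradient(w, sample)
    return w


w_gd = gradient_descent()
w_sgd = stochastic_gradient_descent()
```

Both end up near w = 3 here; each SGD step is cheaper but noisier, which is why it usually needs more steps and a smaller learning rate.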

*Logistic Linear Regression (Sigmoid Function)

- Linear score (w^T x) -> feed it into an activation function (sigmoid function) to get a probability -> classify with 0.5 as the threshold

- odds = p/(1-p)

- logit function = log(p/(1-p)); with p = sigmoid(w^T x), logit(p) = w^T x

- sigmoid function: an S-shaped function, 1/(1+exp(-x))
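A quick numerical check of the relationship between sigmoid and logit, as a sketch. The weights `w` and input `x` are hypothetical values picked so that w^T x = 2.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    return math.log(p / (1.0 - p))  # log of the odds p/(1-p)


def classify(w, x, threshold=0.5):
    z = sum(wi * xi for wi, xi in zip(w, x))  # linear score w^T x
    p = sigmoid(z)                            # probability of class 1
    return p, (1 if p >= threshold else 0)


w = [2.0, -1.0]  # hypothetical weights
x = [1.5, 1.0]   # hypothetical input, so w^T x = 2.0
p, label = classify(w, x)
recovered = logit(p)  # should give back w^T x
```

Because logit is the inverse of sigmoid, thresholding the probability at 0.5 is the same as checking whether w^T x is at least 0.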

*Support Vector Machines

- Maximizes the margin (the distance between the hyperplane and the closest data points).

- Support vectors are the data points closest to the hyperplane. The remaining data points do not affect the boundary and can be ignored.

- For nonlinear problems, a kernel is sometimes used
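To make "margin" and "support vector" concrete, here is a small sketch that measures distances to a fixed hyperplane. It does not solve the SVM optimization; the hyperplane `w`, `b` and the four points are hypothetical values chosen by hand so that the hyperplane sits halfway between the two classes.

```python
import math


def distance_to_hyperplane(w, b, x):
    """Distance from point x to the hyperplane w^T x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(score) / norm


# Hypothetical data: two points per class along the diagonal
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
w, b = [1.0, 1.0], -3.0  # hand-picked hyperplane x1 + x2 = 3

distances = [distance_to_hyperplane(w, b, p) for p in points]
margin = min(distances)
# Support vectors: the points that sit exactly at the margin
support_vectors = [p for p, d in zip(points, distances) if abs(d - margin) < 1e-9]
```

Moving (0, 0) or (3, 3) further away would leave `margin` and `support_vectors` unchanged, which is the sense in which the other points can be ignored.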

*Decision Tree

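- Intuitive, because it works like a series of questions

- Overfitting issue: set a max depth to prune the tree

- Repeatedly chooses the split that maximizes purity (how homogeneous the resulting nodes are)

- Impurity is computed with the Gini index or entropy, and the split tries to minimize it

- Algorithms include CHAID, CART, etc.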
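The impurity measures and a single split search can be sketched as below. This is an illustration on one numeric feature only, not a full CART or CHAID implementation; the function names and the toy data are assumptions.

```python
import math
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def weighted_impurity(left, right, measure):
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)


def best_threshold(xs, ys, measure=gini):
    """Try the midpoint between each pair of adjacent values; keep the purest split."""
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = weighted_impurity(left, right, measure)
        if best is None or score < best[1]:
            best = (t, score)
    return best


xs = [1, 2, 3, 10, 11, 12]
ys = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(xs, ys)
```

A real tree would apply the same search recursively to each side until the nodes are pure or the max depth is reached.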

*Random Forest

- An ensemble of DTs: combines weak learners into a strong learner to reduce error and avoid overfitting

- 1) Draw a random sample with bootstrap (sampling with replacement) -> 2) build a DT -> 3) repeat 1) and 2) -> 4) predict by voting. This process is called Bagging (Bootstrap Aggregation)

- Thanks to the randomness, the model has lower variance, so it performs better than an individual DT

- Less sensitive to outliers
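The bagging loop itself can be sketched as follows. To keep it short, the weak learner here is a hypothetical one-feature threshold rule (`fit_stump`) standing in for a full decision tree, so this shows bootstrap + voting rather than a real random forest; the data and `n_models` are also assumptions.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Stand-in weak learner: threshold halfway between the two class means."""
    zeros = [x for x, y in sample if y == 0]
    ones = [x for x, y in sample if y == 1]
    if not zeros or not ones:  # the bootstrap sample contained only one class
        only_label = sample[0][1]
        return lambda x: only_label
    mean0 = sum(zeros) / len(zeros)
    mean1 = sum(ones) / len(ones)
    t = (mean0 + mean1) / 2
    high = 1 if mean1 > mean0 else 0  # class predicted above the threshold
    return lambda x: high if x > t else 1 - high


def bagging(data, n_models=25, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):                      # 3) repeat 1) and 2)
        sample = [rng.choice(data) for _ in data]  # 1) bootstrap sample
        models.append(fit_stump(sample))           # 2) fit one learner on it
    return models


def vote(models, x):
    return Counter(m(x) for m in models).most_common(1)[0][0]  # 4) majority vote


data = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]
models = bagging(data)
low_prediction = vote(models, 1.5)
high_prediction = vote(models, 8.5)
```

Each learner sees a slightly different sample, so their individual mistakes tend to differ and get outvoted.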

*K-nearest neighbors

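- Lazy learner: it does not learn during training (there is no model); it only stores the training dataset

- 1) Choose k and a distance metric -> 2) find the k points closest to the data point to classify (in k-means, k means there are k clusters; in kNN, k is the number of points that vote) -> 3) vote

- Pro: the classifier adapts immediately to new data. Con: prediction is computationally expensive, since distances to the stored points have to be computed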
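The three steps fit in a few lines. This sketch assumes Euclidean distance (via `math.dist`, Python 3.8+) and a made-up training set.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """train is a list of (point, label) pairs; nothing is fitted in advance."""
    # 2) find the k stored points closest to the query
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # 3) majority vote among their labels
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


train = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
    ((6, 6), "b"), ((7, 6), "b"), ((6, 7), "b"),
]
label_near_a = knn_predict(train, (2, 2))
label_near_b = knn_predict(train, (6, 5))
```

Adding a new training example is just `train.append(...)`, but every prediction sorts the whole training set, which is where the cost comes from.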

*Regression Analysis

- Predicts a continuous-valued target

- Used to understand relationships between variables, estimate trends, and make predictions

- (Simple/multiple) linear regression: the goal is to learn the weights, which gives the linear fitting line
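For simple linear regression the weights have a closed form, sketched below. Ordinary least squares is assumed here (the notes do not say which fitting method the course used), and the data are made up.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # exactly y = 2x + 1
slope, intercept = fit_line(xs, ys)
prediction = slope * 4.0 + intercept  # predict the target for a new x
```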

*RANSAC

- RANdom SAmple Consensus

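- Linear regression is vulnerable to outliers -> the motivation is to run the regression on the inliers only

- Draw a random sample -> fit a model and count the data points whose error is at or below a threshold (i.e., the points that support the model) -> repeat this iteration N times -> select the model with the largest consensus (the one supported by the most data points)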
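The loop described above, sketched for fitting a line. Assumptions: each candidate model is the line through two sampled points, error is the vertical distance, and the usual final refit on the inliers is left out; `threshold`, `n_iter`, and the data are made up.

```python
import random


def line_through(p, q):
    (x1, y1), (x2, y2) = p, q
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1


def ransac_line(points, threshold=0.5, n_iter=100, seed=0):
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(n_iter):
        p, q = rng.sample(points, 2)  # draw a random minimal sample
        if p[0] == q[0]:
            continue                  # vertical line, skip this draw
        slope, intercept = line_through(p, q)
        # count the points that support this model
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + intercept)) <= threshold]
        if len(inliers) > len(best_inliers):  # keep the largest consensus
            best_model, best_inliers = (slope, intercept), inliers
    return best_model, best_inliers


# Ten points on y = 2x + 1 plus two gross outliers
points = [(float(x), 2.0 * x + 1.0) for x in range(10)]
points += [(2.0, 30.0), (7.0, -20.0)]
model, inliers = ransac_line(points)
```

An ordinary least-squares fit on all twelve points would be pulled toward the two outliers; here they simply never join the consensus set.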

*Random Forest Regression

- Regression using the mean prediction of a random forest built from regression trees (DTs whose prediction is a continuous value rather than a class)

*Clustering Analysis - K-means, Hierarchical Tree, DBSCAN

- Clustering: the goal is to find natural groupings

[K-means]

- Clusters the data by finding centroids

- 1) Randomly choose k centroids from the data points -> 2) for each point, find the nearest centroid and assign the point to it -> 3) update the centroids -> 4) repeat 2) and 3)

- One way to choose k is the elbow method: take the point where the slope of the distortion (SSE: sum of squared errors) curve flattens out as the elbow point, and use it as k
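Steps 1) to 4) and the SSE used by the elbow method can be sketched as below. Assumptions: points are tuples (1-D here for readability, though the code works for any dimension), the centroid update is the mean of the assigned points, and the loop stops once the centroids no longer move.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # 1) k random data points as centroids
    clusters = []
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:               # 2) assign each point to its nearest centroid
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [              # 3) move each centroid to its cluster mean
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # 4) repeat until nothing changes
            break
        centroids = new_centroids
    return centroids, clusters


def sse(centroids, clusters):
    """Distortion: sum of squared distances from each point to its centroid."""
    return sum(math.dist(p, c) ** 2
               for c, cluster in zip(centroids, clusters) for p in cluster)


points = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,)]
centroids, clusters = kmeans(points, k=2)
distortion_k1 = sse(*kmeans(points, k=1))
distortion_k2 = sse(centroids, clusters)
```

For the elbow method, you would compute this distortion for k = 1, 2, 3, ... and look for where the drop levels off; here it falls sharply from k = 1 to k = 2.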

[hierarchical cluster tree]

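- The number of clusters does not have to be set in advance

- There are two approaches: Agglomerative (bottom-up, i.e., merging observations step by step into clusters) and Divisive (top-down, i.e., starting from one cluster and splitting it step by step)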
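The agglomerative (bottom-up) direction can be sketched as follows. Assumptions: 1-D points and single linkage (cluster distance = distance between the closest pair of members); the notes do not say which linkage the course used.

```python
def agglomerative(points):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[p] for p in points]  # start: every observation is its own cluster
    merges = []
    while len(clusters) > 1:
        pairs = ((a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters)))
        # single linkage: distance between the closest members of two clusters
        i, j = min(pairs, key=lambda ab: min(abs(p - q)
                                             for p in clusters[ab[0]]
                                             for q in clusters[ab[1]]))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


points = [1.0, 2.0, 6.0, 7.0, 20.0]
merges = agglomerative(points)
```

The merge history is what a dendrogram draws; cutting it at a chosen height gives the clusters, which is why k is not needed up front.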

[DBSCAN]

- Density-Based Spatial Clustering of Applications with Noise

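- core points: points that have at least n reachable points around them (within a given radius)

- reachable points: not core points themselves, but reachable from a core point

- outliers: neither core nor reachable

- Pros: clusters can take arbitrary shapes (clusters with nonlinear boundaries are possible), the number of clusters does not have to be set, and noise can be discarded

- Cons: clustering is hard when densities differ a lot, and choosing the thresholds can be difficult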
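The three point types can be sketched like this. It only labels points and does not go on to grow the clusters; `eps` (the radius) and `min_pts` (the n above) are assumed names, the neighbor count here excludes the point itself, and the points are assumed to be distinct tuples.

```python
import math


def classify_points(points, eps, min_pts):
    neighbors = {
        p: [q for q in points if q != p and math.dist(p, q) <= eps]
        for p in points
    }
    core = {p for p in points if len(neighbors[p]) >= min_pts}
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neighbors[p]):
            labels[p] = "reachable"  # within eps of a core point
        else:
            labels[p] = "outlier"    # noise
    return labels


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (10.0, 10.0)]
labels = classify_points(points, eps=1.0, min_pts=2)
```

Both cons show up here: change `eps` or `min_pts` a little and the labels change, and no single pair of values suits regions with very different densities.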

*Artificial Neural Network

- Does this just mean a 3-layer network..?

*Deep Learning

- Deep neural networks: NNs made up of many layers

- MLP (multilayer perceptron): three layers: input layer, hidden layer, output layer

- 1) Starting at the input layer: training data -> output

- 2) calculate the error: based on the output

- 3) back propagate the error: update the model

- Convolutional Neural Network: input image -> convolutional layer -> feature maps -> pooling layer -> fully connected MLP

- Recurrent Neural Network: input layer, hidden layer, output layer, plus a recurrence in which the hidden layer's output from the previous time step is fed back in together with the next input
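Steps 1) to 3) for the MLP can be sketched on a tiny network. Everything specific here is an assumption for illustration: 2 inputs, 2 hidden units, 1 output, sigmoid activations, no bias terms, squared error, and hand-picked initial weights.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, W2):
    # 1) start at the input layer and propagate forward to the output
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    out = sigmoid(sum(w * hi for w, hi in zip(W2, h)))
    return h, out


def loss(x, target, W1, W2):
    return 0.5 * (forward(x, W1, W2)[1] - target) ** 2


def train_step(x, target, W1, W2, lr=0.1):
    h, out = forward(x, W1, W2)
    # 2) error at the output (derivative of the squared error through the sigmoid)
    delta_out = (out - target) * out * (1 - out)
    # 3) back propagate the error to the hidden layer, then update all weights
    delta_h = [delta_out * w * hi * (1 - hi) for w, hi in zip(W2, h)]
    new_W2 = [w - lr * delta_out * hi for w, hi in zip(W2, h)]
    new_W1 = [[w - lr * d * xi for w, xi in zip(row, x)]
              for row, d in zip(W1, delta_h)]
    return new_W1, new_W2


x, target = [1.0, 0.5], 1.0
W1 = [[0.1, -0.2], [0.4, 0.3]]  # hidden-layer weights (hand-picked)
W2 = [0.2, -0.1]                # output-layer weights (hand-picked)
loss_before = loss(x, target, W1, W2)
for _ in range(100):
    W1, W2 = train_step(x, target, W1, W2)
loss_after = loss(x, target, W1, W2)
```

After the updates the output has moved toward the target, so `loss_after` is smaller than `loss_before`.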

*Sentiment Analysis

- A subfield of Natural Language Processing (NLP)

- Deals with the author's attitude. It takes 'subjective' impressions, emotions, attitudes, and personal opinions about a topic and either extracts a binary polarity such as for/against or like/dislike, or analyzes the emotional state

- Also called opinion mining

- (e.g.) the classification of documents based on the expressed opinions or emotions of the authors with regard to a particular topic.

- Cleaning and preparing text data -> Building feature vectors from text documents -> Training a machine learning model to classify positive and negative movie reviews -> Working with large text datasets using out-of-core learning
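The first two stages of that pipeline (cleaning, then building feature vectors) can be sketched with the standard library. A plain bag-of-words count is assumed as the feature representation, and the two review strings are made up; training the classifier and out-of-core learning are not shown.

```python
import re
from collections import Counter


def clean(text):
    """Strip HTML tags, lowercase, and split into word tokens."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(docs):
    return sorted({token for doc in docs for token in clean(doc)})


def vectorize(doc, vocab):
    """Bag-of-words: how many times each vocabulary word occurs in doc."""
    counts = Counter(clean(doc))
    return [counts[word] for word in vocab]


docs = ["<br />I loved this movie, loved it!", "I hated this movie."]
vocab = build_vocabulary(docs)
vectors = [vectorize(doc, vocab) for doc in docs]
```

These count vectors are what a classifier such as logistic regression would then be trained on, with the positive/negative label of each review as the target.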
