ㅣogistic regression 결과 비교
오픈 데이터셋인 “bostonhousing_ord”와 “abalone_ord”에 multinomial logistic regression과 ordinal logistic regression를 적용하여 분류 결과를 비교해보았다.
한글이라는 언어로 무언가를 작성하는 업에서 이제는 추가로 파이썬이라는 언어로 무언가를 작성하는 일이 주어지고 있다. 앞으로는 더욱 많아 질지 모른다.
우리 모두가 이해하는 언어로 표현하여 의사를 전달하는 것이 신이 우리에게 준 축복 중 하나가 아닐까?
그것이 없었다면 제2의 언어도 제3의 언어도 없었을테고, 우리는 서로 소통하는데 있어 많은 어려움이 있을 뿐 아니라, 지금의 글로벌 경제/문화와 비대면 문화도 없었으리~
※ 아래 내용은 제 글을 구독해주시는 분들과 서로 이해하지 못하는 언어로 쓰는 글이지만, 이 코드를 작성하기 위해 상당한 창조적 고뇌가 있었기에 글로 남겨봅니다.
데이터 셋
1. bostonhousing_ord 데이터셋 결과
- Multinomial Logistic Regression Accuracy : 0.618421052631579
- Ordinal Logistic Regression Accuracy : 0.756578947368421
- 결론 : Multinomial Logistic Regression 보다 Ordinal Logistic Regression의 정확도가 좋다는 것을 확인할 수 있음
2. abalone_ord 데이터셋 결과
- Multinomial Logistic Regression Accuracy : 0.9577352472089314
- Ordinal Logistic Regression Accuracy : 0.7543859649122807
- 결론 : Bostonhousing 데이터와 다르게 abalone 데이터셋은 Ordinal Logistic Regression 보다 Multinomial Logistic Regression의 정확도가 좋다는 것을 확인할 수 있음
1. Data Loading & Description
# data loading
import pandas as pd
bostonhousing = pd.read_csv("bostonhousing_ord.csv")
# data 속성을 보기위한 변수 설정
data = pd.DataFrame(bostonhousing)
#target = data.response # print(target)
#print(data.shape) #data.head(3)
# data 상관계수 확인
data.info
data.corr = data.corr()
# data 시각화
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10,10))
sns.set(font_scale=1)
sns.heatmap(data.corr, annot=True, cbar=False, cmap="YlGnBu")
plt.show()
# 상관계수 정렬
##data.corr = data.corr()
##corr.order = data.corr().loc[:'V13', 'response'].abs().sort_values(ascending = False)
2. Model building in Scikit-learn
#split dataset in features and target variable
import numpy as np
feature_cols = ['response', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13']
X = bostonhousing[feature_cols] # Features
y = bostonhousing.response # Target variable은 1열의 임의 값인 response 데이터로 하였음8
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
from sklearn import model_selection
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
# 데이터셋을 7:3 비율로 데이터셋을 나누고 tratify 사용 층화추출
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=0, stratify=y)
print(X_train)
print(X_test)
print(y_train)
print(y_test)
. . .
3. Model Development and Prediction
3.1 ordinal logistic regression
# import the class
from sklearn.linear_model import LogisticRegression
# ordinal logistic regression instantiate the model
logreg_ordinal= LogisticRegression(solver='liblinear')
# ordinal logistic regression
## fit the model with data
logreg_ordinal.fit(X_train,y_train)
## prediction
y_pred_ord=logreg_ordinal.predict(X_test)
3.2 multinomial logistic regression
# multinomial logistic regression instantiate the model
logreg_multinomial= LogisticRegression(solver='sag' , multi_class = 'multinomial' )
## solver 설명
#For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster for large ones.
#For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial loss; ‘liblinear’ is limited to one-versus-rest schemes.
# -‘newton-cg’, ‘lbfgs’, ‘sag’ and ‘saga’ handle L2 or no penalty
# -‘liblinear’ and ‘saga’ also handle L1 penalty
# -‘saga’ also supports ‘elasticnet’ penalty
# ordinal logistic regression
## fit the model with data
logreg_multinomial.fit(X_train,y_train)
## prediction
y_pred_multi=logreg_multinomial.predict(X_test)
4. Model Evaluation using Confusion Matrix
4.1 Ordinal logistic regression cnf matrix
In [6]:
# import the metrics class
from sklearn import metrics
cnf_matrix_ord = metrics.confusion_matrix(y_test, y_pred_ord)
cnf_matrix_ord
Out[6]:
array([[21, 2, 0, 0, 0],
[ 5, 67, 0, 0, 0],
[ 0, 16, 20, 1, 0],
[ 0, 0, 8, 2, 1],
[ 0, 0, 0, 3, 6]], dtype=int64)
4.2 Multinomial logistic regression cnf matrix
# import the metrics class
from sklearn import metrics
cnf_matrix_multi = metrics.confusion_matrix(y_test, y_pred_multi)
cnf_matrix_multi
Out[7]:
array([[15, 8, 0, 0, 0],
[ 6, 62, 4, 0, 0],
[ 0, 21, 16, 0, 0],
[ 0, 2, 9, 0, 0],
[ 0, 4, 4, 1, 0
5. Visualizing Confusion Matrix using Heatmap
5.1 Ordinal Logistic Regression Result
# import required modules
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [9]:
class_names=[1,2,3,4,5] # name of classes
fig, ax = plt.subplots()
sns.set(font_scale=1.1)4
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks)
plt.yticks(tick_marks)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix_ord), annot=True, cmap="YlGnBu" ,fmt='g', xticklabels=True)
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Ordinary Logistic Regression Confusion matrix', fontsize=18, color = 'blue')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Out[9]:
Text(0.5, 257.44, 'Predicted label')
# 정확도 출력
print("Ordinary Logistic Regression Accuracy :",metrics.accuracy_score(y_test, y_pred_ord))
# test set, train set 정확도
print('TrainData Accuracy : ', "{:.5f}".format(logreg_ordinal.score(X_train, y_train)))
print('TestData Accuracy : ', "{:.5f}".format(logreg_ordinal.score(X_test, y_test)))
Ordinary Logistic Regressoin Accuracy : 0.7631578947368421
TrainData Accuracy : 0.84181
TestData Accuracy : 0.76316
5.2 Multinomial Logistic Regression Result
class_names=[1,2,3,4,5] # name of classes
fig, ax = plt.subplots()
sns.set(font_scale=1.1)
tick_marks = np.arange(len(class_names)) +1
plt.xticks(tick_marks)
plt.yticks(tick_marks)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix_multi), annot=True, cmap="YlGnBu" ,fmt='g', xticklabels=True)
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Multinomial Logistic Regression Confusion matrix', fontsize=18, color = 'blue')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Out[11]:
Text(0.5, 257.44, 'Predicted label')
# 정확도 출력
print("Multinomial Logistic Regression Accuracy :", metrics.accuracy_score(y_test, y_pred_multi))
# test set, train set 정확도
print('TrainData Accuracy : ', "{:.5f}".format(logreg_multinomial.score(X_train, y_train)))
print('TestData Accuracy : ', "{:.5f}".format(logreg_multinomial.score(X_test, y_test)))
Multinomial Logistic Regression Accuracy : 0.6118421052631579
TrainData Accuracy : 0.62147
TestData Accuracy : 0.61184
6. Comparison Multinomial & ordinal Logistic regression
X=['Ordinal Logistic Regression','Multinomial Logistic Regression']
ord = metrics.accuracy_score(y_test, y_pred_ord)
multi = metrics.accuracy_score(y_test, y_pred_multi)
y = [ord, multi]
df = pd.DataFrame(y, X)
ax = df.plot(kind='barh')
for p in ax.patches:
x, y, width, height = p.get_bbox().bounds
ax.text(width*1.02, y+height/2, "%.3f"%(width), va='center')
plt.title('Comparison Multinomial & ordinal Logistic regression of bostonhousing dataset', fontsize=15, y=1.1, color = 'blue')
plt.box(False); ax.get_legend().remove(); plt.show()
Multinomial Logistic Regression Accuracy : 0.618421052631579
Ordinal Logistic Regression Accuracy : 0.756578947368421
결론 : Multinomial Logistic Regression 보다 Ordinal Logistic Regression의 정확도가 좋다는 것을 확인할 수 있음
abalone_ord 데이터 셋의 분석 소스코드는 위 내용과 거의 유사하므로 생략 함