Exploring Moviegoer Data

by 보나벤투라

Nov 9, 2017

http://files.grouplens.org/datasets/movielens/ml-100k.zip

Practice data: the 100k dataset of ratings that users gave to movies

[root@client_server ~]# vi $SPARK_HOME/conf/spark-env.sh

export PYSPARK_DRIVER_PYTHON=/root/anaconda3/envs/py35/bin/ipython3

[root@client_server ~]# $SPARK_HOME/bin/pyspark --master spark://master:7077

With this setting, pyspark can be used through the IPython console!

# 1) Using RDD Transformations and Actions

>> rdd=sc.textFile("dataset/ml-100k/u.user")

>> rdd.take(1)

>> users_data=rdd.map(lambda x : x.split("|")) #sep="|"

>> users_num1=users_data.map(lambda x : x[0]).count() ; users_num1 #Number of users (x[0] is user_id)

>> users_occupation1=users_data.map(lambda x : x[3]).distinct().count() ; users_occupation1 #Number of occupations

>> count_by_gender1=users_data.map(lambda x : (x[2],1)).reduceByKey(lambda x,y:x+y).collect() ; count_by_gender1 #freq by gender

>> count_by_age1=users_data.map(lambda x : (x[1],1)).reduceByKey(lambda x,y:x+y).collect() ; count_by_age1[:5] #freq by age

>> count_by_occupation1=users_data.map(lambda x : (x[3],1)).reduceByKey(lambda x,y:x+y).collect() ; count_by_occupation1[:5] #freq by occupation
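For readers who want to see what the RDD calls above compute without a Spark cluster, here is a minimal plain-Python sketch of the same split / distinct / reduceByKey pattern. The sample lines are illustrative records written in the u.user layout (user_id|age|gender|occupation|zip), not the full file. Spark's collect() does not guarantee any order, so the sketch sorts the result only to make it predictable.

```python
from collections import Counter

# Illustrative sample lines in the u.user layout: user_id|age|gender|occupation|zip
sample_lines = [
    "1|24|M|technician|85711",
    "2|53|F|other|94043",
    "3|23|M|writer|32067",
    "4|24|M|technician|43537",
]

# Plain-Python counterpart of rdd.map(lambda x: x.split("|"))
records = [line.split("|") for line in sample_lines]

# Counterpart of map(lambda x: x[3]).distinct().count()
num_occupations = len({rec[3] for rec in records})

# Counterpart of map(lambda x: (x[2], 1)).reduceByKey(lambda x, y: x + y).collect()
count_by_gender = sorted(Counter(rec[2] for rec in records).items())

print(records[0])        # ['1', '24', 'M', 'technician', '85711']
print(num_occupations)   # 3
print(count_by_gender)   # [('F', 1), ('M', 3)]
```

Every field is still a string after the split, which is why the age has to be converted with int() before it is plotted.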

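The results above make it easy to check the number of users, the number of occupations, and the frequency by gender. The frequencies by age and by occupation, however, contain too many values to read at a glance.

So let's use Python's matplotlib library to look at their distributions.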

Exploring the distribution of movie-viewing frequency by age

>> import matplotlib.pyplot as plt

>> users_data=rdd.map(lambda x : x.split("|")) #sep="|"

>> count_by_age1=users_data.map(lambda x : (x[1],1)).reduceByKey(lambda x,y:x+y).collect() ; count_by_age1[:5] #freq by age

>> users_age1=users_data.map(lambda x : int(x[1])).collect()

>> plt.hist(users_age1,bins=len(count_by_age1),color="lightblue", density=True) #dist of ages (density=True replaces the older normed=True)

>> fig=plt.gcf() #get the current (active) figure

>> fig.show()

The second array printed on screen holds the bin edges produced by bins=len(count_by_age1).
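To make the bin edges concrete: when bins is an integer and no range is given, the histogram splits the span from the smallest to the largest value into that many equal-width bins, so the edge array has one more entry than the number of bins. The sketch below reproduces that rule in plain Python. The helper name equal_width_edges and the ages are hypothetical and used only for illustration.

```python
def equal_width_edges(values, num_bins):
    """Return num_bins + 1 equally spaced edges from min(values) to max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    return [lo + i * width for i in range(num_bins + 1)]

ages = [7, 18, 21, 24, 24, 30, 35, 47]   # hypothetical ages
edges = equal_width_edges(ages, 4)

print(edges)        # [7.0, 17.0, 27.0, 37.0, 47.0]
print(len(edges))   # 5, one more than the number of bins
```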

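I drew this histogram because I wanted to see exactly how the users are distributed at each individual age. Reducing the number of bins shows the distribution by age range instead.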

>> plt.hist(users_age1,bins=20,color="lightblue", density=True)

>> fig=plt.gcf()

>> fig.show()
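The normalized histogram used above scales each bar to count / (N * bin width), so the bar areas add up to 1 regardless of how many bins are chosen. Here is a rough plain-Python sketch of that arithmetic. The function name density_histogram and the ages are hypothetical, and the binning is a simplification of what the plotting library does internally.

```python
def density_histogram(values, num_bins):
    """Return (bar heights, bin width) for an equal-width, area-normalized histogram."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The last bin is closed on the right, so the maximum value falls into it
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    n = len(values)
    return [c / (n * width) for c in counts], width

ages = [7, 18, 21, 24, 24, 30, 35, 47]   # hypothetical ages
heights, width = density_histogram(ages, 4)
total_area = sum(h * width for h in heights)

print(heights)      # [0.0125, 0.05, 0.025, 0.0125]
print(total_area)   # approximately 1.0
```

Because the heights are densities rather than counts, histograms with different bin counts, such as the two above, stay comparable on the same vertical scale.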

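As shown above, a matplotlib histogram is enough to check the distribution, but seaborn can additionally draw a rug plot and a kernel density estimate. It is often used in place of plain matplotlib plotting. Note, however, that in seaborn versions before 0.8, simply importing seaborn replaces matplotlib's default style (background, axes, colors, and so on) with seaborn's own; from 0.8 onward the style changes only when sns.set() is called explicitly.

The plot shows that users from their late teens to about age 30 are the most frequent moviegoers in this dataset.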

>> import seaborn as sns

>> sns.distplot(users_age1,kde=True, rug=True) #distplot is deprecated since seaborn 0.11; histplot and rugplot are the newer equivalents

>> plt.show()
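For intuition about the kernel density curve drawn above, here is a minimal sketch of a Gaussian kernel density estimate in plain Python: each observation contributes a small bell curve, and the estimate is their average. The bandwidth of 5.0 and the ages are arbitrary assumptions for illustration; seaborn selects its bandwidth automatically, so its curve will not match this one exactly.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at x using Gaussian kernels."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

ages = [7, 18, 21, 24, 24, 30, 35, 47]     # hypothetical ages
kde = gaussian_kde(ages, bandwidth=5.0)    # bandwidth chosen arbitrarily

# Riemann sum of the estimated density over [-50, 100)
step = 0.1
area = sum(kde(-50 + i * step) * step for i in range(1500))

print(round(area, 3))       # close to 1.0, as expected for a density
print(kde(24) > kde(60))    # True: the estimate is higher where samples cluster
```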

Exploring the distribution of movie-viewing frequency by occupation

>> import matplotlib.pyplot as plt

>> import numpy as np

>> x_axis1=np.array([c[0] for c in count_by_occupation1])

>> y_axis1=np.array([c[1] for c in count_by_occupation1])

>> x_axis=x_axis1[np.argsort(y_axis1)] #occupations, ordered by ascending frequency

>> y_axis=y_axis1[np.argsort(y_axis1)] #frequencies, ascending

>> position=np.arange(len(x_axis)) # [0:21)

>> width=1.0

>> plt.bar(position, y_axis, width, color="lightblue")

>> ax=plt.gca() #current axes holding the bars (plt.axes() creates a new, empty axes in recent matplotlib)

>> ax.set_xticks(position) #check the position of occupation

>> ax.set_xticklabels(x_axis) #check occupation on the position

>> plt.xticks(rotation=30) #rotate labels

>> fig=plt.gcf()

>> fig.set_size_inches(16,10)

>> plt.show()

Among all the occupations, the 'student' group turns out to be the most frequent moviegoers.
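The np.argsort step used for the bar chart reorders the occupations by ascending frequency, which is why the tallest bar ends up on the right. The same ordering can be sketched in plain Python with sorted(). The (occupation, count) pairs below are illustrative values in the shape of count_by_occupation1, not the full result.

```python
# Illustrative (occupation, count) pairs in the shape of count_by_occupation1
pairs = [("engineer", 67), ("student", 196), ("doctor", 7), ("writer", 45)]

# Sort by count in ascending order, as x_axis1[np.argsort(y_axis1)] does
pairs_sorted = sorted(pairs, key=lambda p: p[1])
x_axis = [name for name, _ in pairs_sorted]
y_axis = [count for _, count in pairs_sorted]

print(x_axis)   # ['doctor', 'writer', 'engineer', 'student']
print(y_axis)   # [7, 45, 67, 196]
```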

The processing above relied mainly on RDD Transformations and Actions, but the same work can of course be done with Python's pandas from the start.

# 2) Using only Python's pandas library

>> import pandas as pd

>> colnames=['user_id','age','gender', 'occupation', 'zip']

>> users_df=pd.read_table("dataset/ml-100k/u.user", sep="|", header=None, names=colnames)

>> users_df.head()

>> users_num2=len(users_df['user_id'].value_counts()) ; users_num2 #Number of users

>> users_occupation2=len(users_df['occupation'].value_counts()) ; users_occupation2 #Number of occupations

>> a=dict(users_df["gender"].value_counts())

>> count_by_gender2=list(zip(a.keys(),a.values())) ; count_by_gender2 #freq by gender (for a Series)

>> b=dict(users_df["age"].value_counts())

>> count_by_age2=list(zip(b.keys(),b.values())) ; count_by_age2[:5] #freq by age

#A Series (a single column selected from the df) carries an Index; when it is cast to dict, Index -> Key

#zip pairs up sequences of the same length, element by element

>> c=dict(users_df["occupation"].value_counts())

>> count_by_occupation2=list(zip(c.keys(),c.values())) ; count_by_occupation2[:5] #freq by jobs
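The dict-then-zip idiom used above can be checked on a plain dict. In the sketch below, a hand-written dict with hypothetical counts stands in for dict(users_df["gender"].value_counts()). It also shows that zipping keys() with values() produces the same pairs as items().

```python
# A plain dict standing in for dict(users_df["gender"].value_counts());
# the counts are hypothetical
a = {"M": 3, "F": 1}

pairs_via_zip = list(zip(a.keys(), a.values()))
pairs_via_items = list(a.items())

print(pairs_via_zip)                      # [('M', 3), ('F', 1)]
print(pairs_via_zip == pairs_via_items)   # True
```

Either form yields the same list of (key, count) tuples that the RDD version returns from collect().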

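Findings from the exploration

- Distribution of movie-viewing frequency by age

: Users from their late teens to about age 30 are the most frequent moviegoers.

[Follow-up question] "Between the late teens and 30, which age range re-watches movies the most?"

- Distribution of movie-viewing frequency by occupation

: Among all the occupations, the 'student' group is the most frequent moviegoers.

[Follow-up question] "Do students also watch many movies individually? In other words, is this the occupation with the largest share of movie enthusiasts?"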
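One possible way to approach the second follow-up question is to count ratings per user and then average those counts within each occupation, so that a large group does not look more active just because it has more members. The sketch below is only an assumption about how such an analysis might start. It uses a handful of hypothetical records and treats the number of ratings as a stand-in for the number of movies watched.

```python
from collections import Counter, defaultdict

# Hypothetical user_id -> occupation mapping (the kind of information in u.user)
occupation_of = {"1": "student", "2": "student", "3": "engineer"}

# Hypothetical rating events, one user_id per rating
rating_user_ids = ["1", "1", "1", "2", "3", "3"]

ratings_per_user = Counter(rating_user_ids)

# Collect each user's rating count under that user's occupation
counts_by_occupation = defaultdict(list)
for user_id, occupation in occupation_of.items():
    counts_by_occupation[occupation].append(ratings_per_user[user_id])

total_ratings = {occ: sum(c) for occ, c in counts_by_occupation.items()}
mean_ratings = {occ: sum(c) / len(c) for occ, c in counts_by_occupation.items()}

print(total_ratings)   # {'student': 4, 'engineer': 2}
print(mean_ratings)    # {'student': 2.0, 'engineer': 2.0}
```

In this toy data the students have more ratings in total, yet the per-user average is the same for both occupations, which is exactly the distinction the follow-up question is after.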
