brunch

You can make anything
by writing

C.S.Lewis

by Master Seo Nov 20. 2023

38탄-17. EKS DB -Demo:TrinoTaxi


<1> 설치 및 기본 확인 : Analysis with a data lake

<2> Inspect data in S3

<3> Use Trino Web Interface

<4> Use Superset Web Interface

<5> Demo 삭제






<1> 설치 및 기본 확인 : Analysis with a data lake


1

설치 및 기본 확인 : Analysis with a data lake - LinkDemo



시스템 사양 요구 조건

To run this demo, your system needs at least:

cpu units (core/hyperthread)

      16GiB memory

28GiB disk storage



https://docs.stackable.tech/home/stable/demos/trino-taxi-data


https://docs.stackable.tech/home/stable/demos/


The data was put into the S3 storage → Trino enables you to query the data via SQL → Superset was used as a web-based frontend to execute SQL statements and build dashboards.




2

# Demo 정보 확인

stackablectl demo list

stackablectl demo list -o json | jq


stackablectl demo describe trino-taxi-data

 Demo             trino-taxi-data                                                                                              

 Description      Demo loading 2.5 years of New York taxi data into S3 bucket, creating a Trino table and a Superset dashboard 

 Documentation    https://docs.stackable.tech/stackablectl/stable/demos/trino-taxi-data.html     Stackable stack  trino-superset-s3                                                                                            

 Labels           trino, superset, minio, s3, ny-taxi-data



Superset 은 시각화

Trino를 통해 확인한다.





3

Emissary Ingress 와 OPA 소개 및 연동 실습 가이드


OPA

Open Policy Agent : 클라우드 네이티브 환경에 적합한 오픈 소스 정책 엔진, 다양한 플랫폼과 소프트웨어와 연동 가능 - 추천글   

OPA (발음 �  “oh-pa”) 는 정책에 대한 질의 시, 해당 정책을 평가하여 돌려줌



https://gasidaseo.notion.site/Emissary-Ingress-OPA-766c1bd473954054b5046c9e52494605#6811d428351d4d17a948325d37c888d2




4

Load testdata into S3. It contains 2.5 years of New York City taxi trips



5

Create Superset dashboards for visualization of the data


# [터미널] 모니터링

watch -d "kubectl get pod -n stackable-operators;echo;kubectl get pod,job,svc,pvc"



6

# 데모 설치 : 데이터셋 다운로드 job 포함 8분 정도 소요


stackablectl demo install trino-taxi-data


(8분 소요)



# 설치 확인

helm list -n stackable-operators



helm list



kubectl top node

kubectl top pod -A


kubectl get-all -n default



kubectl get deploy,sts,pod


스테이트 풀셋으로 설치된거  hive 등

디플로이로 설치된 minio






kubectl get job

보라색이 job



kubectl get job load-ny-taxi-data -o yaml | kubectl neat | cat -l yaml

kubectl get job create-ny-taxi-data-table-in-trino -o yaml | kubectl neat | cat -l yaml

kubectl get job setup-superset -o yaml | kubectl neat | cat -l yaml

kubectl get job superset -o yaml | kubectl neat | cat -l yaml



# 저장소 확인

kubectl get sc,pvc,pv

kubectl get pv |grep gp3

kubectl get sc secrets.stackable.tech -o yaml | kubectl neat | cat -l yaml

kubectl df-pv


# 서비스 확인

# 트리노는 노드 포트 사용 등

kubectl get svc,ep,endpointslices




kubectl get cm,secret


kubectl get cm minio -o yaml | kubectl neat | cat -l yaml

kubectl describe cm minio


kubectl get cm hive-metastore-default -o yaml | kubectl neat | cat -l yaml

kubectl get cm hive -o yaml | kubectl neat | cat -l yaml

kubectl get cm postgresql-hive-extended-configuration -o yaml | kubectl neat | cat -l yaml



#트리노 보기

kubectl get cm trino-coordinator-default -o yaml | kubectl neat | cat -l yaml

kubectl get cm trino-coordinator-default-catalog -o yaml | kubectl neat | cat -l yaml

kubectl get cm trino-worker-default -o yaml | kubectl neat | cat -l yaml

kubectl get cm trino-worker-default-catalog -o yaml | kubectl neat | cat -l yaml

kubectl get cm create-ny-taxi-data-table-in-trino-script -o yaml | kubectl neat | cat -l yaml


kubectl get cm superset-node-default -o yaml | kubectl neat | cat -l yaml

kubectl get cm superset-init-db -o yaml | kubectl neat | cat -l yaml

kubectl get cm setup-superset-script -o yaml | kubectl neat | cat -l yaml


kubectl get secret minio -o yaml | kubectl neat | cat -l yaml

kubectl get secret minio-s3-credentials -o yaml | kubectl neat | cat -l yaml

kubectl get secret postgresql-hive -o yaml | kubectl neat | cat -l yaml

kubectl get secret postgresql-superset -o yaml | kubectl neat | cat -l yaml

kubectl get secret trino-users -o yaml | kubectl neat | cat -l yaml

kubectl get secret trino-internal-secret -o yaml | kubectl neat | cat -l yaml

kubectl get secret superset-credentials -o yaml | kubectl neat | cat -l yaml

kubectl get secret superset-mapbox-api-key -o yaml | kubectl neat | cat -l yaml



# CRD 설치됨

kubectl get crd | grep stackable


(11-15-access@myeks:N/A) root@myeks-bastion-EC2:~# kubectl get crd | grep stackable

authenticationclasses.authentication.stackable.tech   2023-11-21T09:28:03Z

druidconnections.superset.stackable.tech              2023-11-21T09:29:01Z

hiveclusters.hive.stackable.tech                      2023-11-21T09:28:20Z

opaclusters.opa.stackable.tech                        2023-11-21T09:28:38Z

s3buckets.s3.stackable.tech                           2023-11-21T09:28:03Z

s3connections.s3.stackable.tech                       2023-11-21T09:28:03Z

secretclasses.secrets.stackable.tech                  2023-11-21T09:28:56Z

supersetclusters.superset.stackable.tech              2023-11-21T09:29:01Z

supersetdbs.superset.stackable.tech                   2023-11-21T09:29:01Z

trinocatalogs.trino.stackable.tech                    2023-11-21T09:29:20Z

trinoclusters.trino.stackable.tech                    2023-11-21T09:29:20Z

(11-15-access@myeks:N/A) root@myeks-bastion-EC2:~#




kubectl explain trinoclusters

kubectl describe trinoclusters.trino.stackable.tech


kubectl get hivecluster,opacluster,s3connection

kubectl get supersetcluster,supersetdb

kubectl get trinocluster,trinocatalog


kubectl get hivecluster -o yaml | kubectl neat | cat -l yaml

kubectl get s3connection -o yaml | kubectl neat | cat -l yaml

kubectl get supersetcluster -o yaml | kubectl neat | cat -l yaml

kubectl get supersetdb -o yaml | kubectl neat | cat -l yaml

kubectl get trinocluster -o yaml | kubectl neat | cat -l yaml

kubectl get trinocatalog -o yaml | kubectl neat | cat -l yaml



7

# 배포 스택 정보 확인 : 바로 확인 하지 말고, 설치 완료 후 아래 확인 할 것


배포 된것들 확인 할수 있다.

링크도 클릭해보자~


stackablectl stacklet list




minio 링크 접속 



8

# 배포 스택의 product 접속 계정 정보 확인 : 대부분 admin / adminadmin 계정 정보 사용



stackablectl stacklet credentials superset superset

stackablectl stacklet credentials minio minio-console  # admin / adminadmin 계정 정보 출력 안됨... 아직은 rc 단계라 그런듯



# 배포 오퍼레이터 확인

stackablectl operator installed




9

기본 작업 CMD에서 미 설정 시 : 집 PC에서 직접 ENDPOINTS(워커 노드의 NodePort)로 접속 설정 → 아래 보안 그룹 추가 후 접속 가능



# 워커노드의 '#-nodegroup-ng1-remoteAccess' 보안 그룹에 자신의 집 공인IP 접속 허용 추가

NGSGID=$(aws ec2 describe-security-groups --filters Name=group-name,Values='*ng1*' --query "SecurityGroups[*].[GroupId]" --output text)

aws ec2 authorize-security-group-ingress --group-id $NGSGID --protocol '-1' --cidr $(curl -s ipinfo.io/ip)/32





<2> Inspect data in S3


https://docs.stackable.tech/home/stable/demos/trino-taxi-data#_inspect_data_in_s3


https://min.io/


https://min.io/docs/minio/kubernetes/upstream/operations/installation.html#minio-operator-installation



PRODUCT minio ENDPOINTS console-http 접속 : admin / adminadmin



Click on the blue button Browse on the bucket demo and open the folders ny-taxi-data → raw


필드 : 승차 및 하차 날짜/시간, 승차 및 하차 위치, 이동 거리, 항목별 요금, 요금 유형, 지불 유형 및 운전자가 보고한 승객 수 - Link


You can see the file size (and therefore the number of rides) decrease drastically because of the Covid-19 pandemic starting from 2020-03


Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.

https://parquet.apache.org/images/FileLayout.gif




#

minio ?


kubectl get svc,ep minio-console


(11-15-access@myeks:N/A) root@myeks-bastion-EC2:~# kubectl get svc,ep minio-console

NAME                    TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE

service/minio-console   NodePort   10.100.108.99   <none>        9001:32029/TCP   36m

NAME                      ENDPOINTS            AGE

endpoints/minio-console   192.168.3.105:9001   36m





# 데이터셋 다운로드 동작 확인


kubectl get job load-ny-taxi-data -o yaml | kubectl neat | cat -l yaml -p

...

spec:

      containers:

      - command:

        - bash

        - -c

        - cd /tmp && for month in 2020-01 2020-02 2020-03 2020-04 2020-05 2020-06

          2020-07 2020-08 2020-09 2020-10 2020-11 2020-12 2021-01 2021-02 2021-03

          2021-04 2021-05 2021-06 2021-07 2021-08 2021-09 2021-10 2021-11 2021-12

          2022-01 2022-02 2022-03 2022-04; do curl -O https://repo.stackable.tech/repository/misc/ny-taxi-data/yellow_tripdata_$month.parquet

          && mc --insecure alias set minio http://minio:9000/ $(cat /minio-s3-credentials/accessKey)

          $(cat /minio-s3-credentials/secretKey) && mc cp yellow_tripdata_$month.parquet minio/demo/ny-taxi-data/raw/; done

...


// 잡으로 외부 택시 데이터를  스토리지에 올린것이다.




## 샘플 다운로드 : '연도-월' 만 바꿔서 다운로드 가능

curl -O https://repo.stackable.tech/repository/misc/ny-taxi-data/yellow_tripdata_2023-01.parquet



# 추가 정보

kubectl get cm minio -o yaml | kubectl neat | cat -l yaml

kubectl describe cm minio

kubectl get secret minio -o yaml | kubectl neat | cat -l yaml

kubectl get secret minio-s3-credentials -o yaml | kubectl neat | cat -l yaml





<3> Use Trino Web Interface


트리노

분산형 SQL 쿼리 엔진 = 분산해서 병렬로 처리해서 빠르게 처리한다.


stackablectl stacklet list






https://docs.stackable.tech/home/stable/demos/trino-taxi-data#_use_trino_web_interface


https://trino.io/



1

Trino, a query engine that runs at ludicrous speed : Fast distributed SQL query engine for big data analytics - Link


https://blog.treasuredata.com/blog/2015/03/20/presto-versus-hive/



2

Trino offers SQL access to the data within S3



3


PRODUCT trino ENDPOINTS coordinator-https 접속 : admin / adminadmin




4

#

kubectl get svc,ep trino-coordinator



#

kubectl get job create-ny-taxi-data-table-in-trino -o yaml | kubectl neat | cat -l yaml

kubectl get trinocluster,trinocatalog

kubectl get trinocluster -o yaml | kubectl neat | cat -l yaml -p

kubectl get trinocatalog -o yaml | kubectl neat | cat -l yaml -p

...

spec:

    connector:      # hive, s3

      hive:

        metastore:

          configMap: hive

        s3:

          reference: minio

...



#

kubectl get cm trino-coordinator-default -o yaml | kubectl neat | cat -l yaml

kubectl get cm trino-coordinator-default-catalog -o yaml | kubectl neat | cat -l yaml -p

...

data:

  hive.properties: |

    connector.name=hive

    hive.metastore.uri=thrift\://hive-metastore-default-0.hive-metastore-default.default.svc.cluster.local\:9083

    hive.s3.aws-access-key=${ENV\:CATALOG_HIVE_HIVE_S3_AWS_ACCESS_KEY}

    hive.s3.aws-secret-key=${ENV\:CATALOG_HIVE_HIVE_S3_AWS_SECRET_KEY}

    hive.s3.endpoint=http\://minio\:9000

    hive.s3.path-style-access=true

    hive.s3.ssl.enabled=false

    hive.security=allow-all

...



kubectl get cm trino-worker-default -o yaml | kubectl neat | cat -l yaml

kubectl get cm trino-worker-default-catalog -o yaml | kubectl neat | cat -l yaml

kubectl get cm create-ny-taxi-data-table-in-trino-script -o yaml | kubectl neat | cat -l yaml

#



kubectl get secret trino-users -o yaml | kubectl neat | cat -l yaml

kubectl get secret trino-internal-secret -o yaml | kubectl neat | cat -l yaml



5

trino-operator를 통해서 trino worker 2대로 증설


# 오퍼레이터 로깅 수준을 높여보자..

다른 터미널 1

kubectl logs -n stackable-operators -l app.kubernetes.io/instance=trino-operator -f



# trino worker 2대로 증설

kubectl get trinocluster trino -o json | cat -l json -p

kubectl patch trinocluster trino --type='json' -p='[{"op": "replace", "path": "/spec/workers/roleGroups/default/replicas", "value":2}]'


시간이 좀 걸림 10분


워커나 늘어남!!

2로 변경됨







6.다시 trino worker 1대로 축소..


kubectl patch trinocluster trino --type='json' -p='[{"op": "replace", "path": "/spec/workers/roleGroups/default/replicas", "value":1}]'






<4> Use Superset Web Interface



Superset  =가사화 툴






https://docs.stackable.tech/home/stable/demos/trino-taxi-data#_use_superset_web_interface



https://superset.apache.org/


https://superset.apache.org/docs/databases/trino/


1

Apache Superset™ is an open-source modern data exploration and visualization platform

Superset gives the ability to execute SQL queries and build dashboards



stackablectl stacklet list



 http://54.180.8.76:31194


Taxi data 대시보드  클릭하면 트리노가 연산 시작한다!!!!  연결되어 있다.



PRODUCT superset ENDPOINTS external-superset 접속 : admin / adminadmin





On the top click on the tab Dashboards → Click on the dashboard called Taxi data


It might take some time until the dashboards renders all the included charts. ⇒ 다소 시간 걸림(새로 고침),Trino 같이 확인


Taxi data 대시보드  클릭하면 트리노가 연산 시작한다!!!!  연결되어 있다.


2

You can clearly see the impact of Covid-19 on the taxi business → ‘20.3월 이후 부터 급격히 감소했음을 확인 할 수 있다



결과

#

kubectl get svc,ep superset-external



#

kubectl get job setup-superset -o yaml | kubectl neat | cat -l yaml

kubectl get job superset -o yaml | kubectl neat | cat -l yaml

kubectl get supersetcluster,supersetdb

kubectl get supersetcluster -o yaml | kubectl neat | cat -l yaml -p

kubectl get supersetdb -o yaml | kubectl neat | cat -l yaml -p



#

kubectl get cm superset-init-db -o yaml | kubectl neat | cat -l yaml -p

kubectl get cm superset-node-default -o yaml | kubectl neat | cat -l yaml -p

kubectl get cm setup-superset-script -o yaml | kubectl neat | cat -l yaml -p



#

kubectl get secret superset-credentials -o yaml | kubectl neat | cat -l yaml

kubectl get secret superset-mapbox-api-key -o yaml | kubectl neat | cat -l yaml




3

Execute arbitrary SQL statements


Within Superset you can not only create dashboards but also run arbitrary SQL statements. 

On the top click on the tab SQL Lab → SQL Editor






On the left select the database Trino, the schema demo and set See table schema to ny_taxi_data




select

  format_datetime(tpep_pickup_datetime, 'YYYY/MM') as month,

  count(*) as trips,

  sum(total_amount) as sales,

  avg(duration_min) as avg_duration_min

from ny_taxi_data

group by 1

order by 1






트리노가 연산 시작한다!!!!  연결되어 있다.

트리너 워커 노드를 늘리면 더 빠르게 처리 된다.~~




4

How many taxi trips there where in the year 2021?


select

  count(*) as trips

from ny_taxi_data

where year(tpep_pickup_datetime) = 2021



5

What was the maximum amount of passengers?


select

  max(passenger_count) as max_passenger_count

from ny_taxi_data;



# Returns 112 passengers. Well that’s weird. Let’s examine the passengers distribution.

select

  passenger_count,

  count(*) as frequency

from ny_taxi_data

group by 1

order by 1 desc

limit 100



6

What was the highest tip (measured in percentage of the original fee) ever given?


select

  total_amount as fee,

  tip_amount as tip,

  tip_amount / total_amount * 100 as tip_percentage

from ny_taxi_data

where total_amount > 0

order by 3 desc

limit 5


7

(옵션/참고) Where to go from here - Link




<5> Demo 삭제


1

#

kubectl delete supersetcluster,supersetdb superset

kubectl delete trinocluster trino && kubectl delete trinocatalog hive

kubectl delete hivecluster hive

kubectl delete s3connection minio

kubectl delete opacluster opa



#

helm uninstall postgresql-superset

helm uninstall postgresql-hive

helm uninstall minio



#

kubectl delete job --all

kubectl delete pvc --all



#

kubectl delete cm create-ny-taxi-data-table-in-trino-script setup-superset-script trino-opa-bundle

kubectl delete secret minio-s3-credentials secret-provisioner-tls-ca superset-credentials superset-mapbox-api-key trino-users

kubectl delete sa superset-sa



# operator 삭제

stackablectl operator uninstall superset trino hive secret opa commons



# 남은 리소스 확인

kubectl get-all -n stackable-operators




다음

https://brunch.co.kr/@topasvga/3539



감사합니다.

keyword
매거진의 이전글 38탄-16. EKS DB -Stackable
브런치는 최신 브라우저에 최적화 되어있습니다. IE chrome safari