1
설치 및 기본 확인 : Analysis with a data lake - LinkDemo
시스템 사양 요구 조건
To run this demo, your system needs at least:
7 cpu units (core/hyperthread)
16GiB memory
28GiB disk storage
https://docs.stackable.tech/home/stable/demos/trino-taxi-data
https://docs.stackable.tech/home/stable/demos/
The data was put into the S3 storage → Trino enables you to query the data via SQL → Superset was used as a web-based frontend to execute SQL statements and build dashboards.
2
# Demo 정보 확인
stackablectl demo list
stackablectl demo list -o json | jq
stackablectl demo describe trino-taxi-data
Demo trino-taxi-data
Description Demo loading 2.5 years of New York taxi data into S3 bucket, creating a Trino table and a Superset dashboard
Documentation https://docs.stackable.tech/stackablectl/stable/demos/trino-taxi-data.html Stackable stack trino-superset-s3
Labels trino, superset, minio, s3, ny-taxi-data
Superset 은 시각화
Trino를 통해 확인한다.
3
Emissary Ingress 와 OPA 소개 및 연동 실습 가이드
Open Policy Agent : 클라우드 네이티브 환경에 적합한 오픈 소스 정책 엔진, 다양한 플랫폼과 소프트웨어와 연동 가능 - 추천글
OPA (발음 � “oh-pa”) 는 정책에 대한 질의 시, 해당 정책을 평가하여 돌려줌
4
Load testdata into S3. It contains 2.5 years of New York City taxi trips
5
Create Superset dashboards for visualization of the data
# [터미널] 모니터링
watch -d "kubectl get pod -n stackable-operators;echo;kubectl get pod,job,svc,pvc"
6
stackablectl demo install trino-taxi-data
(8분 소요)
# 설치 확인
helm list -n stackable-operators
helm list
kubectl top node
kubectl top pod -A
kubectl get-all -n default
kubectl get deploy,sts,pod
스테이트 풀셋으로 설치된거 hive 등
디플로이로 설치된 minio
kubectl get job
보라색이 job
kubectl get job load-ny-taxi-data -o yaml | kubectl neat | cat -l yaml
kubectl get job create-ny-taxi-data-table-in-trino -o yaml | kubectl neat | cat -l yaml
kubectl get job setup-superset -o yaml | kubectl neat | cat -l yaml
kubectl get job superset -o yaml | kubectl neat | cat -l yaml
# 저장소 확인
kubectl get sc,pvc,pv
kubectl get pv |grep gp3
kubectl get sc secrets.stackable.tech -o yaml | kubectl neat | cat -l yaml
kubectl df-pv
# 서비스 확인
# 트리노는 노드 포트 사용 등
kubectl get svc,ep,endpointslices
kubectl get cm,secret
kubectl get cm minio -o yaml | kubectl neat | cat -l yaml
kubectl describe cm minio
kubectl get cm hive-metastore-default -o yaml | kubectl neat | cat -l yaml
kubectl get cm hive -o yaml | kubectl neat | cat -l yaml
kubectl get cm postgresql-hive-extended-configuration -o yaml | kubectl neat | cat -l yaml
#트리노 보기
kubectl get cm trino-coordinator-default -o yaml | kubectl neat | cat -l yaml
kubectl get cm trino-coordinator-default-catalog -o yaml | kubectl neat | cat -l yaml
kubectl get cm trino-worker-default -o yaml | kubectl neat | cat -l yaml
kubectl get cm trino-worker-default-catalog -o yaml | kubectl neat | cat -l yaml
kubectl get cm create-ny-taxi-data-table-in-trino-script -o yaml | kubectl neat | cat -l yaml
kubectl get cm superset-node-default -o yaml | kubectl neat | cat -l yaml
kubectl get cm superset-init-db -o yaml | kubectl neat | cat -l yaml
kubectl get cm setup-superset-script -o yaml | kubectl neat | cat -l yaml
kubectl get secret minio -o yaml | kubectl neat | cat -l yaml
kubectl get secret minio-s3-credentials -o yaml | kubectl neat | cat -l yaml
kubectl get secret postgresql-hive -o yaml | kubectl neat | cat -l yaml
kubectl get secret postgresql-superset -o yaml | kubectl neat | cat -l yaml
kubectl get secret trino-users -o yaml | kubectl neat | cat -l yaml
kubectl get secret trino-internal-secret -o yaml | kubectl neat | cat -l yaml
kubectl get secret superset-credentials -o yaml | kubectl neat | cat -l yaml
kubectl get secret superset-mapbox-api-key -o yaml | kubectl neat | cat -l yaml
# CRD 설치됨
kubectl get crd | grep stackable
(11-15-access@myeks:N/A) root@myeks-bastion-EC2:~# kubectl get crd | grep stackable
authenticationclasses.authentication.stackable.tech 2023-11-21T09:28:03Z
druidconnections.superset.stackable.tech 2023-11-21T09:29:01Z
hiveclusters.hive.stackable.tech 2023-11-21T09:28:20Z
opaclusters.opa.stackable.tech 2023-11-21T09:28:38Z
s3buckets.s3.stackable.tech 2023-11-21T09:28:03Z
s3connections.s3.stackable.tech 2023-11-21T09:28:03Z
secretclasses.secrets.stackable.tech 2023-11-21T09:28:56Z
supersetclusters.superset.stackable.tech 2023-11-21T09:29:01Z
supersetdbs.superset.stackable.tech 2023-11-21T09:29:01Z
trinocatalogs.trino.stackable.tech 2023-11-21T09:29:20Z
trinoclusters.trino.stackable.tech 2023-11-21T09:29:20Z
(11-15-access@myeks:N/A) root@myeks-bastion-EC2:~#
kubectl explain trinoclusters
kubectl describe trinoclusters.trino.stackable.tech
kubectl get hivecluster,opacluster,s3connection
kubectl get supersetcluster,supersetdb
kubectl get trinocluster,trinocatalog
kubectl get hivecluster -o yaml | kubectl neat | cat -l yaml
kubectl get s3connection -o yaml | kubectl neat | cat -l yaml
kubectl get supersetcluster -o yaml | kubectl neat | cat -l yaml
kubectl get supersetdb -o yaml | kubectl neat | cat -l yaml
kubectl get trinocluster -o yaml | kubectl neat | cat -l yaml
kubectl get trinocatalog -o yaml | kubectl neat | cat -l yaml
7
# 배포 스택 정보 확인 : 바로 확인 하지 말고, 설치 완료 후 아래 확인 할 것
배포 된것들 확인 할수 있다.
링크도 클릭해보자~
stackablectl stacklet list
minio 링크 접속
8
# 배포 스택의 product 접속 계정 정보 확인 : 대부분 admin / adminadmin 계정 정보 사용
stackablectl stacklet credentials superset superset
stackablectl stacklet credentials minio minio-console # admin / adminadmin 계정 정보 출력 안됨... 아직은 rc 단계라 그런듯
# 배포 오퍼레이터 확인
stackablectl operator installed
9
기본 작업 CMD에서 미 설정 시 : 집 PC에서 직접 ENDPOINTS(워커 노드의 NodePort)로 접속 설정 → 아래 보안 그룹 추가 후 접속 가능
# 워커노드의 '#-nodegroup-ng1-remoteAccess' 보안 그룹에 자신의 집 공인IP 접속 허용 추가
NGSGID=$(aws ec2 describe-security-groups --filters Name=group-name,Values='*ng1*' --query "SecurityGroups[*].[GroupId]" --output text)
aws ec2 authorize-security-group-ingress --group-id $NGSGID --protocol '-1' --cidr $(curl -s ipinfo.io/ip)/32
https://docs.stackable.tech/home/stable/demos/trino-taxi-data#_inspect_data_in_s3
PRODUCT minio ENDPOINTS console-http 접속 : admin / adminadmin
Click on the blue button Browse on the bucket demo and open the folders ny-taxi-data → raw
필드 : 승차 및 하차 날짜/시간, 승차 및 하차 위치, 이동 거리, 항목별 요금, 요금 유형, 지불 유형 및 운전자가 보고한 승객 수 - Link
You can see the file size (and therefore the number of rides) decrease drastically because of the Covid-19 pandemic starting from 2020-03
Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.
https://parquet.apache.org/images/FileLayout.gif
#
kubectl get svc,ep minio-console
(11-15-access@myeks:N/A) root@myeks-bastion-EC2:~# kubectl get svc,ep minio-console
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/minio-console NodePort 10.100.108.99 <none> 9001:32029/TCP 36m
NAME ENDPOINTS AGE
endpoints/minio-console 192.168.3.105:9001 36m
# 데이터셋 다운로드 동작 확인
kubectl get job load-ny-taxi-data -o yaml | kubectl neat | cat -l yaml -p
...
spec:
containers:
- command:
- bash
- -c
- cd /tmp && for month in 2020-01 2020-02 2020-03 2020-04 2020-05 2020-06
2020-07 2020-08 2020-09 2020-10 2020-11 2020-12 2021-01 2021-02 2021-03
2021-04 2021-05 2021-06 2021-07 2021-08 2021-09 2021-10 2021-11 2021-12
2022-01 2022-02 2022-03 2022-04; do curl -O https://repo.stackable.tech/repository/misc/ny-taxi-data/yellow_tripdata_$month.parquet
&& mc --insecure alias set minio http://minio:9000/ $(cat /minio-s3-credentials/accessKey)
$(cat /minio-s3-credentials/secretKey) && mc cp yellow_tripdata_$month.parquet minio/demo/ny-taxi-data/raw/; done
...
// 잡으로 외부 택시 데이터를 스토리지에 올린것이다.
## 샘플 다운로드 : '연도-월' 만 바꿔서 다운로드 가능
curl -O https://repo.stackable.tech/repository/misc/ny-taxi-data/yellow_tripdata_2023-01.parquet
# 추가 정보
kubectl get cm minio -o yaml | kubectl neat | cat -l yaml
kubectl describe cm minio
kubectl get secret minio -o yaml | kubectl neat | cat -l yaml
kubectl get secret minio-s3-credentials -o yaml | kubectl neat | cat -l yaml
트리노
분산형 SQL 쿼리 엔진 = 분산해서 병렬로 처리해서 빠르게 처리한다.
stackablectl stacklet list
https://docs.stackable.tech/home/stable/demos/trino-taxi-data#_use_trino_web_interface
1
Trino, a query engine that runs at ludicrous speed : Fast distributed SQL query engine for big data analytics - Link
https://blog.treasuredata.com/blog/2015/03/20/presto-versus-hive/
2
Trino offers SQL access to the data within S3
3
PRODUCT trino ENDPOINTS coordinator-https 접속 : admin / adminadmin
4
#
kubectl get svc,ep trino-coordinator
#
kubectl get job create-ny-taxi-data-table-in-trino -o yaml | kubectl neat | cat -l yaml
kubectl get trinocluster,trinocatalog
kubectl get trinocluster -o yaml | kubectl neat | cat -l yaml -p
kubectl get trinocatalog -o yaml | kubectl neat | cat -l yaml -p
...
spec:
connector: # hive, s3
hive:
metastore:
configMap: hive
s3:
reference: minio
...
#
kubectl get cm trino-coordinator-default -o yaml | kubectl neat | cat -l yaml
kubectl get cm trino-coordinator-default-catalog -o yaml | kubectl neat | cat -l yaml -p
...
data:
hive.properties: |
connector.name=hive
hive.metastore.uri=thrift\://hive-metastore-default-0.hive-metastore-default.default.svc.cluster.local\:9083
hive.s3.aws-access-key=${ENV\:CATALOG_HIVE_HIVE_S3_AWS_ACCESS_KEY}
hive.s3.aws-secret-key=${ENV\:CATALOG_HIVE_HIVE_S3_AWS_SECRET_KEY}
hive.s3.endpoint=http\://minio\:9000
hive.s3.path-style-access=true
hive.s3.ssl.enabled=false
hive.security=allow-all
...
kubectl get cm trino-worker-default -o yaml | kubectl neat | cat -l yaml
kubectl get cm trino-worker-default-catalog -o yaml | kubectl neat | cat -l yaml
kubectl get cm create-ny-taxi-data-table-in-trino-script -o yaml | kubectl neat | cat -l yaml
#
kubectl get secret trino-users -o yaml | kubectl neat | cat -l yaml
kubectl get secret trino-internal-secret -o yaml | kubectl neat | cat -l yaml
5
# 오퍼레이터 로깅 수준을 높여보자..
다른 터미널 1
kubectl logs -n stackable-operators -l app.kubernetes.io/instance=trino-operator -f
# trino worker 2대로 증설
kubectl get trinocluster trino -o json | cat -l json -p
kubectl patch trinocluster trino --type='json' -p='[{"op": "replace", "path": "/spec/workers/roleGroups/default/replicas", "value":2}]'
시간이 좀 걸림 10분
워커나 늘어남!!
2로 변경됨
6.다시 trino worker 1대로 축소..
kubectl patch trinocluster trino --type='json' -p='[{"op": "replace", "path": "/spec/workers/roleGroups/default/replicas", "value":1}]'
https://docs.stackable.tech/home/stable/demos/trino-taxi-data#_use_superset_web_interface
https://superset.apache.org/docs/databases/trino/
1
Apache Superset™ is an open-source modern data exploration and visualization platform
Superset gives the ability to execute SQL queries and build dashboards
stackablectl stacklet list
PRODUCT superset ENDPOINTS external-superset 접속 : admin / adminadmin
On the top click on the tab Dashboards → Click on the dashboard called Taxi data
It might take some time until the dashboards renders all the included charts. ⇒ 다소 시간 걸림(새로 고침),Trino 같이 확인
2
You can clearly see the impact of Covid-19 on the taxi business → ‘20.3월 이후 부터 급격히 감소했음을 확인 할 수 있다
결과
#
kubectl get svc,ep superset-external
#
kubectl get job setup-superset -o yaml | kubectl neat | cat -l yaml
kubectl get job superset -o yaml | kubectl neat | cat -l yaml
kubectl get supersetcluster,supersetdb
kubectl get supersetcluster -o yaml | kubectl neat | cat -l yaml -p
kubectl get supersetdb -o yaml | kubectl neat | cat -l yaml -p
#
kubectl get cm superset-init-db -o yaml | kubectl neat | cat -l yaml -p
kubectl get cm superset-node-default -o yaml | kubectl neat | cat -l yaml -p
kubectl get cm setup-superset-script -o yaml | kubectl neat | cat -l yaml -p
#
kubectl get secret superset-credentials -o yaml | kubectl neat | cat -l yaml
kubectl get secret superset-mapbox-api-key -o yaml | kubectl neat | cat -l yaml
3
Execute arbitrary SQL statements
Within Superset you can not only create dashboards but also run arbitrary SQL statements.
On the top click on the tab SQL Lab → SQL Editor
On the left select the database Trino, the schema demo and set See table schema to ny_taxi_data
select
format_datetime(tpep_pickup_datetime, 'YYYY/MM') as month,
count(*) as trips,
sum(total_amount) as sales,
avg(duration_min) as avg_duration_min
from ny_taxi_data
group by 1
order by 1
트리너 워커 노드를 늘리면 더 빠르게 처리 된다.~~
4
How many taxi trips there where in the year 2021?
select
count(*) as trips
from ny_taxi_data
where year(tpep_pickup_datetime) = 2021
5
What was the maximum amount of passengers?
select
max(passenger_count) as max_passenger_count
from ny_taxi_data;
# Returns 112 passengers. Well that’s weird. Let’s examine the passengers distribution.
select
passenger_count,
count(*) as frequency
from ny_taxi_data
group by 1
order by 1 desc
limit 100
6
What was the highest tip (measured in percentage of the original fee) ever given?
select
total_amount as fee,
tip_amount as tip,
tip_amount / total_amount * 100 as tip_percentage
from ny_taxi_data
where total_amount > 0
order by 3 desc
limit 5
7
(옵션/참고) Where to go from here - Link
1
#
kubectl delete supersetcluster,supersetdb superset
kubectl delete trinocluster trino && kubectl delete trinocatalog hive
kubectl delete hivecluster hive
kubectl delete s3connection minio
kubectl delete opacluster opa
#
helm uninstall postgresql-superset
helm uninstall postgresql-hive
helm uninstall minio
#
kubectl delete job --all
kubectl delete pvc --all
#
kubectl delete cm create-ny-taxi-data-table-in-trino-script setup-superset-script trino-opa-bundle
kubectl delete secret minio-s3-credentials secret-provisioner-tls-ca superset-credentials superset-mapbox-api-key trino-users
kubectl delete sa superset-sa
# operator 삭제
stackablectl operator uninstall superset trino hive secret opa commons
# 남은 리소스 확인
kubectl get-all -n stackable-operators
https://brunch.co.kr/@topasvga/3539
감사합니다.