
by 유윤식 Jun 12. 2024

Python: cProfiler

#cProfiler #pstats #pandas #polars

Profiler.

A profiler analyzes which functions were used and how, how long each one took,

and which functions call which other functions.
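As a rough illustration of what such a report contains, here is a minimal standard-library sketch. The functions `inner` and `outer` are hypothetical and exist only to give the profiler a caller and a callee to record.

```python
import cProfile
import io
import pstats


def inner(n):
    # Leaf function: does the actual work.
    total = 0
    for i in range(n):
        total += i * i
    return total


def outer():
    # Calls inner() three times, so the profile records a caller/callee pair.
    results = []
    for _ in range(3):
        results.append(inner(10_000))
    return results


profiler = cProfile.Profile()
profiler.enable()
outer()
profiler.disable()

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats()            # ncalls, tottime, cumtime per function
stats.print_callers("inner")   # which functions called inner()
report = buffer.getvalue()
print(report)
```

In the printed table, `ncalls` is the number of calls, `tottime` is time spent in the function itself, and `cumtime` also includes time spent in the functions it called.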

A simple example should be enough to remember how it works.

First, to compare Polars vs. pandas, I build the same data for each,

inflating the number of rows somewhat artificially(?).

import pandas as pd
import polars as pl

# Create the example data (7 base rows repeated 10,000 times = 70,000 rows)
data = {
    "name": ["Alice", "Bob", "Alice", "Bob", "Alice", "Bob", "David"] * 10000,
    "year": [2020, 2020, 2021, 2021, 2022, 2022, 2023] * 10000,
    "value": [10, 20, 15, 25, 30, 35, 78] * 10000,
}

pandas_df = pd.DataFrame(data)
polars_df = pl.DataFrame(data)

A more complex dataset (for example, nyc) would probably allow a better analysis.

After wrapping cProfile in a context manager,

I write functions that use several analysis operations (sort, groupby, rank, etc.).

import cProfile
import pstats
from io import StringIO
from contextlib import contextmanager

def pandas_groupby_partition():
    # Sort each name group by year, then number the rows within each group.
    # Note: pandas 2.2+ warns that apply() operating on the grouping column is deprecated.
    result = pandas_df.groupby('name').apply(lambda x: x.sort_values('year')).reset_index(drop=True)
    result['rank'] = result.groupby('name').cumcount() + 1
    return result

def polars_groupby_partition():
    result = polars_df.sort(by=['name', 'year'])
    # method='ordinal' numbers rows 1..n within each name, matching cumcount() + 1 above;
    # the default method ('average') would give tied years a shared fractional rank.
    result = result.with_columns(pl.col('year').rank(method='ordinal').over('name').alias('rank'))
    return result

@contextmanager
def profile_function():
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        yield
    finally:
        # Stop profiling and print the report even if the profiled block raises.
        profiler.disable()
        s = StringIO()
        sortby = 'cumulative'
        ps = pstats.Stats(profiler, stream=s).sort_stats(sortby)
        ps.print_stats()
        print()
        print(s.getvalue())

# Measure pandas performance
print("Pandas Groupby Partition Profiling:")
with profile_function():
    pandas_groupby_partition()

# Measure polars performance
print("\nPolars Groupby Partition Profiling:")
with profile_function():
    polars_groupby_partition()
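The context manager above prints every profiled function, which can get long. One possible variant, sketched here under my own assumptions, caps the report at the top few entries and yields the profiler so the caller can inspect it afterwards. The names `profile_block`, `sort_by`, `limit`, and `busy_work` are hypothetical and not part of the original code.

```python
import cProfile
import io
import pstats
from contextlib import contextmanager


@contextmanager
def profile_block(sort_by="cumulative", limit=10):
    """Profile the body of a with-block and print the top `limit` entries."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        yield profiler
    finally:
        profiler.disable()
        buffer = io.StringIO()
        stats = pstats.Stats(profiler, stream=buffer).sort_stats(sort_by)
        stats.print_stats(limit)  # an integer restricts the number of lines printed
        print(buffer.getvalue())


def busy_work():
    # Placeholder workload standing in for a pandas or Polars call.
    return sorted(range(50_000), key=lambda x: -x)


with profile_block(limit=5) as prof:
    busy_work()
```

Passing `limit` to `print_stats` keeps the output readable when the profiled code touches hundreds of internal functions.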

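Once written, this helper should stay useful for a long time.

As for the results...

Both functions clearly produce the same output,

yet the difference in the number of function calls is enormous...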
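To put a number on that kind of difference, the total call count can be read from the `pstats.Stats` object. The sketch below uses only the standard library and two hypothetical functions, `many_calls` and `few_calls`, as stand-ins for the pandas and Polars versions. It relies on the `total_calls` attribute that CPython's `pstats.Stats` sets, which is an assumption worth checking against your Python version.

```python
import cProfile
import pstats


def count_calls(func):
    """Run func under cProfile and return (result, total function calls)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func()
    finally:
        profiler.disable()
    return result, pstats.Stats(profiler).total_calls


def square(x):
    return x * x


def many_calls():
    # One Python-level function call per element.
    total = 0
    for i in range(1_000):
        total += square(i)
    return total


def few_calls():
    # Same result without any per-element function call.
    total = 0
    for i in range(1_000):
        total += i * i
    return total


slow_result, slow_calls = count_calls(many_calls)
fast_result, fast_calls = count_calls(few_calls)
print(f"same result: {slow_result == fast_result}")
print(f"total calls: {slow_calls} vs {fast_calls}")
```

The same `count_calls` helper could wrap `pandas_groupby_partition` and `polars_groupby_partition` to compare their call counts directly.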
