by Qscar, Feb 28, 2023

[AI] Optimization Technique 01

Intro & Quantization

Intro

There are several ways to optimize a model to reduce memory usage and latency. This is especially necessary for businesses like e-commerce.

Generally, ML engineers use the techniques below to optimize a model.


Optimization Techniques

1. Quantization

2. Pruning

3. Distillation


To cut to the chase, distillation is the best way to optimize a model because of ONNX. Quantization is the technique almost every library supports with simple code, and pruning is the hardest one to use.

Generally, the optimization techniques above are not used separately. In a Roblox Medium post[1], they show the effect of combining distillation and quantization, as in the image below.

Performance Improvement with Model Optimization - Image by Roblox

They showed that the combined optimized model is about 30x faster than the baseline. That means they could serve 30x more people without adding machines.

So I'll go through these techniques with a simple description and code for each. Let's start.




1. Quantization


What is Quantization?

Quantization is the discretization of floating-point numbers: the original floating-point range [fmin, fmax] is mapped onto the integer range [qmin, qmax] with a linear (affine) mapping.
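As a concrete illustration, here is a minimal sketch of that mapping (the function names are mine, not a library API):

```python
import torch

def affine_quantize(x: torch.Tensor, qmin: int = -128, qmax: int = 127):
    """Map the float range [fmin, fmax] onto the integer range [qmin, qmax]."""
    fmin, fmax = x.min().item(), x.max().item()
    scale = (fmax - fmin) / (qmax - qmin)            # width of one integer step
    zero_point = int(round(qmin - fmin / scale))     # integer that represents 0.0
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.int8)
    return q, scale, zero_point

def dequantize(q: torch.Tensor, scale: float, zero_point: int) -> torch.Tensor:
    """Approximate reconstruction: values are snapped back from the INT8 grid."""
    return scale * (q.float() - zero_point)

x = torch.randn(5)
q, scale, zp = affine_quantize(x)
print(x)
print(dequantize(q, scale, zp))  # close to x, but with small rounding error
```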

quantization effect on a transformer attention layer's weight

The image above shows the effect of quantization. I took an attention layer's weights from the official DistilBERT model on Hugging Face and quantized them with the torch.quantize_per_tensor function.
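A minimal sketch of how such a figure can be reproduced; the checkpoint and the choice of layer are my assumptions, while torch.quantize_per_tensor is the actual PyTorch API:

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("distilbert-base-uncased")
# Weights of the query projection in the first attention block (an arbitrary pick).
weights = model.transformer.layer[0].attention.q_lin.weight.detach()

# One scale/zero-point pair for the whole tensor ("per-tensor" quantization).
scale = ((weights.max() - weights.min()) / 255).item()
zero_point = 0
q_weights = torch.quantize_per_tensor(weights, scale, zero_point, torch.qint8)

print(weights[0, :5])
print(q_weights.dequantize()[0, :5])  # same values, snapped to the INT8 grid
```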


Quantized Tensor Performance

Then how much faster is it than the original? Let's measure the elapsed time and compare them. The test code looks like this.

code to compare latency between normal and quantized one


And its result is this.

result of the test code (the quantized one is about 5,500 times faster than the normal one)

As you can see above, the quantized tensor is about 5,500x faster than the normal one. Of course, the speedup depends on your machine; on my laptop it is about 100x faster. In addition, quantization is also good at reducing memory usage, because INT8 uses 4x fewer bits than FP32.
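Since the test code above is only shown as an image, here is a hedged sketch of one way to make such a comparison. It times the same element-wise multiplication on an FP32 tensor and on its quantized counterpart; the ratio you get depends heavily on the machine and on which operation is timed, so it will not necessarily match the figures above.

```python
import time
import torch
from torch.nn.quantized import QFunctional

weights = torch.randn(1024, 1024)
scale = ((weights.max() - weights.min()) / 255).item()
zero_point = int((-weights.min() / scale).round().item())
q_weights = torch.quantize_per_tensor(weights, scale, zero_point, torch.quint8)
q_fn = QFunctional()  # element-wise ops that work directly on quantized tensors

def avg_ms(fn, n=100):
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) / n * 1e3

fp32_ms = avg_ms(lambda: weights * weights)
int8_ms = avg_ms(lambda: q_fn.mul(q_weights, q_weights))
print(f"FP32: {fp32_ms:.3f} ms | INT8: {int8_ms:.3f} ms | speedup: {fp32_ms / int8_ms:.1f}x")
```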


Two Quantization Strategies

There are two approaches to quantization: dynamic quantization and static quantization.


Dynamic quantization happens after training, so nothing changes in the training process, only in inference. The model's weights and activations are quantized on the fly by converting them to INT8. Because of this, dynamic quantization is the simplest approach, though there can sometimes be a performance bottleneck due to the conversion back and forth between integers and floating-point numbers.
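In PyTorch this is essentially a one-liner. A minimal sketch on a toy model (the model itself is just an example; torch.quantization.quantize_dynamic is the real API):

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

# Weights of nn.Linear layers are converted to INT8 ahead of time; activations
# are quantized on the fly at inference time, so training is untouched.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized_model)  # the Linear layers are now DynamicQuantizedLinear
```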


Static quantization doesn't happen on the fly but in advance. It observes the activation patterns on a sample dataset and computes the quantization scheme before inference, which avoids the conversion between integers and floating-point numbers and makes inference faster. But this approach has a problem too: static quantization is anchored to the sample dataset, so if that dataset isn't representative or anomalous inputs show up, static quantization performs poorly.
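For contrast, here is a minimal sketch of PyTorch's eager-mode static quantization, where a calibration pass over sample data fixes the quantization parameters in advance (the toy model and the random calibration data are assumptions for illustration):

```python
import torch
from torch import nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # FP32 -> INT8 at the input
        self.fc = nn.Linear(16, 4)
        self.dequant = torch.quantization.DeQuantStub()  # INT8 -> FP32 at the output

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = TinyNet().eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
prepared = torch.quantization.prepare(model)

# Calibration: run representative samples so observers can record activation ranges.
for _ in range(32):
    prepared(torch.randn(8, 16))

static_int8 = torch.quantization.convert(prepared)  # freeze scales and zero-points
```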


Quantized Model vs. Non-Quantized Model

For that reason, dynamic quantization is usually used despite its bottleneck. We can use it easily with almost every DL library, such as torch, tensorflow, and transformers. Let's check the performance of a quantized model.


The test code to benchmark each model's performance is below.

benchmark code

The benchmark code above estimates model size (MB), model performance (accuracy), and latency, using optim_type names to label the different options. The test model is a classification model fine-tuned on the clinc_oos dataset, which consists of text-intent pairs.
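Since the benchmark only appears as an image, here is a hedged sketch of what it measures; the helper names (model_size_mb, benchmark) and the implementation details are mine:

```python
import time
from pathlib import Path

import torch
from datasets import load_dataset

def model_size_mb(model: torch.nn.Module) -> float:
    """Serialize the state dict to disk and report its size in MB."""
    tmp = Path("tmp_model.pt")
    torch.save(model.state_dict(), tmp)
    size = tmp.stat().st_size / (1024 ** 2)
    tmp.unlink()
    return size

def benchmark(pipe, dataset, intents, optim_type: str, n_latency: int = 100) -> dict:
    # Accuracy: predicted label vs. the reference intent on the test split.
    correct = sum(
        pipe(example["text"])[0]["label"] == intents.int2str(example["intent"])
        for example in dataset
    )
    accuracy = correct / len(dataset)

    # Latency: average time for a single query, in milliseconds.
    query = dataset[0]["text"]
    start = time.perf_counter()
    for _ in range(n_latency):
        pipe(query)
    latency_ms = (time.perf_counter() - start) / n_latency * 1e3

    return {"optim_type": optim_type, "size_mb": model_size_mb(pipe.model),
            "accuracy": accuracy, "latency_ms": latency_ms}

# clinc_oos: pairs of text and intent, as described above.
clinc = load_dataset("clinc_oos", "plus", split="test")
intents = clinc.features["intent"]
```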


The model loading code is below:

Load Model Code
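A hedged sketch of the loading step; the checkpoint name is an assumption based on reference [2], and any DistilBERT classifier fine-tuned on clinc_oos would do:

```python
import torch
from torch import nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

ckpt = "transformersbook/distilbert-base-uncased-finetuned-clinc"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt).eval()

# Baseline FP32 pipeline.
pipe_fp32 = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Dynamically quantized variant: nn.Linear layers run in INT8.
model_int8 = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
pipe_int8 = pipeline("text-classification", model=model_int8, tokenizer=tokenizer)
```

Running the benchmark sketch above on pipe_fp32 and pipe_int8, each with its own optim_type name, then gives the numbers compared below.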


Let's check the performance and compare the quantized model with the original.

Check Benchmark

As you can see, the quantized model is about 2x smaller and about 20% faster, with similar accuracy. If you were expecting something 100~10,000x faster, sorry to say it's impossible, because there are bottlenecks in the model structure. We can only quantize the parts of the model that are quantizable, and this is why we should consider other strategies at the same time!



References

[1] https://medium.com/@quocnle/how-we-scaled-bert-to-serve-1-billion-daily-requests-on-cpus-d99be090db26

[2] https://huggingface.co/transformersbook
