Intro & Quantization
There are several ways to optimize a model to reduce memory usage and latency, and doing so is especially important for businesses like e-commerce.
Generally, ML engineers use the techniques below to optimize a model.
To cut to the chase: distillation is the best way to optimize a model because of ONNX, quantization is the technique that almost every library supports with only a few lines of code, and pruning is the hardest one to use.
Generally, these optimization techniques are not used separately. In a Roblox Medium post[1], they show the effect of combining distillation and quantization, as in the image below.
They showed that the combined optimized model is about 30x faster than the original one, which means they could serve 30x more people without adding any machines.
So I'll go over these techniques with a simple description and code for each. Let's start.
Quantization is the discretization of floating-point numbers: the original range [fmin, fmax] is mapped linearly onto an integer range [qmin, qmax].
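In code, that mapping boils down to a scale and a zero point. Here is a minimal sketch of the idea (the function names are mine, not from any library):

```python
import torch

def linear_quantize(x: torch.Tensor, qmin: int = -128, qmax: int = 127):
    """Map values from [fmin, fmax] linearly onto the integer range [qmin, qmax]."""
    fmin, fmax = x.min().item(), x.max().item()
    scale = (fmax - fmin) / (qmax - qmin)          # width of one integer step
    zero_point = int(round(qmin - fmin / scale))   # integer that represents 0.0
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.int8)
    return q, scale, zero_point

def linear_dequantize(q: torch.Tensor, scale: float, zero_point: int) -> torch.Tensor:
    """Recover approximate floating-point values from the INT8 representation."""
    return (q.float() - zero_point) * scale
```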
The image above shows the effect of quantization. I took an attention layer's weight matrix from the official DistilBERT model on Hugging Face and quantized it with the torch.quantize_per_tensor function.
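A minimal sketch of that experiment could look like this (the specific layer and the scale/zero-point choices here are my own, not necessarily the ones behind the image):

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("distilbert-base-uncased")

# Grab one attention weight matrix from the first transformer block.
weights = model.transformer.layer[0].attention.q_lin.weight.detach()

# Map the FP32 range onto INT8 with a single scale and zero point.
zero_point = 0
scale = (weights.max() - weights.min()) / (127 - (-128))
quantized = torch.quantize_per_tensor(weights, scale.item(), zero_point, torch.qint8)

print(weights[0, :5])                  # original FP32 values
print(quantized.int_repr()[0, :5])     # underlying INT8 values
print(quantized.dequantize()[0, :5])   # back to FP32, with rounding error
```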
Then how much faster is it than the original? Let's measure the elapsed time and compare them. The test code looks like this.
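The exact code isn't reproduced here, but a rough sketch of such a timing comparison, reusing `weights` and `quantized` from above and timing an element-wise multiply in each representation (one possible setup, not necessarily the original one), could be:

```python
import time
from torch.nn.quantized import QFunctional

def benchmark(fn, n_runs: int = 100) -> float:
    """Average elapsed time of fn over n_runs calls."""
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    return (time.perf_counter() - start) / n_runs

q_fn = QFunctional()  # dispatches arithmetic ops to the quantized kernels

fp32_time = benchmark(lambda: weights * weights)               # FP32 multiply
int8_time = benchmark(lambda: q_fn.mul(quantized, quantized))  # INT8 multiply

print(f"FP32: {fp32_time * 1e6:.1f} us, INT8: {int8_time * 1e6:.1f} us")
```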
And its result is this.
As you can see above, the quantized operation is about 5,500x faster than the normal one. Of course, the result depends on your machine; on my laptop it is about 100x faster. In addition, quantization is also good at reducing memory usage, because INT8 uses 4x fewer bits than FP32.
There are two kinds of quantization: dynamic quantization and static quantization.
Dynamic quantization happens after training, so nothing changes in the training process; only inference changes. The model's weights are converted to INT8 once, and activations are quantized on the fly during inference. Because of this, dynamic quantization is the simplest approach, although there can be a performance bottleneck from the conversion between integers and floating-point numbers.
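In PyTorch this is essentially a one-liner; a minimal sketch on an illustrative DistilBERT classifier (the checkpoint name is just an example):

```python
import torch
from torch.quantization import quantize_dynamic
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

# Convert the weights of every nn.Linear to INT8 once;
# activations are quantized on the fly at inference time.
model_dynamic = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
```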
Static quantization doesn't happen on the spot but in advance: it observes activation patterns on a sample dataset and computes the quantization scheme before inference, which avoids the runtime conversion between integers and floating-point numbers and makes inference faster. But this approach also has a problem: static quantization is anchored to the sample dataset, so if that dataset isn't representative or anomalous inputs show up, static quantization can perform poorly.
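For comparison, the eager-mode PyTorch workflow for static quantization looks roughly like this. This is only a sketch: `model` and `calibration_loader` are placeholders, and transformer models need extra QuantStub/DeQuantStub plumbing in practice.

```python
import torch

model.eval()
# Choose observers for the x86 ("fbgemm") backend.
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")

# Insert observers that will record activation ranges.
model_prepared = torch.quantization.prepare(model)

# Calibration: run representative samples so the observers see typical activations.
with torch.no_grad():
    for batch in calibration_loader:
        model_prepared(batch)

# Freeze the observed ranges into scales/zero points and swap in INT8 modules.
model_static = torch.quantization.convert(model_prepared)
```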
Because of this anchoring problem, dynamic quantization is usually used despite its bottleneck. We can apply it easily with almost every DL library, such as PyTorch, TensorFlow, and Transformers. Let's check the performance of a quantized model.
The test code to benchmark each model's performance is below.
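The original harness isn't shown here; a minimal sketch of such a benchmark (the class and method names are mine, and it assumes the pipeline's id2label maps to the clinc_oos intent names) could look like this:

```python
import time
from pathlib import Path

import numpy as np
import torch

class PerformanceBenchmark:
    """Measure size (MB), accuracy, and latency of a text-classification pipeline."""

    def __init__(self, pipeline, dataset, optim_type="baseline"):
        self.pipeline = pipeline      # transformers text-classification pipeline
        self.dataset = dataset        # clinc_oos split with "text" and "intent" fields
        self.optim_type = optim_type  # label used to keep several variants apart

    def compute_size(self):
        # Serialize the weights to disk and report the file size in MB.
        tmp_path = Path("model.pt")
        torch.save(self.pipeline.model.state_dict(), tmp_path)
        size_mb = tmp_path.stat().st_size / (1024 * 1024)
        tmp_path.unlink()
        return {"size_mb": size_mb}

    def compute_accuracy(self):
        # Compare predicted intent names against the gold intent ids.
        intents = self.dataset.features["intent"]
        correct = 0
        for example in self.dataset:
            pred = self.pipeline(example["text"])[0]["label"]
            correct += int(intents.str2int(pred) == example["intent"])
        return {"accuracy": correct / len(self.dataset)}

    def compute_latency(self, query="What is the exchange rate?", n_runs=100):
        # Warm up, then time repeated single-query calls.
        for _ in range(10):
            self.pipeline(query)
        latencies = []
        for _ in range(n_runs):
            start = time.perf_counter()
            self.pipeline(query)
            latencies.append(time.perf_counter() - start)
        return {"latency_ms_avg": 1000 * np.mean(latencies),
                "latency_ms_std": 1000 * np.std(latencies)}

    def run(self):
        metrics = {}
        metrics.update(self.compute_size())
        metrics.update(self.compute_accuracy())
        metrics.update(self.compute_latency())
        return {self.optim_type: metrics}
```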
The benchmark code above estimates model size (MB), model performance (accuracy), and latency, and takes an optim_type name so the results of several variants can be stored side by side. The test model is a classification model fine-tuned on the clinc_oos dataset, which consists of text and intent pairs.
The model-loading code is below:
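Something along these lines (the checkpoint name is an assumption; substitute your own model fine-tuned on clinc_oos):

```python
from datasets import load_dataset
from transformers import pipeline

# Any text classifier fine-tuned on clinc_oos will do; this name is illustrative.
model_ckpt = "transformersbook/distilbert-base-uncased-finetuned-clinc"
pipe = pipeline("text-classification", model=model_ckpt)

# clinc_oos pairs short utterances ("text") with intent labels ("intent").
clinc = load_dataset("clinc_oos", "plus")
test_set = clinc["test"]
```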
Let's check the performance and compare the quantized model against the unquantized one.
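Putting the pieces together, one way to run the comparison with the benchmark sketch from above:

```python
import torch
from torch.quantization import quantize_dynamic
from transformers import pipeline

# Baseline numbers for the FP32 pipeline.
metrics = PerformanceBenchmark(pipe, test_set, optim_type="baseline").run()

# Dynamically quantize the Linear layers and benchmark with the same harness.
quantized_model = quantize_dynamic(pipe.model, {torch.nn.Linear}, dtype=torch.qint8)
quantized_pipe = pipeline("text-classification", model=quantized_model,
                          tokenizer=pipe.tokenizer)
metrics.update(PerformanceBenchmark(quantized_pipe, test_set,
                                    optim_type="dynamic quantization").run())

print(metrics)
```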
As you can see, the quantized model is about 2x smaller and about 20% faster, with similar accuracy. If you were expecting a 100x to 10,000x speedup, sorry to say that's impossible here, because there are bottlenecks in the model structure: only the parts of the model that can be quantized actually get quantized, and this is why we should consider other strategies at the same time!
# reference