
Introduction to Quantization on PyTorch

Properties
authors Raghuraman Krishnamoorthi, James Reed, Min Ni, Chris Gottbrath, Seth Weidman
year 2020
url https://pytorch.org/blog/introduction-to-quantization-on-pytorch/

Notes

Quantization-aware training is typically only used in CNN models when post-training static or dynamic quantization doesn't yield sufficient accuracy. This can occur with models that are highly optimized to achieve small size (such as Mobilenet).
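
As a reminder of what that QAT workflow looks like in the eager-mode API the post describes, here is a minimal sketch; TinyCNN, the layer sizes, and the "fbgemm" backend are placeholder choices of mine, and the training loop is elided:

```python
import torch

# Toy stand-in for a small CNN such as Mobilenet (layer sizes are arbitrary).
class TinyCNN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # fp32 -> int8 boundary
        self.conv = torch.nn.Conv2d(3, 8, 3)
        self.relu = torch.nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()  # int8 -> fp32 boundary

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.conv(x))
        return self.dequant(x)

model = TinyCNN().train()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")  # x86 backend
torch.quantization.prepare_qat(model, inplace=True)  # inserts fake-quant modules

# ... the usual training (or fine-tuning) loop goes here, so the weights adapt
# to quantization noise while the observers learn activation ranges ...

model.eval()
quantized_model = torch.quantization.convert(model)  # swap in real int8 modules
```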

Currently, operator coverage is limited and may restrict the choices listed in the table below; the table provides a guideline (minimal code sketches for the dynamic and static schemes follow the two tables).

| Model Type | Preferred scheme | Why |
| --- | --- | --- |
| LSTM/RNN | Dynamic Quantization | Throughput dominated by compute/memory bandwidth for weights |
| BERT/Transformer | Dynamic Quantization | Throughput dominated by compute/memory bandwidth for weights |
| CNN | Static Quantization | Throughput limited by memory bandwidth for activations |
| CNN | Quantization Aware Training | In the case where accuracy can't be achieved with static quantization |
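
For the dynamic-quantization rows above, a minimal sketch of the API; TinySeqModel and all layer sizes are made up for illustration:

```python
import torch

# Toy sequence model standing in for an LSTM/Transformer-style network.
class TinySeqModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = torch.nn.LSTM(input_size=128, hidden_size=256, num_layers=2)
        self.fc = torch.nn.Linear(256, 10)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[-1])

model = TinySeqModel().eval()

# Post-training dynamic quantization: weights are converted to int8 ahead of
# time, activations are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.LSTM, torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(32, 1, 128)          # (seq_len, batch, features)
print(quantized_model(x).shape)      # torch.Size([1, 10])
```

quantize_dynamic only replaces the listed module types (here nn.LSTM and nn.Linear), which is why it suits models whose runtime is dominated by large weight matrices.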

Does the Transformer row also apply to vision transformers, given that the number of tokens is quite large?

| Model | Float Latency (ms) | Quantized Latency (ms) | Inference Performance Gain | Device | Notes |
| --- | --- | --- | --- | --- | --- |
| BERT | 581 | 313 | 1.8x | Xeon-D2191 (1.6GHz) | Batch size = 1, maximum sequence length = 128, single thread, x86-64, dynamic quantization |
| Resnet-50 | 214 | 103 | 2x | Xeon-D2191 (1.6GHz) | Single thread, x86-64, static quantization |
| Mobilenet-v2 | 97 | 17 | 5.7x | Samsung S9 | Static quantization; floating point numbers are based on the Caffe2 run-time and are not optimized |
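
For the static-quantization rows (Resnet-50, Mobilenet-v2), the post-training static workflow adds module fusion and a calibration pass instead of training. A minimal sketch, reusing the same made-up TinyCNN as in the QAT sketch and random data as a placeholder calibration set:

```python
import torch

# Same toy CNN pattern as in the QAT sketch above.
class TinyCNN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(3, 8, 3)
        self.relu = torch.nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.conv(x))
        return self.dequant(x)

model = TinyCNN().eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")

# Fuse conv+relu so a single quantized kernel computes both operations.
torch.quantization.fuse_modules(model, [["conv", "relu"]], inplace=True)
torch.quantization.prepare(model, inplace=True)  # inserts observers

# Calibration: run representative inputs so observers record activation ranges.
with torch.no_grad():
    for _ in range(10):
        model(torch.randn(1, 3, 32, 32))

quantized_model = torch.quantization.convert(model)  # int8 weights and activations
out = quantized_model(torch.randn(1, 3, 32, 32))     # float in/out, int8 compute inside
```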

So I should expect around a 2x latency improvement with dynamic quantization.