PyTorch Quantization
Properties | |
---|---|
authors | PyTorch Quantization for TensorRT |
year | 2024 |
url | https://pytorch.org/docs/main/quantization.html#prototype-pytorch-2-export-quantization |
Backend/Hardware Support

Hardware | Kernel Library | Eager Mode Quantization | FX Graph Mode Quantization | Quantization Mode Support |
---|---|---|---|---|
server CPU | fbgemm/onednn | Supported | Supported | All Supported |
mobile CPU | qnnpack/xnnpack | Supported | Supported | All Supported |
server GPU | TensorRT (early prototype) | Not supported (requires a graph) | Supported | Static Quantization |
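To make the server-CPU row concrete, here is a minimal eager-mode post-training static quantization sketch. The `TinyModel` module, the input shapes, and the choice of the `fbgemm` engine are illustrative assumptions, not from the source; the `torch.ao.quantization` calls are the standard eager-mode API.

```python
import torch
import torch.ao.quantization as tq

# Select the kernel library ("engine") matching the target hardware:
# 'fbgemm' for server CPUs, 'qnnpack' for ARM/mobile CPUs.
torch.backends.quantized.engine = "fbgemm"

class TinyModel(torch.nn.Module):  # hypothetical toy model for illustration
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()      # fp32 -> int8 boundary
        self.fc = torch.nn.Linear(16, 4)
        self.dequant = tq.DeQuantStub()  # int8 -> fp32 boundary

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = TinyModel().eval()
model.qconfig = tq.get_default_qconfig("fbgemm")  # static quantization config
prepared = tq.prepare(model)      # insert observers at the stub boundaries
prepared(torch.randn(8, 16))      # calibration pass with sample data
quantized = tq.convert(prepared)  # swap modules for quantized kernels
```

Eager mode requires these manual `QuantStub`/`DeQuantStub` markers precisely because there is no graph to analyze, which is why the TensorRT row above lists eager mode as not supported.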
Today, PyTorch supports the following backends for running quantized operators efficiently:
- x86 CPUs with AVX2 support or higher (without AVX2, some operations have inefficient implementations), via the x86 backend optimized by fbgemm and onednn (see the details at RFC)
- ARM CPUs (typically found in mobile/embedded devices), via qnnpack
- (early prototype) support for NVIDIA GPUs via TensorRT through fx2trt (to be open sourced); since TensorRT needs a whole-program graph, this route goes through FX Graph Mode Quantization, sketched below
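Below is a minimal FX graph mode post-training static quantization sketch, the graph-based workflow referenced in the table and the list above. The `Sequential` model, input shapes, and engine choice are illustrative assumptions; the `quantize_fx` functions are the standard FX-mode API.

```python
import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

model = torch.nn.Sequential(torch.nn.Linear(16, 4)).eval()
example_inputs = (torch.randn(8, 16),)

qconfig_mapping = get_default_qconfig_mapping("fbgemm")  # 'qnnpack' targets ARM CPUs
prepared = prepare_fx(model, qconfig_mapping, example_inputs)  # trace + insert observers
prepared(*example_inputs)                                      # calibration pass
quantized = convert_fx(prepared)                               # lower to quantized ops
print(quantized.code)  # inspect the rewritten graph
```

Unlike the eager-mode flow, no stubs are inserted by hand: symbolic tracing recovers the full graph, which is also what makes graph-consuming backends like TensorRT reachable from this mode.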
Note:
- This is a bit old, as fx2trt is already available in torch-tensorrt. However, there