PyTorch Quantization for TensorRT
There seem to be quite a few possible ways to do this:
- PyTorch Eager Mode Quantization TensorRT Acceleration, which seems a bit cumbersome (sketched after this list):
1. torchao quantization
2. ONNX conversion
3. Graph surgery (changing some ops in the ONNX graph)
4. TensorRT conversion
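A rough sketch of what that route could look like. The VGG16 model, the int8_weight_only config, the opset version, and the trtexec flags are only illustrative placeholders, the real graph edits depend on which ops TensorRT rejects, and whether torch.onnx.export accepts torchao's quantized tensors directly is not something verified here:

```python
import onnx
import onnx_graphsurgeon as gs  # pip install onnx-graphsurgeon
import torch
import torchvision
from torchao.quantization import quantize_, int8_weight_only

model = torchvision.models.vgg16(weights="DEFAULT").eval().cuda()
example = torch.randn(1, 3, 224, 224, device="cuda")

# 1. torchao quantization (int8 weight-only, chosen only as an example config)
quantize_(model, int8_weight_only())

# 2. ONNX conversion
torch.onnx.export(model, (example,), "vgg16_int8.onnx", opset_version=17)

# 3. Graph surgery: load the graph and rewrite whatever ops TensorRT rejects
graph = gs.import_onnx(onnx.load("vgg16_int8.onnx"))
# ... replace or fold the offending nodes here ...
graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "vgg16_int8_patched.onnx")

# 4. TensorRT conversion, e.g. with trtexec:
#    trtexec --onnx=vgg16_int8_patched.onnx --int8 --saveEngine=vgg16.engine
```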
- Not sure if it works, but this ordering would be ideal (see the sketch after this list):
1. torch.export
2. torchao quantization
3. TensorRT conversion
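If that ordering does work, I'd guess it gets wired up roughly like this. Both key steps are assumptions, which is exactly the "not sure if it works" part: that torchao's quantize_ can still find the layers in the exported module, and that Torch-TensorRT's dynamo frontend (torch_tensorrt.dynamo.compile) accepts the result.

```python
import torch
import torch_tensorrt
import torchvision
from torchao.quantization import quantize_, int8_weight_only

model = torchvision.models.vgg16(weights="DEFAULT").eval().cuda()
example = torch.randn(1, 3, 224, 224, device="cuda")

# 1. torch.export first, so everything downstream sees an ExportedProgram
exported = torch.export.export(model, (example,))

# 2. torchao quantization applied to the exported graph module
#    (assumption: quantize_ can still operate on the module in this form)
quantize_(exported.module(), int8_weight_only())

# 3. TensorRT conversion through the dynamo frontend
trt_model = torch_tensorrt.dynamo.compile(exported, inputs=[example])
```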
- Less ideal would be:
1. torchao quantization
2. torch.export
3. TensorRT conversion
- I've already sort of tried this last ordering using the VGG PTQ example from TensorRT, but torch.export complained that it couldn't translate the quantized operations (a sketch of that path follows).
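For completeness, roughly what that attempt looks like. The setup mirrors the sketches above; the unwrap_tensor_subclass workaround in the comment comes from torchao's export notes and is an unverified guess at getting past the export error:

```python
import torch
import torch_tensorrt
import torchvision
from torchao.quantization import quantize_, int8_weight_only

model = torchvision.models.vgg16(weights="DEFAULT").eval().cuda()
example = torch.randn(1, 3, 224, 224, device="cuda")

# 1. torchao quantization on the eager model this time
quantize_(model, int8_weight_only())

# torchao's docs mention an unwrap helper that may be needed so torch.export sees
# plain tensors rather than quantized tensor subclasses (unverified assumption):
# from torchao.utils import unwrap_tensor_subclass
# model = unwrap_tensor_subclass(model)

# 2. torch.export -- the step that currently complains about the quantized ops
exported = torch.export.export(model, (example,))

# 3. TensorRT conversion via the dynamo frontend
trt_model = torch_tensorrt.dynamo.compile(exported, inputs=[example])
```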