Benchmarks and Results#

Running the benchmarks#

Before running the benchmarks, make sure you have downloaded the trained model (see Downloading trained model (for compilation and evaluation)) and compiled it (see Compiling the model).

We’ll assume that the output directory of the export_tensorrt compilation was outputs/2024-10-31/10-43-31.

There are three possible runtimes to benchmark; examples are shown below:

Python Runtime, no TensorRT

This mode takes the uncompiled model and runs it with mixed precision (fp16 or bf16) or full precision (fp32).

python -m scripts.benchmark_gpu compile_run_path=outputs/2024-10-31/10-43-31 n_iter=100 load_ts=False amp_dtype=fp16
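In this mode the forward pass is effectively wrapped in torch.amp.autocast. A minimal sketch of the idea (the model and input below are placeholders, not the script’s actual objects):

```python
import torch

# Placeholder model/input; the real benchmark loads the exported model
# and its inputs from the Hydra run directory (compile_run_path).
model = torch.nn.Conv2d(3, 16, kernel_size=3).cuda().eval()
x = torch.randn(1, 3, 224, 224, device="cuda")

with torch.inference_mode():
    # amp_dtype=fp16 corresponds to autocasting the fp32 weights/activations
    # to float16 at runtime; amp_dtype=bf16 would use torch.bfloat16 instead.
    with torch.autocast("cuda", dtype=torch.float16):
        y = model(x)
```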

Python Runtime with TensorRT

python -m scripts.benchmark_gpu compile_run_path=outputs/2024-10-31/10-43-31 n_iter=100 load_ts=True
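With load_ts=True the script loads the Torch-TensorRT compiled TorchScript module instead of the eager model. A rough sketch of the equivalent Python (the input shape is an assumption):

```python
import torch

# load_ts=True roughly corresponds to loading the Torch-TensorRT compiled
# TorchScript module produced during export, instead of the eager model.
model = torch.jit.load("outputs/2024-10-31/10-43-31/model.ts").cuda().eval()

x = torch.randn(1, 3, 224, 224, device="cuda")  # placeholder input shape
with torch.inference_mode():
    y = model(x)
```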

C++ Runtime with TensorRT

Make sure you have built the C++ runtime (see Installation).

./build/benchmark --model outputs/2024-10-31/10-43-31/model.ts --n_iter=100

Results#

Benchmarking was done on an NVIDIA RTX 4060 Ti GPU with 16 GB of VRAM. Results are shown below.
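Latencies are reported as mean ± standard deviation over n_iter iterations. A minimal sketch of how such per-iteration timings can be collected with CUDA events (the benchmark scripts may differ in details such as warm-up handling):

```python
import torch

def measure_latency_ms(model, x, n_iter=100, warmup=10):
    """Return (mean, std) of per-iteration latency in milliseconds."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    times = []
    with torch.inference_mode():
        for i in range(warmup + n_iter):
            start.record()
            model(x)
            end.record()
            torch.cuda.synchronize()  # wait for the GPU before reading the timer
            if i >= warmup:           # discard warm-up iterations
                times.append(start.elapsed_time(end))
    t = torch.tensor(times)
    return t.mean().item(), t.std().item()
```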

Table 3 Python Runtime, no TensorRT#

| model’s precision | amp_dtype | latency (ms) |
|---|---|---|
| fp32 | fp32+fp16 | 66.322 ± 0.927 |
| fp32 | fp32+bf16 | 66.497 ± 1.052 |
| fp32 | fp32 | 76.275 ± 0.587 |

Max memory usage for all configurations is ~1GB.
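Peak memory figures of this kind can be read from PyTorch’s allocator statistics; the snippet below is a sketch and not necessarily how scripts.benchmark_gpu reports them:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run the benchmark iterations here ...
peak_mb = torch.cuda.max_memory_allocated() / 1024**2
print(f"max memory allocated: {peak_mb:.0f} MB")
```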

Table 4 Python Runtime, with TensorRT#

| model’s precision | trt.enabled_precisions | latency (ms) |
|---|---|---|
| fp32+fp16 | fp32+bf16+fp16 | 15.369 ± 0.023 |
| fp32 | fp32+bf16+fp16 | 23.164 ± 0.031 |
| fp32 | fp32+bf16 | 25.148 ± 0.030 |
| fp32 | fp32 | 38.381 ± 0.022 |

Max memory usage for all configurations is ~500MB, except for fp32+fp32, which is ~770MB.

Table 5 C++ Runtime, with TensorRT#

| model’s precision | trt.enabled_precisions | latency (ms) |
|---|---|---|
| fp32+fp16 | fp32+bf16+fp16 | 15.433 ± 0.029 |
| fp32 | fp32+bf16+fp16 | 23.263 ± 0.027 |
| fp32 | fp32+bf16 | 25.255 ± 0.014 |
| fp32 | fp32 | 38.465 ± 0.029 |

Max memory usage for all configurations is ~500MB, except for fp32+fp32, which is ~770MB.

Note

For an unknown reason, bfloat16 precision does not work well with the latest version of torch_tensorrt: it no longer reaches the previously measured performance (13-14 ms) and/or fails to compile.

We include the previous results for completeness, in case the issue is resolved in the future.

Table 6 C++ Runtime, with TensorRT (previous results)#

| model’s precision | trt.enabled_precisions | latency (ms) |
|---|---|---|
| fp32 | fp32+bf16+fp16 | 13.898 |
| fp32 | fp32+fp16 | 13.984 |
| fp32 | fp32+bf16 | 17.261 |
| fp32+bf16 | fp32+bf16 | 22.913 |
| fp32 | fp32 | 37.639 |

Observations#

Some observations we can gather from Table 3, Table 4, Table 5 and Table 6 are:

  • Compared to the baseline (76 ms), we achieve roughly a 5x speedup (15 ms).

  • The C++ runtime performs essentially the same as the Python runtime (differences under 1 ms) when using TensorRT.

  • For the best performance, depending on the torch_tensorrt version, either manually set the precision to fp16 with torch.amp.autocast or let torch_tensorrt handle mixed precision (see the sketch after this list).

  • Memory usage is roughly halved when using TensorRT with mixed precision (~500 MB) compared to full precision in eager Python (~1 GB).
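A minimal sketch contrasting the two options from the third observation, using a toy module as a stand-in for the exported model (the argument names follow the public torch_tensorrt.compile API; the project’s export script may configure this differently):

```python
import torch
import torch_tensorrt

# Toy module as a stand-in for the exported fp32 model.
model = torch.nn.Conv2d(3, 16, kernel_size=3).cuda().eval()
example = torch.randn(1, 3, 224, 224, device="cuda")  # placeholder input shape

# Option A: keep fp32 weights and let torch_tensorrt choose lower-precision kernels.
trt_model = torch_tensorrt.compile(
    model,
    inputs=[example],
    enabled_precisions={torch.float32, torch.bfloat16, torch.float16},
)

# Option B: additionally (or instead) run under autocast so inputs/activations are fp16.
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.float16):
    out = trt_model(example)
```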