# Benchmarks and Results
## Running the benchmarks
Before running the benchmarks, make sure you have downloaded the trained model (see Downloading trained model (for compilation and evaluation)) and compiled it (see Compiling the model).
We’ll assume that the output directory of the `export_tensorrt` compilation was `outputs/2024-10-31/10-43-31`.
There are three possible runtimes to benchmark; examples of each are shown below.
### Python Runtime, no TensorRT

This mode takes the uncompiled model and runs it with mixed precision (fp16 or bf16) or full precision (fp32):

```bash
python -m scripts.benchmark_gpu compile_run_path=outputs/2024-10-31/10-43-31 n_iter=100 load_ts=False amp_dtype=fp16
```
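For reference, this is roughly what the eager mixed-precision path looks like in PyTorch. This is a minimal sketch with a stand-in module, not the actual benchmark script, which wraps the real model and times the loop:

```python
import torch

# Stand-in module; the real benchmark runs the exported detection model.
model = torch.nn.Conv2d(3, 16, 3).eval().cuda()
x = torch.randn(1, 3, 224, 224, device="cuda")

# load_ts=False amp_dtype=fp16 corresponds to eager execution under
# autocast with float16; bf16 and fp32 are selected analogously.
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.float16):
    y = model(x)
```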
### Python Runtime with TensorRT

```bash
python -m scripts.benchmark_gpu compile_run_path=outputs/2024-10-31/10-43-31 n_iter=100 load_ts=True
```
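Under the hood, `load_ts=True` loads the TorchScript program produced by compilation. A minimal sketch of that path follows; the input shape is an assumption, and importing `torch_tensorrt` is needed so the embedded TensorRT ops can be deserialized:

```python
import torch
import torch_tensorrt  # registers the TensorRT runtime ops used by the program

# Path mirrors the run directory used throughout this section.
ts_model = torch.jit.load("outputs/2024-10-31/10-43-31/model.ts").cuda()

x = torch.randn(1, 3, 224, 224, device="cuda")  # shape is an assumption
with torch.inference_mode():
    y = ts_model(x)
```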
### C++ Runtime with TensorRT

Make sure you have built the C++ runtime (see Installation).

```bash
./build/benchmark --model outputs/2024-10-31/10-43-31/model.ts --n_iter=100
```
## Results
Benchmarking was done on an NVIDIA RTX 4060 Ti GPU with 16 GB of VRAM. Results are shown below.
*Table 3: Python runtime, no TensorRT.*

| model’s precision | amp_dtype | latency (ms) |
|---|---|---|
| fp32 | fp32+fp16 | 66.322 ± 0.927 |
| fp32 | fp32+bf16 | 66.497 ± 1.052 |
| fp32 | fp32 | 76.275 ± 0.587 |
Max memory usage for all configurations is ~1 GB.
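The peak-memory figures reported here and below can be checked with PyTorch’s CUDA memory counters. The following is a minimal sketch; the benchmark script may measure this differently:

```python
import torch

torch.cuda.reset_peak_memory_stats()

# ... run the inference loop being benchmarked here ...

peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"max memory usage: {peak_gib:.2f} GiB")
```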
*Table 4: Python runtime with TensorRT.*

| model’s precision | trt.enabled_precisions | latency (ms) |
|---|---|---|
| fp32+fp16 | fp32+bf16+fp16 | 15.369 ± 0.023 |
| fp32 | fp32+bf16+fp16 | 23.164 ± 0.031 |
| fp32 | fp32+bf16 | 25.148 ± 0.030 |
| fp32 | fp32 | 38.381 ± 0.022 |
Max memory usage for all configurations is ~500 MB, except for fp32+fp32, which uses ~770 MB.
*Table 5: C++ runtime with TensorRT.*

| model’s precision | trt.enabled_precisions | latency (ms) |
|---|---|---|
| fp32+fp16 | fp32+bf16+fp16 | 15.433 ± 0.029 |
| fp32 | fp32+bf16+fp16 | 23.263 ± 0.027 |
| fp32 | fp32+bf16 | 25.255 ± 0.014 |
| fp32 | fp32 | 38.465 ± 0.029 |
Max memory usage for all configurations is ~500 MB, except for fp32+fp32, which uses ~770 MB.
> **Note**
>
> For some unknown reason, `bfloat16` precision is not working well: it does not achieve the previously measured performance of 13-14 ms and/or fails compilation with the latest version of `torch_tensorrt`. We include the previous results (Table 6) for completeness, in case the issue is resolved in the future.
*Table 6: Previous results, with an older version of `torch_tensorrt`.*

| model’s precision | trt.enabled_precisions | latency (ms) |
|---|---|---|
| fp32 | fp32+bf16+fp16 | 13.898 |
| fp32 | fp32+fp16 | 13.984 |
| fp32 | fp32+bf16 | 17.261 |
| fp32+bf16 | fp32+bf16 | 22.913 |
| fp32 | fp32 | 37.639 |
## Observations
Some observations we can gather from Table 3, Table 4, Table 5 and Table 6 are:

- Compared to the baseline (76 ms), we have achieved a 5x speedup (15 ms).
- The C++ runtime is negligibly faster than the Python runtime (<1 ms) when using TensorRT.
- Depending on `torch_tensorrt`’s version, either manually set the precision to fp16 with `torch.amp.autocast` or let `torch_tensorrt` handle mixed precision, whichever gives the best performance (both paths are sketched below).
- Memory usage is reduced by half when using TensorRT with mixed precision, compared to full precision in eager Python.
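To make the last precision point concrete, here is a hedged sketch of the two compilation paths using `torch_tensorrt.compile`. A stand-in module is used, and exact argument names can vary between `torch_tensorrt` versions:

```python
import torch
import torch_tensorrt

model = torch.nn.Conv2d(3, 16, 3).eval().cuda()  # stand-in module
example = torch.randn(1, 3, 224, 224, device="cuda")

# Option A: let torch_tensorrt choose kernels from a set of allowed precisions.
trt_mixed = torch_tensorrt.compile(
    model,
    inputs=[example],
    enabled_precisions={torch.float32, torch.bfloat16, torch.float16},
)

# Option B: cast to fp16 manually (or run under torch.amp.autocast) and
# compile with only fp16 enabled.
trt_fp16 = torch_tensorrt.compile(
    model.half(),
    inputs=[example.half()],
    enabled_precisions={torch.float16},
)
```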