This converter parses a TensorFlow protobuf file or `graph_def` object and creates a TensorRT network from it. Everything is written in Python 3 and does not require the installation of additional packages.
There is no support for dynamic shapes. Computations based on the shapes of tensors inside the network at runtime are therefore not possible. Every shape must be fully specified at construction time in TensorFlow.
TensorFlow has a lot of custom operations and not all of them are supported in TensorRT. Right now the converter is missing many operations and attributes. Adding them is quite easy as long as TensorRT supports them.
The layout of the input data to the TensorFlow network should be channel-first (NCHW), which makes the conversion easier.
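As an illustration only (the tensor name and layer choice below are hypothetical), a graph input that satisfies both constraints, a fully specified shape and NCHW layout, could look like this:

```python
import tensorflow as tf

# Fully specified shape in channel-first (NCHW) order: [batch, channels, height, width]
input_tensor = tf.compat.v1.placeholder(
    tf.float32, shape=[1, 3, 224, 224], name="input_tensor")

# Convolutions should use the channel-first layout explicitly as well
conv = tf.compat.v1.layers.conv2d(
    input_tensor, filters=64, kernel_size=3, data_format="channels_first")
```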
It is recommended to install TensorRT via Anaconda and the IBM repositories. The newest pycuda version can be installed from PyPI or IBM's channel in Anaconda.
- Easy to extend
- No special 3rd party dependencies
- Pure Python
If you are interested in a full-fledged converter, try out ONNX-TensorRT.
The shortest way to use the converter is to call:
```python
import tensorrt as trt
from trt_importer import TRTImporter

# Parse the frozen TensorFlow graph_def into a TensorRT network
importer = TRTImporter(trt.Logger.VERBOSE)
network = importer.from_tensorflow_graph_def(graph_def, ["input_tensor"], [[1, 3, 224, 224]], ["softmax"])
# Optimize and serialize the network into an engine
serialized_engine = importer.optimize_network(network, max_workspace_size=4 * (1 << 30))
```
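The serialized engine can then be stored and reloaded later. Assuming `serialized_engine` behaves like a Python buffer of bytes, a minimal sketch could be:

```python
# Persist the serialized engine so the (slow) optimization step only runs once
with open("model.engine", "wb") as f:
    f.write(serialized_engine)
```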
The example also explains how to freeze a TensorFlow graph and run a TensorRT engine.
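For reference, freezing a TensorFlow 1.x graph (folding variables into constants so a self-contained `graph_def` can be exported) typically looks roughly like this. This is a sketch, assuming the model has already been built in the default graph and produces an output node named "softmax" as in the snippet above; the file path is illustrative:

```python
import tensorflow as tf

with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    # Replace all variables with constants and keep only the nodes needed for "softmax"
    frozen_graph_def = tf.compat.v1.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, ["softmax"])

# The frozen graph can also be written to disk and reloaded later
with tf.io.gfile.GFile("frozen_model.pb", "wb") as f:
    f.write(frozen_graph_def.SerializeToString())
```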
The TensorRT network can be optimized for FP16 computation by adding a `fp16_mode=True` parameter to the `optimize_network(...)` method. Please note that it is still important to provide a normal FP32 graph without any manual FP16 casting operations in it.
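As a sketch, reusing the network built above and the workspace size from the earlier example:

```python
# Let TensorRT choose FP16 kernels; the input graph itself stays FP32
serialized_engine = importer.optimize_network(
    network, max_workspace_size=4 * (1 << 30), fp16_mode=True)
```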
An additional parameter `max_batch_size=32` of the `optimize_network(...)` method defines the maximum batch size the resulting engine will be able to execute. Optimizing for higher batch sizes might reduce the performance of smaller ones. If you need the best performance for several batch sizes, you need to create an execution profile for each one of them.
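For example, again a sketch built on the call above, an engine tuned to execute up to 32 samples per pass:

```python
# The engine will accept batch sizes up to 32, but is tuned for 32
serialized_engine = importer.optimize_network(
    network, max_workspace_size=4 * (1 << 30), max_batch_size=32)
```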
The optimization process is an exhaustive search for the implementation that runs each particular layer in the least amount of time. You can specify how often each implementation should be repeated (`min_find_iterations`, `average_find_iterations`) in order to get an average computation time. Our API provides a simplified parameter for that: `importer.optimize_network(..., fast_pass=True)` sets both iteration parameters to 1 if `fast_pass` is `True` and otherwise to 5.
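A sketch of both options, assuming `min_find_iterations` and `average_find_iterations` can also be passed to `optimize_network(...)` directly:

```python
# Quick, less precise timing of candidate kernels (both iteration counts set to 1)
engine_fast = importer.optimize_network(network, fast_pass=True)

# Equivalent to fast_pass=False: each candidate kernel is timed five times
engine_precise = importer.optimize_network(
    network, min_find_iterations=5, average_find_iterations=5)
```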
The resulting serialized engine contains the weights and structure of the network. When executing an inference pass, all weights are copied into GPU memory, plus some additional space is needed for intermediate calculations. The `serialized_engine.device_memory_size` variable gives an approximation of how much memory will be consumed.
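For instance, the expected footprint could be inspected like this (a sketch following the attribute described above):

```python
# Rough estimate of GPU memory needed at inference time (weights + scratch space)
print("Approx. device memory: %.1f MiB" % (serialized_engine.device_memory_size / (1 << 20)))
```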
We would love to get in contact with the community. Feel free to e-mail us or use the issue system to suggest new features and ask questions. Pull requests are always welcome; we try to incorporate them into the master branch as fast as possible. Not sure if that typo is worth a pull request? Do it! We will appreciate it.
This project is maintained by the Visual Computing Group at HTW Berlin. Some parts of the source code are based on methods of the ONNX-TensorRT converter.