Optimising Deep Learning using TensorRT for NVIDIA Jetson
I have been working on a project where I needed to build a computer vision solution for real-time processing and tracking of fast-moving objects. We used a 120 fps video camera and chose the NVIDIA Jetson platform to host the solution. The algorithm depended heavily on fast tracking and detection, so we needed an efficient, fast deep neural network. In this post, I would like to share how to optimize a specific deep network for real-time performance using NVIDIA's TensorRT framework.
TensorRT
TensorRT is a programming framework that enables efficient model optimizations for deep neural networks, such as layer fusion and precision reduction (e.g. FP32 to FP16 or INT8). It is a hardware-dependent framework: an optimized model built for one system configuration cannot be reused on another. On the other hand, for the targeted hardware setup you can get a 4x or even 5x inference speed-up over a non-optimized TensorFlow implementation. You can find additional information about the framework here or at the developer's official website. Another benefit of TensorRT is its integrated interface for TensorFlow (TF-TRT), a kind of semi-framework for acceleration within a TensorFlow session that lets TensorRT-optimized operators, engines, and segments run alongside regular TensorFlow nodes. Here you can find an introduction to TF-TRT. Additional details on the relationship between TensorFlow and TensorRT can also be found in the official documentation.
For my project, I created a UNET-like segmentation network with fewer parameters and a slightly different architecture than the original. TensorRT could not properly convert this model into a standalone TensorRT plan for inference on the CUDA engine: the model had complex branching and some paddings that TensorRT does not support. We also preferred not to drift too far away from TensorFlow, so we decided to implement a mixed TF-TRT optimization.
Implementation
Start with a frozen graph. It is important to "freeze" the graph properly: convert all variables to constants and strip the training-only nodes so that only the inference graph remains.
All these procedures decrease the number of nodes (in our case, from 275 to 209). You can find a code example below:
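Here is a minimal sketch of the freezing step, assuming a TensorFlow 1.x session; the output node name ("segmentation/Sigmoid") and file name are illustrative and depend on your model:

```python
import tensorflow as tf
from tensorflow.python.framework import graph_util

def freeze_session(sess, output_node_names):
    """Convert variables to constants and keep only nodes needed for inference."""
    graph_def = sess.graph.as_graph_def()
    # convert_variables_to_constants replaces Variable nodes with Const nodes
    # holding the trained weights and drops nodes unreachable from the outputs
    return graph_util.convert_variables_to_constants(
        sess, graph_def, output_node_names)

# Illustrative usage with a Keras model already loaded in the current session:
# sess = tf.keras.backend.get_session()
# frozen = freeze_session(sess, ["segmentation/Sigmoid"])
# with open("frozen_model.pb", "wb") as f:
#     f.write(frozen.SerializeToString())
```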
At the conversion stage, I needed to tune the conversion parameters; otherwise the converter can miss parts of the computation graph. If the parameters are not suitable, the converter will optimize the graph poorly or not at all (the key parameters appear in the conversion code further below).
The optimized graph runs inside a TensorFlow session. It will not be as fast as a pure TensorRT plan, but it is still much faster than the unoptimized TensorFlow graph.
The TensorFlow integration is also very convenient because it avoids optimization dead ends: if the optimizer finds parts of the computational graph that cannot be optimized, it simply leaves them as-is and marks them for TensorFlow inference, while the remaining parts are handled by TensorRT.
If we look at the model in TensorBoard, we can see that the graph has shrunk and that some nodes have been fused or replaced with "TRTEngineOp", "_ReLUTRT", "TRTEngineOp_native_segment", and similar ops.
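A quick way to check this is to count the TensorRT engine nodes and export the converted graph for TensorBoard; a small sketch, assuming the converted graph has been saved as "trt_model.pb" by the conversion code shown in the next snippet:

```python
import tensorflow as tf

# Load the TF-TRT converted graph (file name is illustrative)
trt_graph = tf.GraphDef()
with open("trt_model.pb", "rb") as f:
    trt_graph.ParseFromString(f.read())

# Count how many TensorRT engine nodes the converter produced
print("TRTEngineOp nodes:", sum(1 for n in trt_graph.node if n.op == "TRTEngineOp"))
print("Total nodes:", len(trt_graph.node))

# Export the graph so it can be inspected in TensorBoard
with tf.Graph().as_default() as g:
    tf.import_graph_def(trt_graph, name="")
    tf.summary.FileWriter("./tensorboard_logs", graph=g).close()
```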
Here is the code for the TF-TRT conversion, with a description:
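A minimal sketch of the conversion, assuming TensorFlow 1.x with the contrib TensorRT module (in 1.14+ the converter lives in tensorflow.python.compiler.tensorrt.trt_convert); the node names, file names, and parameter values here are illustrative and need to be tuned for your graph, as discussed above:

```python
import tensorflow as tf
import tensorflow.contrib.tensorrt as trt  # TF 1.x contrib module

# Load the frozen graph produced in the previous step
frozen_graph = tf.GraphDef()
with open("frozen_model.pb", "rb") as f:
    frozen_graph.ParseFromString(f.read())

# Build a mixed TF-TRT graph: supported subgraphs become TRTEngineOp nodes,
# everything else stays as regular TensorFlow ops
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=["segmentation/Sigmoid"],     # output node(s) of the model
    max_batch_size=1,                     # we run inference frame by frame
    max_workspace_size_bytes=1 << 30,     # ~1 GB scratch space for TensorRT
    precision_mode="FP16",                # Jetson GPUs run FP16 efficiently
    minimum_segment_size=3)               # min number of ops per TensorRT segment

# Save the converted graph for deployment
with open("trt_model.pb", "wb") as f:
    f.write(trt_graph.SerializeToString())
```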
As we can see, the pipeline stays simple and it is still TensorFlow. Using this approach, we reached 95-110 frames per second, whereas the non-optimized model ran at 5-25 frames per second, as illustrated in the plots below.
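For reference, here is a sketch of how the optimized graph can be run inside a regular TensorFlow session and the frame rate measured; the tensor names, input shape, and file name are illustrative:

```python
import time
import numpy as np
import tensorflow as tf

# Load the TF-TRT optimized graph
graph_def = tf.GraphDef()
with open("trt_model.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name="")
    inp = graph.get_tensor_by_name("input:0")
    out = graph.get_tensor_by_name("segmentation/Sigmoid:0")

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # do not grab all of the Jetson's shared memory

with tf.Session(graph=graph, config=config) as sess:
    frame = np.random.rand(1, 256, 256, 3).astype(np.float32)
    # Warm-up: the first runs build the TensorRT engines and are slow
    for _ in range(10):
        sess.run(out, feed_dict={inp: frame})
    n_runs = 200
    start = time.time()
    for _ in range(n_runs):
        sess.run(out, feed_dict={inp: frame})
    print("FPS: %.1f" % (n_runs / (time.time() - start)))
```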
Hardware tips:
Conclusions
The TF-TRT approach gave the best and fastest optimization results, and I was able to increase the frame rate of the overall solution by 2.7x compared to the other approaches we tried.