Optimising Deep Learning using TensorRT for NVIDIA Jetson

I have been working on a project in which I needed to build a computer vision solution for real-time processing and tracking of fast-moving objects. We used a 120 fps video camera and chose the NVIDIA Jetson platform to host the solution. The algorithm depended heavily on fast detection and tracking, so we needed an efficient, fast deep neural network. In this post, I would like to share how to optimize a specific deep network for real-time performance using NVIDIA's TensorRT framework.

TensorRT

TensorRT is a programming framework for efficient optimization of deep neural network models, using techniques such as layer fusion and changes of variable type (reduced precision). It is hardware-dependent: a model optimized on one system configuration cannot be used on another configuration. On the other hand, for a targeted hardware setup, you can get 4x or even 5x faster inference than with a non-optimized TensorFlow implementation. You can find additional information about the framework on the developer’s official website. Another benefit of TensorRT is its integrated interface for TensorFlow, TF-TRT. It accelerates inference within a TensorFlow session by mixing TensorRT-optimized operators, engines and segments with regular TensorFlow nodes. An introduction to TF-TRT and more detail on the relationship between TensorFlow and TensorRT can be found in the official documentation.

For my project, I created a U-Net-like segmentation network with fewer parameters and a slightly different architecture than the original one. TensorRT could not properly convert this model to a standalone TensorRT plan for inference on the CUDA engine, because the model had complex branching and some specific paddings that TensorRT does not support. We also did not want to drift too far from TensorFlow, so we decided to implement a mixed TF-TRT optimization.

Implementation

  • Freezing

Start from a frozen graph. It is important to “freeze” the graph properly:

  1. Remove or disable unused nodes.
  2. Switch nodes that behave differently during training to inference mode, for example set the training state of “BatchNorm” to “False”.
  3. Remove or fuse duplicated nodes.
  4. Fold constant nodes and fold batch normalization where possible.

All these steps decrease the number of nodes in the graph (in our case, from 275 to 209). You can find the code example below:

[Image: code listing for freezing the graph]
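
The freezing steps listed above can be sketched roughly as follows. This is a minimal sketch and not the code from the original listing: it assumes a TensorFlow 1.x session that already holds the trained model, and it uses the Graph Transform Tool passes that most closely match the list. The names `freeze_graph` and `FREEZE_TRANSFORMS` and the exact choice of passes are assumptions made for illustration. Step 2 is assumed to be handled before the model is built, for example with `tf.keras.backend.set_learning_phase(0)` for a Keras model.

```python
# Graph Transform Tool passes that roughly mirror the steps in the list.
# Assumption: the passes used in the original project are not known.
FREEZE_TRANSFORMS = [
    "strip_unused_nodes",                  # step 1: drop nodes the outputs do not need
    "merge_duplicate_nodes",               # step 3: fuse duplicated nodes
    "fold_constants(ignore_errors=true)",  # step 4: pre-compute constant sub-graphs
    "fold_batch_norms",                    # step 4: fold batch norm scaling into conv weights
]


def freeze_graph(sess, input_names, output_names):
    """Return a frozen, simplified GraphDef for the graph held by `sess`.

    `input_names` and `output_names` are lists of node names (no ':0' suffix).
    """
    # Imported lazily so this module still loads where TensorFlow 1.x is absent.
    import tensorflow as tf
    from tensorflow.tools.graph_transforms import TransformGraph

    graph_def = sess.graph.as_graph_def()
    # Replace variables with constants that hold their current values.
    frozen = tf.compat.v1.graph_util.convert_variables_to_constants(
        sess, graph_def, output_names)
    # Drop nodes that only matter during training (Identity, CheckNumerics).
    frozen = tf.compat.v1.graph_util.remove_training_nodes(frozen)
    # Apply the simplification passes in the order given above.
    return TransformGraph(frozen, input_names, output_names, FREEZE_TRANSFORMS)
```

Comparing `len(graph_def.node)` before and after the call is a quick way to check that the passes actually reduced the node count.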

  • Converting to TF-TRT optimized graph:

At this stage, I needed to tune the conversion parameters. With unsuitable parameters, the converter can skip parts of the computation graph, optimize the graph incorrectly, or not optimize it at all.

The optimized graph runs inside a TensorFlow session. It will not be as fast as a pure TensorRT plan, but it is still much faster than the unoptimized TensorFlow graph.

It is very convenient that TensorFlow supports integration with TensorRT, because the conversion does not have to be all-or-nothing: if the optimizer finds parts of the computational graph that cannot be optimized, it simply leaves them in place and marks them for TensorFlow inference. The rest of the graph is executed by TensorRT.

If we look at the model in TensorBoard, we can see that the graph is smaller and that some nodes have been fused or replaced by "TRTEngineOp", "_ReLUTRT", "TRTEngineOp_native_segment", etc.

Here is the code for the TensorRT conversion, followed by a description of its parameters:

[Image: code listing for the TF-TRT conversion]

  1. max_workspace_size_bytes - memory allocated on the device for TensorRT's algorithms. If you allocate too little, execution will fail with an error or you will get no acceleration. The right value depends on the number of segments, their size, and the number of engines, and you can find it iteratively. Our choice was 1 << 32 bytes (4 GiB).
  2. precision_mode - the numeric type used in engines and segments (FP32, FP16 or INT8).
  3. minimum_segment_size - the minimum number of TF nodes a subgraph must contain to be packed into a TensorRT-optimized segment.
  4. max_batch_size - the maximum number of images fed to the input at once. It also determines the batch dimension of the input shape.
  5. is_dynamic_op - builds engines at run time, which allows dynamic input shapes: engines are cached for the different input shapes encountered. The number of cached engines is controlled by the maximum_cached_engines parameter.
  6. nodes_blacklist - nodes to exclude from conversion, such as output nodes.
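
To make the parameters above concrete, here is a minimal sketch of a conversion call, assuming TensorFlow 1.14 or 1.15 built with TensorRT support and its `TrtGraphConverter` class (where the engine cache limit is spelled `maximum_cached_engines`). Only the workspace size is taken from this post; every other value, and the names `TRT_PARAMS` and `convert_to_trt`, are placeholders chosen for illustration and should be tuned for your own model.

```python
# Conversion settings. Only the workspace size comes from this post;
# all other values are assumptions to be tuned per model.
TRT_PARAMS = {
    "max_workspace_size_bytes": 1 << 32,  # 4 GiB, the value chosen in this post
    "precision_mode": "FP16",             # assumption: one of "FP32", "FP16", "INT8"
    "minimum_segment_size": 3,            # assumption: skip segments with fewer nodes
    "max_batch_size": 1,                  # assumption: one frame per inference call
    "is_dynamic_op": True,                # assumption: build engines at run time
    "maximum_cached_engines": 1,          # assumption: a single input shape
}


def convert_to_trt(frozen_graph_def, output_names, **overrides):
    """Return a TF-TRT optimized GraphDef built from a frozen GraphDef."""
    # Imported lazily: needs a TensorFlow 1.14/1.15 build with TensorRT.
    from tensorflow.python.compiler.tensorrt import trt_convert as trt

    settings = dict(TRT_PARAMS, **overrides)
    converter = trt.TrtGraphConverter(
        input_graph_def=frozen_graph_def,
        nodes_blacklist=output_names,  # keep output nodes out of TRT segments
        **settings)
    return converter.convert()
```

After conversion, counting the nodes whose `op` is `"TRTEngineOp"` in the returned graph is a simple check that at least part of the graph was handed to TensorRT.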

  • Inference:

[Image: code listing for inference]
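
A rough sketch of the inference step, under the assumption that the optimized graph is an ordinary TensorFlow 1.x `GraphDef` run through a regular session. The helper names `tensor_name` and `make_predictor` are hypothetical, and the node names passed in are whatever your own model uses.

```python
def tensor_name(node_name, index=0):
    """Map a node name to the name of one of its output tensors."""
    return "{}:{}".format(node_name, index)


def make_predictor(trt_graph_def, input_node, output_node):
    """Load the optimized graph once and return a function that runs it."""
    # Imported lazily so this module still loads where TensorFlow is absent.
    import tensorflow as tf

    graph = tf.Graph()
    with graph.as_default():
        # name="" keeps the original node names instead of adding a prefix.
        tf.compat.v1.import_graph_def(trt_graph_def, name="")
    # One long-lived session: re-creating it per frame would be far too slow.
    sess = tf.compat.v1.Session(graph=graph)
    inp, out = tensor_name(input_node), tensor_name(output_node)

    def predict(batch):
        # `batch` is a NumPy array shaped like the model input.
        return sess.run(out, feed_dict={inp: batch})

    return predict
```

Keeping the session open for the lifetime of the pipeline matters for real-time use, since graph loading and engine building are paid only once.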

As we can see, the pipeline remains simple and it is still TensorFlow. With this approach, we reached 95-110 frames per second, where the non-optimized model ran at 5-25 frames per second, as illustrated in the plots below:

[Image: frame rate plots for the optimized and non-optimized models]
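
Frame rates like these can be measured with a simple timing loop. The sketch below is a generic helper and not the benchmark used for this post; the names `measure_fps` and `frame_budget_ms` are hypothetical. The warm-up calls are excluded from timing because the first runs of a model are typically slower than steady state, for example when engines are built at run time.

```python
import time


def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def measure_fps(infer, frames, warmup=10):
    """Return the mean frames per second of `infer` after a warm-up phase."""
    frames = list(frames)
    timed = frames[warmup:]
    if not timed:
        raise ValueError("need more frames than warm-up iterations")
    # Warm-up calls are executed but not timed.
    for frame in frames[:warmup]:
        infer(frame)
    start = time.perf_counter()
    for frame in timed:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(timed) / elapsed
```

For a 120 fps camera, `frame_budget_ms(120)` gives roughly 8.3 ms per frame, which is the budget the whole pipeline has to fit into to keep up with the camera.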

Hardware tips:

  1. NVIDIA Jetson uses memory shared between the CPU and the GPU: keep part of it free, otherwise the system may crash (confirmed experimentally; it was solved by reinstalling JetPack). Running low on memory can also slightly slow your model down, even if you have turned on the reallocation flag in the device settings for TensorFlow inference.
  2. Turn on the fan: the module is very sensitive to temperature changes, so keep it cool.
  3. All performance tests were done in the MAXP Core ARM power mode, the power plan for maximum performance.
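
One way to keep part of the shared memory free is to cap what the TensorFlow session may claim. The sketch below assumes that the "reallocation flag" mentioned in the first tip corresponds to the `allow_growth` GPU option of TensorFlow 1.x; that mapping is a guess, and the helper names and the memory figures in the usage note are hypothetical.

```python
def gpu_memory_fraction(total_mb, reserve_mb):
    """Fraction of device memory TensorFlow may use after reserving `reserve_mb`."""
    if not 0 <= reserve_mb < total_mb:
        raise ValueError("reserve_mb must be in [0, total_mb)")
    return (total_mb - reserve_mb) / total_mb


def make_session_config(total_mb, reserve_mb):
    """Build a TF 1.x session config that leaves `reserve_mb` for the system."""
    # Imported lazily so this module still loads where TensorFlow is absent.
    import tensorflow as tf

    config = tf.compat.v1.ConfigProto()
    # Allocate memory on demand instead of grabbing it all at start-up.
    config.gpu_options.allow_growth = True
    # Hard cap on the share of memory this process may take.
    config.gpu_options.per_process_gpu_memory_fraction = gpu_memory_fraction(
        total_mb, reserve_mb)
    return config
```

For example, on a hypothetical board with 4096 MB of shared memory, `gpu_memory_fraction(4096, 1024)` caps TensorFlow at 0.75 of the total and leaves 1024 MB for the operating system and the rest of the pipeline.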

Conclusions

The TF-TRT approach gave the best and fastest optimization results, and I was able to increase the frame rate of the overall solution by 2.7 times compared to other approaches.



