TRecViT: A Recurrent Video Transformer

Today's paper introduces TRecViT, a new architecture for video understanding that combines gated linear recurrent units (LRUs) with Vision Transformers (ViT). The architecture efficiently processes video data by separating temporal, spatial, and channel dimensions, leading to significant improvements in memory usage and computational efficiency compared to existing approaches.

Method Overview

TRecViT processes videos through a stack of identical blocks, each performing a sequence of information-mixing steps across the different dimensions of the video signal: time, space, and channels.

The temporal processing is handled by gated linear recurrent units (LRUs) that operate on "temporal tubes" - sequences of patches at the same spatial location. These LRUs are efficient at processing long sequences and maintain a memory of past frames through their recurrent state.
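To make the temporal step concrete, here is a minimal sketch of a gated linear recurrence applied to one temporal tube. It is a simplification: the paper's LRU uses learned diagonal dynamics, whereas this sketch uses a real-valued, input-dependent gate; the names `gated_lru_scan`, `w_a`, and `w_x` are hypothetical, not from the paper.

```python
import numpy as np

def gated_lru_scan(x, w_a, w_x):
    """Simplified gated linear recurrence over one temporal tube.

    x: (T, D) array of patch embeddings at a single spatial location,
       one row per frame. w_a, w_x: (D, D) hypothetical projections.
    The recurrent state h is the only memory carried across frames.
    """
    T, D = x.shape
    h = np.zeros(D)
    out = np.empty_like(x)
    for t in range(T):
        a = 1.0 / (1.0 + np.exp(-(x[t] @ w_a)))  # sigmoid gate in (0, 1)
        h = a * h + (1.0 - a) * (x[t] @ w_x)     # convex blend of past and present
        out[t] = h
    return out
```

Because the state `h` has a fixed size, the cost per frame is constant, which is what makes the recurrence attractive for long sequences.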

Mixing over the spatial and channel dimensions is handled by a Vision Transformer (ViT) block, which lets all patches within a frame interact through self-attention and MLP channel mixing. The model can process videos frame by frame during inference, maintaining a constant memory footprint regardless of video length.
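The constant-memory streaming property can be illustrated with a shape-level sketch: only the per-tube recurrent state is kept between frames, while each frame is processed and then discarded. This is an illustration under assumptions, not the paper's code; `process_video_streaming`, `w_a`, `w_x`, and `spatial_mix` (a stand-in for the ViT attention + MLP step) are hypothetical names.

```python
import numpy as np

def process_video_streaming(frames, w_a, w_x, spatial_mix):
    """Frame-by-frame inference sketch.

    frames: iterable of (N, D) arrays, N patches per frame.
    Only h (shape (N, D)) persists between frames, so memory use
    is constant in the number of frames.
    """
    h = None
    for x in frames:                              # x: (N, D), one frame
        if h is None:
            h = np.zeros_like(x)
        a = 1.0 / (1.0 + np.exp(-(x @ w_a)))      # per-tube gate
        h = a * h + (1.0 - a) * (x @ w_x)         # temporal mixing (recurrence)
        yield spatial_mix(h)                       # spatial/channel mixing, then emit
```

Contrast this with full spatio-temporal attention, where every new frame attends to all previous ones, so memory and compute grow with video length.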

Results

The model demonstrates strong performance across multiple tasks and datasets. On the Something-Something-v2 (SSv2) dataset, TRecViT outperforms the ViViT-L baseline while using 3× fewer parameters. For Kinetics-400, it achieves comparable performance to ViViT-L with significantly lower computational requirements.

In self-supervised learning scenarios, TRecViT shows strong performance on both video classification and point tracking tasks.

Conclusion

TRecViT presents an efficient approach to video understanding by combining the strengths of linear recurrent units and vision transformers. The architecture achieves competitive or superior performance compared to larger models while being more computationally efficient and memory-friendly. For more information, please consult the full paper.

Congrats to the authors for their work!

Pătrăucean, Viorica, et al. "TRecViT: A Recurrent Video Transformer." arXiv preprint arXiv:2412.14294 (2024).
