TRecViT: A Recurrent Video Transformer

Today's paper introduces TRecViT, a new architecture for video understanding that combines gated linear recurrent units (LRUs) with Vision Transformers (ViT). The architecture efficiently processes video data by separating temporal, spatial, and channel dimensions, leading to significant improvements in memory usage and computational efficiency compared to existing approaches.

Method Overview

TRecViT processes videos through a stack of identical blocks, each performing a sequence of information-mixing steps across the different dimensions of the video signal: time, space, and channels.

The temporal processing is handled by gated linear recurrent units (LRUs) that operate on "temporal tubes" - sequences of patches at the same spatial location. These LRUs are efficient at processing long sequences and maintain a memory of past frames through their recurrent state.
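To make the temporal step concrete, here is a minimal sketch of a gated linear recurrence applied to one temporal tube. It is a simplification: the paper's LRU uses learned diagonal dynamics, whereas this sketch uses a real-valued, input-dependent gate; the names `gated_lru_scan`, `w_a`, and `w_x` are hypothetical, not from the paper.

```python
import numpy as np

def gated_lru_scan(x, w_a, w_x):
    """Simplified gated linear recurrence over one temporal tube.

    x: (T, D) array of patch embeddings at a single spatial location,
       one row per frame. w_a, w_x: (D, D) hypothetical projections.
    The recurrent state h is the only memory carried across frames.
    """
    T, D = x.shape
    h = np.zeros(D)
    out = np.empty_like(x)
    for t in range(T):
        a = 1.0 / (1.0 + np.exp(-(x[t] @ w_a)))  # sigmoid gate in (0, 1)
        h = a * h + (1.0 - a) * (x[t] @ w_x)     # convex blend of past and present
        out[t] = h
    return out
```

Because the state `h` has a fixed size, the cost per frame is constant, which is what makes the recurrence attractive for long sequences.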

Mixing over the spatial and channel dimensions is handled by a Vision Transformer (ViT) block, which lets all patches within a frame interact through self-attention and MLP channel mixing. The model can process videos frame by frame during inference, maintaining a constant memory footprint regardless of video length.
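The constant-memory streaming property can be illustrated with a shape-level sketch: only the per-tube recurrent state is kept between frames, while each frame is processed and then discarded. This is an illustration under assumptions, not the paper's code; `process_video_streaming`, `w_a`, `w_x`, and `spatial_mix` (a stand-in for the ViT attention + MLP step) are hypothetical names.

```python
import numpy as np

def process_video_streaming(frames, w_a, w_x, spatial_mix):
    """Frame-by-frame inference sketch.

    frames: iterable of (N, D) arrays, N patches per frame.
    Only h (shape (N, D)) persists between frames, so memory use
    is constant in the number of frames.
    """
    h = None
    for x in frames:                              # x: (N, D), one frame
        if h is None:
            h = np.zeros_like(x)
        a = 1.0 / (1.0 + np.exp(-(x @ w_a)))      # per-tube gate
        h = a * h + (1.0 - a) * (x @ w_x)         # temporal mixing (recurrence)
        yield spatial_mix(h)                       # spatial/channel mixing, then emit
```

Contrast this with full spatio-temporal attention, where every new frame attends to all previous ones, so memory and compute grow with video length.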

Results

The model demonstrates strong performance across multiple tasks and datasets. On the Something-Something-v2 (SSv2) dataset, TRecViT outperforms the ViViT-L baseline while using 3× fewer parameters. For Kinetics-400, it achieves comparable performance to ViViT-L with significantly lower computational requirements.

In self-supervised learning scenarios, TRecViT shows strong performance on both video classification and point tracking tasks.

Conclusion

TRecViT presents an efficient approach to video understanding by combining the strengths of linear recurrent units and vision transformers. The architecture achieves competitive or superior performance compared to larger models while being more computationally efficient and memory-friendly. For more information, please consult the full paper.

Congrats to the authors for their work!

Pătrăucean, Viorica, et al. "TRecViT: A Recurrent Video Transformer." arXiv preprint arXiv:2412.14294 (2024).
