A Comparative Review of Autoregressive and Diffusion Models for Video Generation
Abstract
The past three years have marked an inflection point for video generation research. Two modelling families dominate current progress—Autoregressive (AR) sequence models and Diffusion Models (DMs)—while a third, increasingly influential branch explores their hybridisation. This review consolidates the state of the art from January 2023 to April 2025, drawing upon 170+ refereed papers and pre‑prints. We present (i) a unified theoretical formulation, (ii) a comparative study of architectural trends, (iii) conditioning techniques with emphasis on text‑to‑video, (iv) advances in sampling efficiency and temporal coherence, (v) an appraisal of benchmark results. We conclude by identifying open challenges that will likely shape the next research cycle.
1. Introduction
1.1 Scope and motivation
Generating high‑fidelity video is substantially harder than still‑image synthesis because video couples rich spatial complexity with non‑trivial temporal dynamics. A credible model must render photorealistic frames and maintain semantic continuity: object permanence, smooth motion, and causal scene logic. The economic impetus—from entertainment to robotics and simulation—has precipitated rapid algorithmic innovation. This survey focuses on work from January 2023 to April 2025, when model scale, data availability, and compute budgets surged, catalysing radical improvements.
1.2 Survey methodology
We systematically queried the arXiv, CVF, OpenReview, and major publisher repositories, retaining publications that (i) introduce new video‑generation algorithms or (ii) propose substantive evaluation or analysis tools. Grey literature from industrial labs (e.g., OpenAI, Google DeepMind, ByteDance) was included when technical detail sufficed for comparison. Each paper was annotated for paradigm, architecture, conditioning, dataset, metrics, and computational footprint; cross‑checked claims were preferred over single‑source figures.
1.3 Organisation
Section 2 reviews foundational paradigms; Section 3 surveys conditioning; Section 4 discusses efficiency and coherence; Section 5 summarises benchmarks; Section 6 outlines challenges; Section 7 concludes.
2. Foundational Paradigms
2.1 Autoregressive sequence models
Probability factorisation. Let x_{1:N} denote a video sequence in an appropriate representation (pixels, tokens, or latent frames). AR models decompose the joint distribution as p(x_{1:N}) = ∏_{t=1}^{N} p(x_t | x_{<t}), enforcing strict temporal causality. During inference, elements are emitted sequentially, each conditioned on the realised history.
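This factorisation translates directly into a sequential decoding loop. The sketch below is a minimal illustration for the discrete-token case, assuming a generic Transformer that returns next-token logits; the names (model, prompt_tokens, max_len) are placeholders rather than any specific system surveyed here.

    import torch

    @torch.no_grad()
    def ar_sample(model, prompt_tokens, max_len):
        """Sequential AR decoding: each token is drawn from p(x_t | x_<t)
        conditioned on the realised history."""
        seq = list(prompt_tokens)
        while len(seq) < max_len:
            history = torch.tensor(seq).unsqueeze(0)      # (1, t)
            logits = model(history)[:, -1, :]             # scores for x_t given x_<t
            probs = torch.softmax(logits, dim=-1)
            seq.append(torch.multinomial(probs, num_samples=1).item())
        return seq

The O(N) latency discussed below stems from this loop: every element requires a further forward pass over the history, mitigated in practice by KV caching.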
Architectures and tokenisation. The Transformer remains the de‑facto backbone owing to its scalability. Three tokenisation regimes coexist: raw‑pixel sequences, discrete tokens drawn from a learned codebook, and continuous latent frames.
Strengths. Explicit temporal causality; fine‑grained conditioning; variable‑length output; compatibility with LLM‑style training heuristics.
Weaknesses. Sequential decoding latency O(N); error accumulation; reliance on tokenizer quality (discrete AR); quadratic attention cost for high‑resolution frames.
Trend 1. Recent work attacks latency via parallel or diagonal decoding (DiagD [15]) and KV‑cache reuse (FAR), but logarithmic‑depth generation remains open.
2.2 Diffusion models
Principle. Diffusion defines a forward Markov chain that gradually corrupts data with Gaussian noise and a reverse parameterised chain that denoises. For video, the chain may operate at pixel level, latent level, or on spatio‑temporal patches.
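As a concrete reference point, the sketch below shows a DDPM-style forward corruption and one reverse step; eps_model and the alpha/alpha_bar schedules are illustrative placeholders rather than a particular video model's interface.

    import torch

    def forward_noise(x0, t, alpha_bar):
        """Forward chain: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
        eps = torch.randn_like(x0)
        a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))   # broadcast over frames/pixels
        return a.sqrt() * x0 + (1 - a).sqrt() * eps, eps

    @torch.no_grad()
    def reverse_step(eps_model, x_t, t, alpha, alpha_bar):
        """One reverse step: predict the injected noise, form the posterior mean."""
        eps_hat = eps_model(x_t, t)
        coef = (1 - alpha[t]) / (1 - alpha_bar[t]).sqrt()
        mean = (x_t - coef * eps_hat) / alpha[t].sqrt()
        if t > 0:                                            # no fresh noise at the final step
            mean = mean + (1 - alpha[t]).sqrt() * torch.randn_like(x_t)
        return mean

For video, x0 is typically a 5-D tensor (batch, channels, frames, height, width), which is why the broadcasting above runs over all trailing dimensions.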
Architectural evolution. Early video DMs repurposed image U‑Nets with temporal convolutions. Two significant shifts followed: the move from pixel‑space to latent‑space diffusion, and the replacement of the U‑Net with Transformer backbones (DiT) operating on spatio‑temporal patches.
Strengths. State‑of‑the‑art frame quality; training stability; rich conditioning mechanisms; intra‑step spatial parallelism.
Weaknesses. Tens to thousands of iterative sampling steps; non‑trivial long‑range temporal coherence; high VRAM requirements for long sequences; sensitivity to denoising‑schedule hyper‑parameters.
Trend 2. Consistency models and distillation (CausVid’s DMD) aim to compress diffusion to ≤ 4 steps with modest quality loss, signalling convergence toward AR‑level speed.
3. Conditional Control
Conditioning transforms an unconditional generator into a guided one, mapping a user prompt y to a distribution p(x | y). Below we contrast AR and diffusion approaches.
3.1 AR conditioning
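AR models typically condition by prefixing: the prompt is tokenised and placed before the video tokens, so every generated token attends to it through the causal mask (cross-attention to a frozen text encoder is the main alternative). A minimal sketch, with sep_id a hypothetical separator token:

    def build_conditioned_sequence(text_tokens, video_tokens, sep_id):
        """Prefix conditioning: [text ... <sep> video ...]. Under a causal mask,
        every video token can attend to the full prompt."""
        return list(text_tokens) + [sep_id] + list(video_tokens)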
3.2 Diffusion conditioning
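A central ingredient on the diffusion side is classifier-free guidance, which mixes conditional and unconditional noise predictions at sampling time to strengthen prompt adherence. A minimal sketch, assuming a denoiser that accepts a text embedding and a null (empty-prompt) embedding:

    import torch

    @torch.no_grad()
    def cfg_epsilon(eps_model, x_t, t, text_emb, null_emb, guidance_scale=7.5):
        """Classifier-free guidance: eps = eps_uncond + s * (eps_cond - eps_uncond)."""
        eps_uncond = eps_model(x_t, t, null_emb)   # prompt dropped
        eps_cond = eps_model(x_t, t, text_emb)     # prompt supplied
        return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

Larger guidance scales trade sample diversity for prompt fidelity.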
3.3 Summary
Diffusion offers the richer conditioning toolkit; AR affords stronger causal alignment. Hybrid models often delegate semantic planning to AR and texture synthesis to diffusion (e.g., LanDiff [20]).
4. Efficiency and Temporal Coherence
4.1 AR acceleration
Diagonal decoding (DiagD) issues multiple tokens per step along diagonal dependencies, delivering roughly a 10× throughput gain. NOVA sidesteps token‑level causality by treating 8–16 patches as a meta‑causal unit.
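The schedule below illustrates the principle only, not DiagD's exact dependency analysis: positions in the (frame, token) grid are grouped by anti-diagonal index and each group is emitted in a single step.

    def diagonal_schedule(num_frames, tokens_per_frame):
        """Group (frame, position) pairs by anti-diagonal index f + p;
        each group is decoded in one parallel step."""
        steps = {}
        for f in range(num_frames):
            for p in range(tokens_per_frame):
                steps.setdefault(f + p, []).append((f, p))
        return [steps[d] for d in sorted(steps)]

    # 16 frames x 256 tokens: 4096 strictly sequential steps collapse to
    # 16 + 256 - 1 = 271 diagonal steps under this illustrative schedule.
    print(len(diagonal_schedule(16, 256)))   # 271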
4.2 Diffusion acceleration
Consistency and distribution‑matching distillation (LCM, DMD) reduce roughly 50 denoising steps to ≤ 4. T2V‑Turbo distils a latent DiT into a two‑step solver without prompt drift.
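A sketch of few-step sampling in the consistency-model style (consistency_fn maps a noisy latent directly to a clean estimate; the sigma schedule is illustrative and not T2V-Turbo's actual configuration):

    import torch

    @torch.no_grad()
    def few_step_sample(consistency_fn, shape, sigmas=(80.0, 24.0, 5.0, 0.5)):
        """Multi-step consistency sampling: denoise, re-noise to a lower level, repeat."""
        x = torch.randn(shape) * sigmas[0]
        x0 = consistency_fn(x, sigmas[0])
        for sigma in sigmas[1:]:
            x = x0 + sigma * torch.randn(shape)   # re-inject noise at level sigma
            x0 = consistency_fn(x, sigma)
        return x0

Four network evaluations replace the tens of solver steps of a standard sampler, which is where the latency gains quoted above come from.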
4.3 Temporal‑coherence techniques
Temporal attention, optical‑flow propagation (Upscale‑A‑Video), and latent world states (Owl‑1) collectively improve coherence. Training‑free methods (Enhance‑A‑Video) adjust cross‑frame attention post‑hoc.
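Temporal attention, the first of these mechanisms, is usually realised by reshaping the spatio-temporal feature map so that self-attention runs along the frame axis at every spatial location; a minimal sketch (shapes and the module itself are illustrative):

    import torch
    import torch.nn as nn

    class TemporalAttention(nn.Module):
        """Self-attention over frames, applied independently at each spatial location."""
        def __init__(self, channels, num_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

        def forward(self, x):                                         # x: (B, T, H, W, C)
            b, t, h, w, c = x.shape
            seq = x.permute(0, 2, 3, 1, 4).reshape(b * h * w, t, c)   # (B*H*W, T, C)
            out, _ = self.attn(seq, seq, seq)                         # frames attend to frames
            return out.reshape(b, h, w, t, c).permute(0, 3, 1, 2, 4)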
5. Benchmarks
Snapshot (April 2025). LTX‑Video leads on FID (4.1); NOVA leads on latency (a 16‑frame 256×256 clip in 12 s); FAR excels at long‑horizon coherence over 5‑minute sequences.
6. Open Challenges
7. Conclusion
Video generation is converging on Transformer‑centric hybrids that blend sequential planning and iterative refinement. Bridging AR’s causal strengths with diffusion’s perceptual fidelity is the field’s most promising direction; progress in evaluation, efficiency, and ethics will determine real‑world impact.
References