A Comparative Review of Autoregressive and Diffusion Models for Video Generation
Abstract
The past three years have marked an inflection point for video generation research. Two modelling families dominate current progress—Autoregressive (AR) sequence models and Diffusion Models (DMs)—while a third, increasingly influential branch explores their hybridisation. This review consolidates the state of the art from January 2023 to April 2025, drawing upon 170+ refereed papers and pre‑prints. We present (i) a unified theoretical formulation, (ii) a comparative study of architectural trends, (iii) conditioning techniques with emphasis on text‑to‑video, (iv) advances in sampling efficiency and temporal coherence, (v) an appraisal of benchmark results. We conclude by identifying open challenges that will likely shape the next research cycle.
1. Introduction
1.1 Scope and motivation
Generating high‑fidelity video is substantially harder than still‑image synthesis because video couples rich spatial complexity with non‑trivial temporal dynamics. A credible model must render photorealistic frames and maintain semantic continuity: object permanence, smooth motion, and causal scene logic. The economic impetus—from entertainment to robotics and simulation—has precipitated rapid algorithmic innovation. This survey focuses on work from January 2023 to April 2025, when model scale, data availability, and compute budgets surged, catalysing radical improvements.
1.2 Survey methodology
We systematically queried the arXiv, CVF, OpenReview, and major publisher repositories, retaining publications that (i) introduce new video‑generation algorithms or (ii) propose substantive evaluation or analysis tools. Grey literature from industrial labs (e.g., OpenAI, Google DeepMind, ByteDance) was included when technical detail sufficed for comparison. Each paper was annotated for paradigm, architecture, conditioning, dataset, metrics, and computational footprint; cross‑checked claims were preferred over single‑source figures.
1.3 Organisation
Section 2 reviews foundational paradigms; Section 3 surveys conditioning; Section 4 discusses efficiency and coherence; Section 5 summarises benchmarks; Section 6 outlines challenges; Section 7 concludes.
2. Foundational Paradigms
2.1 Autoregressive sequence models
Probability factorisation. Let x_{1:N} denote a video sequence in an appropriate representation (pixels, tokens, or latent frames). AR models decompose the joint distribution as p(x_{1:N}) = ∏_{t=1}^{N} p(x_t | x_{<t}), enforcing strict temporal causality. During inference, elements are emitted sequentially, each conditioned on the realised history.
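This factorisation translates directly into a sequential decoding loop. The sketch below is a minimal illustration for the discrete-token case, assuming a generic Transformer that returns next-token logits; the names (model, prompt_tokens, max_len) are placeholders rather than any specific system surveyed here.

    import torch

    @torch.no_grad()
    def ar_sample(model, prompt_tokens, max_len):
        """Sequential AR decoding: each token is drawn from p(x_t | x_<t)
        conditioned on the realised history."""
        seq = list(prompt_tokens)
        while len(seq) < max_len:
            history = torch.tensor(seq).unsqueeze(0)      # (1, t)
            logits = model(history)[:, -1, :]             # scores for x_t given x_<t
            probs = torch.softmax(logits, dim=-1)
            seq.append(torch.multinomial(probs, num_samples=1).item())
        return seq

The O(N) latency discussed below stems from this loop: every element requires a further forward pass over the history, mitigated in practice by KV caching.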
Architectures and tokenisation. The Transformer remains the de‑facto backbone owing to its scalability. Three tokenisation regimes coexist: raw‑pixel sequences, discrete tokens drawn from a learned codebook, and continuous latent frames.
Strengths. Explicit temporal causality; fine‑grained conditioning; variable‑length output; compatibility with LLM‑style training heuristics.
Weaknesses. Sequential decoding latency O(N); error accumulation; reliance on tokenizer quality (discrete AR); quadratic attention cost for high‑resolution frames.
Trend 1. Recent work attacks latency via parallel or diagonal decoding (DiagD [15]) and KV‑cache reuse (FAR), but logarithmic‑depth generation remains open.
2.2 Diffusion models
Principle. Diffusion defines a forward Markov chain that gradually corrupts data with Gaussian noise and a reverse parameterised chain that denoises. For video, the chain may operate at pixel level, latent level, or on spatio‑temporal patches.
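As a concrete reference point, the sketch below shows a DDPM-style forward corruption and one reverse step; eps_model and the alpha/alpha_bar schedules are illustrative placeholders rather than a particular video model's interface.

    import torch

    def forward_noise(x0, t, alpha_bar):
        """Forward chain: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
        eps = torch.randn_like(x0)
        a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))   # broadcast over frames/pixels
        return a.sqrt() * x0 + (1 - a).sqrt() * eps, eps

    @torch.no_grad()
    def reverse_step(eps_model, x_t, t, alpha, alpha_bar):
        """One reverse step: predict the injected noise, form the posterior mean."""
        eps_hat = eps_model(x_t, t)
        coef = (1 - alpha[t]) / (1 - alpha_bar[t]).sqrt()
        mean = (x_t - coef * eps_hat) / alpha[t].sqrt()
        if t > 0:                                            # no fresh noise at the final step
            mean = mean + (1 - alpha[t]).sqrt() * torch.randn_like(x_t)
        return mean

For video, x0 is typically a 5-D tensor (batch, channels, frames, height, width), which is why the broadcasting above runs over all trailing dimensions.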
Architectural evolution. Early video DMs repurposed image U‑Nets with temporal convolutions. Two significant shifts followed: the move from pixel‑space to latent‑space diffusion, and the replacement of the U‑Net with Transformer backbones (DiT) operating on spatio‑temporal patches.
Strengths. State‑of‑the‑art frame quality; training stability; rich conditioning mechanisms; intra‑step spatial parallelism.
Weaknesses. Tens to thousands of iterative sampling steps; non‑trivial long‑range temporal coherence; high VRAM requirements for long sequences; sensitivity to denoising‑schedule hyper‑parameters.
Trend 2. Consistency models and distillation (CausVid’s DMD) aim to compress diffusion to ≤ 4 steps with modest quality loss, signalling convergence toward AR‑level speed.
3. Conditional Control
Conditioning transforms an unconditional generator into a guided one, mapping a user prompt y to a distribution p(x | y). Below we contrast AR and diffusion approaches.
3.1 AR conditioning
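AR models typically condition by prefixing: the prompt is tokenised and placed before the video tokens, so every generated token attends to it through the causal mask (cross-attention to a frozen text encoder is the main alternative). A minimal sketch, with sep_id a hypothetical separator token:

    def build_conditioned_sequence(text_tokens, video_tokens, sep_id):
        """Prefix conditioning: [text ... <sep> video ...]. Under a causal mask,
        every video token can attend to the full prompt."""
        return list(text_tokens) + [sep_id] + list(video_tokens)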
3.2 Diffusion conditioning
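A central ingredient on the diffusion side is classifier-free guidance, which mixes conditional and unconditional noise predictions at sampling time to strengthen prompt adherence. A minimal sketch, assuming a denoiser that accepts a text embedding and a null (empty-prompt) embedding:

    import torch

    @torch.no_grad()
    def cfg_epsilon(eps_model, x_t, t, text_emb, null_emb, guidance_scale=7.5):
        """Classifier-free guidance: eps = eps_uncond + s * (eps_cond - eps_uncond)."""
        eps_uncond = eps_model(x_t, t, null_emb)   # prompt dropped
        eps_cond = eps_model(x_t, t, text_emb)     # prompt supplied
        return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

Larger guidance scales trade sample diversity for prompt fidelity.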
3.3 Summary
Diffusion offers the richer conditioning toolkit; AR affords stronger causal alignment. Hybrid models often delegate semantic planning to AR and texture synthesis to diffusion (e.g., LanDiff [20]).
4. Efficiency and Temporal Coherence
4.1 AR acceleration
Diagonal decoding (DiagD) issues multiple tokens per step along diagonal dependencies, delivering roughly a 10× throughput gain. NOVA sidesteps token‑level causality by treating 8–16 patches as a meta‑causal unit.
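The schedule below illustrates the principle only, not DiagD's exact dependency analysis: positions in the (frame, token) grid are grouped by anti-diagonal index and each group is emitted in a single step.

    def diagonal_schedule(num_frames, tokens_per_frame):
        """Group (frame, position) pairs by anti-diagonal index f + p;
        each group is decoded in one parallel step."""
        steps = {}
        for f in range(num_frames):
            for p in range(tokens_per_frame):
                steps.setdefault(f + p, []).append((f, p))
        return [steps[d] for d in sorted(steps)]

    # 16 frames x 256 tokens: 4096 strictly sequential steps collapse to
    # 16 + 256 - 1 = 271 diagonal steps under this illustrative schedule.
    print(len(diagonal_schedule(16, 256)))   # 271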
4.2 Diffusion acceleration
Consistency and distribution‑matching distillation (LCM, DMD) reduce roughly 50 denoising steps to ≤ 4. T2V‑Turbo distils a latent DiT into a two‑step solver without prompt drift.
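A sketch of few-step sampling in the consistency-model style (consistency_fn maps a noisy latent directly to a clean estimate; the sigma schedule is illustrative and not T2V-Turbo's actual configuration):

    import torch

    @torch.no_grad()
    def few_step_sample(consistency_fn, shape, sigmas=(80.0, 24.0, 5.0, 0.5)):
        """Multi-step consistency sampling: denoise, re-noise to a lower level, repeat."""
        x = torch.randn(shape) * sigmas[0]
        x0 = consistency_fn(x, sigmas[0])
        for sigma in sigmas[1:]:
            x = x0 + sigma * torch.randn(shape)   # re-inject noise at level sigma
            x0 = consistency_fn(x, sigma)
        return x0

Four network evaluations replace the tens of solver steps of a standard sampler, which is where the latency gains quoted above come from.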
4.3 Temporal‑coherence techniques
Temporal attention, optical‑flow propagation (Upscale‑A‑Video), and latent world states (Owl‑1) collectively improve coherence. Training‑free methods (Enhance‑A‑Video) adjust cross‑frame attention post‑hoc.
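Temporal attention, the first of these mechanisms, is usually realised by reshaping the spatio-temporal feature map so that self-attention runs along the frame axis at every spatial location; a minimal sketch (shapes and the module itself are illustrative):

    import torch
    import torch.nn as nn

    class TemporalAttention(nn.Module):
        """Self-attention over frames, applied independently at each spatial location."""
        def __init__(self, channels, num_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

        def forward(self, x):                                         # x: (B, T, H, W, C)
            b, t, h, w, c = x.shape
            seq = x.permute(0, 2, 3, 1, 4).reshape(b * h * w, t, c)   # (B*H*W, T, C)
            out, _ = self.attn(seq, seq, seq)                         # frames attend to frames
            return out.reshape(b, h, w, t, c).permute(0, 3, 1, 2, 4)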
5. Benchmarks
Snapshot (April 2025). LTX‑Video leads on FID (4.1); NOVA leads on latency (a 16‑frame 256×256 clip in 12 s); FAR excels at long‑horizon coherence over 5‑minute sequences.
6. Open Challenges
7. Conclusion
Video generation is converging on Transformer‑centric hybrids that blend sequential planning and iterative refinement. Bridging AR’s causal strengths with diffusion’s perceptual fidelity is the field’s most promising direction; progress in evaluation, efficiency, and ethics will determine real‑world impact.
References