Together AI’s Post

Introducing Chipmunk: joint work w/ SandyResearch @ UCSD for training-free acceleration of Diffusion Transformers w/ attention/MLP step deltas! ⚡️

Up to 3.7x faster video and 1.6x faster image generation w/ dynamic column sparsity (while preserving VBench quality)! 🚀 Open-source framework & CUDA kernels!

🔹 What is Chipmunk?
Chipmunk accelerates Diffusion Transformers (DiT) without additional training through a combination of two techniques: (1) caching and (2) sparsity. Attention and MLP layers cache their outputs, and subsequent steps of the diffusion process are reformulated to compute activation deltas against this cache.

🔹 Why does caching + sparsity work?
DiT activations are naturally sparse and change slowly across diffusion steps. Chipmunk exploits this to cache previous-step activations and compute sparse deltas against this cache directly within the attention and MLP layers – up to 93% dynamic attention sparsity in HunyuanVideo!

At Together AI, we're always exploring the acceleration frontier to serve the highest-quality models at the lowest cost!

📚 Read more: https://lnkd.in/dNN8T6NP
📝 In-depth blog: https://lnkd.in/dicJYXdX
🖥️ GitHub: https://lnkd.in/dAAB44_y
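To make the cache + sparse-delta idea concrete, here is a minimal PyTorch sketch for a single 2-layer MLP, assuming a simple top-k column-selection rule. The class name `DeltaCachedMLP`, the `keep_ratio` parameter, and the `full_step` flag are illustrative, not names from the Chipmunk codebase, and the real implementation relies on custom CUDA kernels (and applies the same reformulation to attention) rather than this dense toy version.

```python
# Toy sketch (not the actual Chipmunk kernels): a cached MLP whose later
# diffusion steps recompute only the intermediate columns whose activations
# changed the most since the cached step, then add that delta to the cache.
import torch
import torch.nn.functional as F


class DeltaCachedMLP(torch.nn.Module):
    """y = gelu(x @ W1) @ W2, with outputs cached across diffusion steps.

    keep_ratio=0.07 keeps ~7% of columns per sparse step, loosely mirroring
    the ~93% dynamic sparsity figure quoted in the post (an assumption here).
    """

    def __init__(self, dim, hidden, keep_ratio=0.07):
        super().__init__()
        self.W1 = torch.nn.Parameter(torch.randn(dim, hidden) / dim ** 0.5)
        self.W2 = torch.nn.Parameter(torch.randn(hidden, dim) / hidden ** 0.5)
        self.keep_ratio = keep_ratio
        self.cached_h = None  # cached intermediate activations
        self.cached_y = None  # cached layer output

    def forward(self, x, full_step: bool):
        # NOTE: this toy version still computes the dense activations to pick
        # columns; the point is the delta-against-cache reformulation, not
        # the sparse kernels themselves.
        h = F.gelu(x @ self.W1)  # (tokens, hidden)

        if full_step or self.cached_h is None:
            # Dense step: compute everything and refresh the cache.
            y = h @ self.W2
            self.cached_h, self.cached_y = h, y
            return y

        # Sparse step: find the hidden columns whose activations moved most.
        delta = h - self.cached_h                  # (tokens, hidden)
        k = max(1, int(self.keep_ratio * delta.shape[-1]))
        scores = delta.abs().sum(dim=0)            # per-column change
        cols = scores.topk(k).indices              # dynamic column set

        # Only the selected columns contribute a delta to the cached output.
        y = self.cached_y + delta[:, cols] @ self.W2[cols, :]

        # Fold the recomputed columns back into the cache.
        self.cached_h[:, cols] = h[:, cols]
        self.cached_y = y
        return y


# Usage: a dense step warms the cache, later steps apply sparse deltas.
mlp = DeltaCachedMLP(dim=64, hidden=256)
x0 = torch.randn(16, 64)
y0 = mlp(x0, full_step=True)
x1 = x0 + 0.01 * torch.randn_like(x0)   # activations drift slowly
y1 = mlp(x1, full_step=False)
```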
