The AI CUDA Engineer
Working in generative AI (LLMs, image generators, etc.) means working with GPUs, and GPUs are a luxury for most of us. If you do get one, you want to make the best use of it, which means optimizing your code to run as fast as possible on the GPU.
CUDA (Compute Unified Device Architecture) is a low-level software layer that gives direct access to the NVIDIA GPU's hardware instruction set for parallel computation, a kind of assembly language for GPUs. CUDA kernels are functions written in the CUDA language (C/C++, with Python wrappers) that run on the GPU, and they are parallel by design: a single kernel launch executes across thousands of GPU threads concurrently.
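To make that threading model concrete, here is a minimal sketch (not from the paper) of a CUDA vector-add kernel compiled and called from PyTorch via torch.utils.cpp_extension.load_inline. The names vec_add and add_cuda are illustrative, and running it requires an NVIDIA GPU with a CUDA toolchain installed:

```python
# Minimal sketch: a CUDA vector-add kernel compiled from Python.
# Kernel and function names are illustrative, not from the paper.
import torch
from torch.utils.cpp_extension import load_inline

cuda_src = r"""
__global__ void vec_add_kernel(const float* a, const float* b,
                               float* out, int n) {
    // Each of the (potentially thousands of) concurrent threads
    // computes exactly one element of the output.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

torch::Tensor add_cuda(torch::Tensor a, torch::Tensor b) {
    auto out = torch::empty_like(a);
    int n = a.numel();
    int threads = 256;                        // threads per block
    int blocks = (n + threads - 1) / threads; // enough blocks to cover n
    vec_add_kernel<<<blocks, threads>>>(
        a.data_ptr<float>(), b.data_ptr<float>(), out.data_ptr<float>(), n);
    return out;
}
"""

mod = load_inline(
    name="vec_add",
    cpp_sources="torch::Tensor add_cuda(torch::Tensor a, torch::Tensor b);",
    cuda_sources=cuda_src,
    functions=["add_cuda"],
)

a = torch.randn(1_000_000, device="cuda")
b = torch.randn(1_000_000, device="cuda")
assert torch.allclose(mod.add_cuda(a, b), a + b)  # matches native PyTorch
```

The launch configuration (blocks x threads) is what spreads the work across the GPU; choosing it well, along with memory access patterns inside the kernel, is where most of the optimization difficulty lives.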
By optimizing computations at the CUDA kernel level, we can significantly improve the speed of AI algorithms. However, writing efficient CUDA code requires deep knowledge of GPU memory hierarchies, threading models, and execution optimization techniques—skills that are relatively rare. In practice, most machine learning workloads rely on high-level frameworks like PyTorch or JAX, which abstract away GPU programming details.
The AI CUDA Engineer comes to the rescue
On 19 Feb 2025, Sakana.ai released "The AI CUDA Engineer", an AI-driven agentic system that automates the discovery, optimization, and composition of CUDA kernels (https://pub.sakana.ai/static/paper.pdf). It translates PyTorch code into efficient CUDA kernels and iteratively optimizes their execution speed using evolutionary search, retrieval-augmented generation (RAG), and profiling-based feedback.
The AI CUDA Engineer works as the following four-step pipeline:
1. Translates raw PyTorch code into a form suitable for kernel generation by converting torch.nn.Module operations into functional representations. This enables the LLMs to reason about kernel optimizations more easily (see the first sketch after this list).
2. Uses LLMs to generate CUDA kernel code from the functional representation, and tests the correctness of the generated code by comparing its output with the PyTorch implementation. Any errors are summarized and fed back into the LLM for refinement (second sketch below).
3. Uses LLM-driven evolutionary search and profiling-based tuning (loop unrolling to reduce branch divergence, memory coalescing to optimize memory access patterns, register pressure management to balance register allocation) to improve the speed of the CUDA kernels. Its improvement strategies include LLM ensembling, crossover prompting, profiling feedback, and shared memory and tensor core utilization (third sketch below).
4. Uses retrieval-augmented kernel composition to improve performance on future tasks. It maintains a database of more than 17,000 optimized CUDA kernels, uses embedding-based retrieval to fetch the best past solutions, and applies in-context learning to guide the LLM toward generating better kernels (fourth sketch below).
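First, here is what the step-1 functional conversion might look like. This is an assumed shape for illustration, not the paper's actual code; the point is that the stateful nn.Module becomes a pure function of its inputs and weights:

```python
# Sketch of step 1 (assumed shape): rewrite a torch.nn.Module as a pure
# function, which is easier for an LLM to reason about than stateful code.
import torch
import torch.nn.functional as F

class Model(torch.nn.Module):          # original stateful module
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(128, 64)

    def forward(self, x):
        return torch.relu(self.linear(x))

def model_fn(x, weight, bias):          # functional representation
    return F.relu(F.linear(x, weight, bias))

m = Model()
x = torch.randn(2, 128)
assert torch.allclose(m(x), model_fn(x, m.linear.weight, m.linear.bias))
```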
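Second, a minimal sketch of the step-2 verification loop, assuming a simple structure in which an error summary string is produced for the next LLM prompt:

```python
# Sketch of step 2 (assumed structure): compare a candidate kernel against
# the PyTorch reference; on failure, return a summary for LLM refinement.
import torch

def verify(kernel_fn, reference_fn, example_inputs, atol=1e-4):
    try:
        got = kernel_fn(*example_inputs)
        want = reference_fn(*example_inputs)
        if not torch.allclose(got, want, atol=atol):
            diff = (got - want).abs().max().item()
            return f"output mismatch: max abs diff {diff:.3e}"
        return None                     # kernel is correct
    except RuntimeError as err:         # e.g. compile or launch failure
        return f"runtime error: {err}"

# Toy usage: a deliberately wrong "kernel" so the feedback path fires.
inputs = (torch.randn(4, 4),)
assert verify(lambda x: x * 2, lambda x: x + x, inputs) is None
msg = verify(lambda x: x * 3, lambda x: x + x, inputs)
print(msg)  # this summary would be appended to the next LLM prompt
```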
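Third, a toy sketch of the step-3 evolutionary search. Here generate_variant is a hypothetical stand-in for the LLM mutation/crossover call, so it simply returns the parent; the real system proposes new kernel code at that point:

```python
# Sketch of step 3 (assumed structure): benchmark a population of kernel
# candidates, keep the fastest, and ask the LLM for variants of the winners.
import random
import time
import torch

def benchmark(fn, inputs, repeats=10):
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*inputs)
    return (time.perf_counter() - start) / repeats

def generate_variant(parent_fn):
    return parent_fn  # placeholder for an LLM-proposed kernel variant

def evolve(seed_fns, inputs, generations=3, population=8, survivors=2):
    pool = list(seed_fns)
    for _ in range(generations):
        ranked = sorted(pool, key=lambda f: benchmark(f, inputs))
        best = ranked[:survivors]       # fastest variants survive
        pool = best + [generate_variant(random.choice(best))
                       for _ in range(population - survivors)]
    return min(pool, key=lambda f: benchmark(f, inputs))

x = (torch.randn(256, 256),)
fastest = evolve([lambda a: a @ a.T, lambda a: torch.matmul(a, a.T)], x)
```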
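Finally, a sketch of the step-4 embedding-based retrieval, with toy embeddings and kernel names standing in for the real 17,000-kernel archive:

```python
# Sketch of step 4 (assumed structure): embed the new task, fetch the most
# similar archived kernels by cosine similarity, and prepend them to the
# generation prompt as in-context examples.
import torch
import torch.nn.functional as F

def retrieve(task_emb, archive_embs, archive_kernels, k=2):
    sims = F.cosine_similarity(archive_embs, task_emb.unsqueeze(0), dim=1)
    top = torch.topk(sims, k).indices
    return [archive_kernels[int(i)] for i in top]

archive_embs = torch.randn(3, 8)        # one embedding per stored kernel
archive_kernels = ["kernel_a.cu", "kernel_b.cu", "kernel_c.cu"]
examples = retrieve(torch.randn(8), archive_embs, archive_kernels)
prompt = "Optimize this kernel. Related past solutions:\n" + "\n".join(examples)
```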
During evaluations, the AI CUDA Engineer translated PyTorch code to CUDA with a 91% success rate, and its optimized kernels ran a median 1.52x faster than native PyTorch code. Individual tasks saw far larger gains: 381x for instance normalization, 147x for lower-triangular matrix multiplication, and 8.9x for cross-entropy loss. Overall, the system can generate CUDA kernels with 10-100x speedups over common PyTorch operations, and some of its kernels are up to 5x faster than CUDA kernels already in common production use.
The AI CUDA Engineer Archive is a dataset (https://huggingface.co/datasets/SakanaAI/AI-CUDA-Engineer-Archive) of more than 17,000 CUDA kernels generated by the AI CUDA Engineer, released under the CC-BY-4.0 license. It includes a torch reference implementation for each task; torch, NCU, and Clang-tidy profiling data; multiple kernels per task; error messages; and speedup scores against torch native and compiled runtimes. The authors have also released an interactive website, https://pub.sakana.ai/ai-cuda-engineer, for inspecting the 17,000+ verified kernels and their profiles. The website lets you explore high-performing kernels across 230 tasks, with a custom leaderboard for inspecting related kernels across experiments and LLMs. You can visualize a kernel, retrieve related kernels, download code to verify the implementation and speedup, view the profiling data, and fully explore each optimization experiment.
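If you want to poke at the archive yourself, it can be loaded with the Hugging Face datasets library. The snippet below only assumes the dataset's repo id, which is given above; the exact splits and column names are not shown here, so inspect the loaded object or the dataset card for the real schema:

```python
# Load the AI CUDA Engineer Archive from the Hugging Face Hub.
from datasets import load_dataset

archive = load_dataset("SakanaAI/AI-CUDA-Engineer-Archive")
print(archive)  # inspect the available splits and columns
```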
Check out details here:
[1] Robert Tjarko Lange, Aaditya Prasad, Qi Sun, Maxence Faldor, Yujin Tang, and David Ha, "The AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition," https://pub.sakana.ai/static/paper.pdf