Modern AI, graphics, and data-intensive workloads often require high-performance GPU resources. But GPUs can be costly, power-hungry, and frequently underutilized if dedicated to a single VM or container. Fortunately, NVIDIA provides multiple ways to share and virtualize GPU hardware across different workloads, each with its own advantages and trade-offs.
In this post, I’ll walk you through four main approaches to GPU virtualization on NVIDIA hardware:
- GPU Passthrough
- Multi-Instance GPU (MIG)
- NVIDIA vGPU (time-sharing mode)
- Software Time-Slicing without vGPU
We’ll explore how each option works, the pros and cons, licensing considerations, and which use cases they serve best.
1. GPU Passthrough
- GPU Passthrough gives one virtual machine (VM) direct and exclusive access to a physical GPU. The hypervisor (VMware, KVM, Hyper-V, etc.) “passes” the PCI device through to the guest OS (e.g., via VT-d/IOMMU); a KVM/VFIO sketch follows this list.
- The guest OS installs the standard NVIDIA driver as if the GPU were bare-metal.
- Performance: Nearly native. Little to no overhead, ideal for compute-heavy workloads or demanding applications that need full GPU capacity.
- Isolation: Excellent. One VM owns one physical GPU, so no risk of interference from other VMs.
- Scalability Limitation: Each physical GPU is dedicated to a single VM. If you need 10 GPU-enabled VMs simultaneously, you must install 10 physical GPUs.
- No Additional NVIDIA Licensing: Standard data center drivers typically suffice. No vGPU license required.
- Migration Challenges: Live migration (vMotion, Live Migration in Hyper-V) typically doesn’t work with direct pass-through devices.
- Best for: single-user or single-tenant scenarios where each VM demands full GPU performance (e.g., HPC compute, large AI training).
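For a concrete sense of the setup, here is a minimal sketch of preparing a GPU for passthrough on a Linux/KVM host with VFIO. The PCI address (3b:00.0) and vendor:device ID (10de:20b0) are placeholders; substitute whatever lspci reports on your host.

```bash
# Find the GPU's PCI address and vendor:device ID (values below are examples)
lspci -nn | grep -i nvidia

# Enable the IOMMU via the kernel command line, then reboot (Intel example):
#   GRUB_CMDLINE_LINUX="... intel_iommu=on iommu=pt"

# Bind the GPU to vfio-pci instead of the NVIDIA driver
echo "options vfio-pci ids=10de:20b0" | sudo tee /etc/modprobe.d/vfio.conf
sudo update-initramfs -u    # Debian/Ubuntu; RHEL-family hosts use dracut -f

# After rebooting, confirm vfio-pci owns the device, then attach it to a VM,
# e.g. with libvirt/virt-install: --hostdev pci_0000_3b_00_0
lspci -nnk -s 3b:00.0 | grep "Kernel driver in use"
```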
2. Multi-Instance GPU (MIG)
- MIG is a hardware-based partitioning feature available on certain NVIDIA Ampere/Hopper GPUs (e.g., A100, A30, H100). It lets you subdivide a single physical GPU into multiple independent GPU “instances,” each with dedicated SMs, memory slices, cache, copy engines, and address spaces.
- On a server with a MIG-capable GPU, an administrator uses nvidia-smi (or similar tools) to carve the GPU into up to 7 separate partitions (on an A100); see the command sketch after this list. Each partition has guaranteed compute resources and memory.
- These partitions can be exposed to different VMs or containers, each seeing a “slice” of the GPU hardware as if it were a smaller standalone GPU.
- Hardware-Level Isolation: Each MIG instance is strongly isolated. A fault (e.g., a driver crash) in one partition won’t typically bring down the entire GPU.
- Predictable Performance: Because MIG physically segments the GPU resources (SMs, caches, memory), each partition’s performance is stable and unaffected by neighbors.
- Static Partitioning: If one partition is idle, the other partitions cannot borrow its unused GPU capacity. Reconfiguring MIG instances requires shutting down workloads and recreating partitions.
- Licensing: Using MIG alone does not require a separate vGPU license.
- GPU Model Constraints: Only certain high-end data center GPUs (A100, A30, H100) support MIG. No MIG on A40, A10, L40, etc.
- Best for: AI/ML multi-tenant environments that require predictable performance, guaranteed slices, and robust fault isolation (e.g., HPC clusters, AI training/inference services).
- Organizations that want to subdivide big GPUs into 2–7 pieces, each for a different VM or container.
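As a quick illustration, here is roughly what carving up an A100 looks like with nvidia-smi. Profile IDs and names vary by GPU model; ID 9 (the 3g.20gb profile on an A100-40GB) is used here purely as an example.

```bash
# Enable MIG mode on GPU 0 (workloads must be stopped; a GPU reset may be required)
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles this GPU supports, with their IDs
sudo nvidia-smi mig -lgip

# Create two 3g.20gb GPU instances and their default compute instances (-C)
sudo nvidia-smi mig -cgi 9,9 -C

# Each MIG device now appears with its own UUID for containers/VMs to target
nvidia-smi -L
```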
3. NVIDIA vGPU (Time-Sharing Mode)
- NVIDIA vGPU is a software-based virtualization solution that allows multiple VMs to share a single physical GPU by time-slicing compute resources. Each VM sees a “virtual GPU” with a defined amount of framebuffer memory.
- The hypervisor (VMware ESXi, Citrix Hypervisor, KVM, Proxmox, etc.) runs the “vGPU Manager,” which schedules GPU compute cycles among the VMs’ GPU contexts; a KVM example follows this list.
- The total GPU memory is partitioned into profiles (e.g., 4 GB, 8 GB, etc.), and each VM is assigned a vGPU profile.
- CUDA cores, Tensor Cores, and other accelerators are allocated in short time quanta. If a VM is idle, others can make use of the GPU. If all VMs are active, they share the GPU’s compute cycles round-robin style (or according to NVIDIA’s scheduling policy).
- Flexibility: When one VM is busy and others are idle, the busy VM may harness close to 100% of the GPU. This can lead to high overall utilization.
- Licensing: Requires an NVIDIA vGPU license (e.g., Virtual Compute Server for AI/ML workloads, vWS for professional 3D visualization, vPC for virtual desktops). Without a license, the vGPU runs with reduced functionality after a short grace period (roughly 20–30 minutes).
- Overhead & Shared Performance: Under concurrent heavy loads, tasks can experience context-switch overhead and unpredictable performance (compared to MIG).
- Memory Partitioning: Each vGPU profile has a fixed portion of the GPU’s VRAM, and there’s no oversubscription of VRAM. If a VM demands more than its allocated memory, it may fail or degrade.
- Best for: Virtual Desktop Infrastructure (VDI) with multiple 2D or 3D sessions sharing one or more GPUs.
- AI or data science environments where multiple VMs want on-demand GPU access.
- Enterprises using VMware, Citrix, or KVM-based infrastructures with robust management tooling.
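To make this concrete, here is a hedged sketch of creating a vGPU instance on a KVM host where the NVIDIA vGPU Manager (the host driver package) is already installed. The PCI address and the nvidia-474 profile directory are illustrative; actual type names depend on your GPU and driver release, and on Ampere and later GPUs the mdev nodes appear under SR-IOV virtual functions rather than the physical device.

```bash
# List the vGPU profiles the host driver exposes for this physical GPU
ls /sys/class/mdev_bus/0000:3b:00.0/mdev_supported_types

# Inspect a profile's human-readable name (names are driver/GPU specific)
cat /sys/class/mdev_bus/0000:3b:00.0/mdev_supported_types/nvidia-474/name

# Create a vGPU instance by writing a fresh UUID to the profile's create node
UUID=$(uuidgen)
echo "$UUID" | sudo tee /sys/class/mdev_bus/0000:3b:00.0/mdev_supported_types/nvidia-474/create

# The resulting mediated device is then attached to a VM, e.g. via a libvirt
# <hostdev mode='subsystem' type='mdev'> element referencing $UUID
```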
4. Software Time-Slicing (Without vGPU)
- At a low level, NVIDIA drivers allow multiple processes or containers to run concurrently on the same GPU by time-slicing GPU kernels. This can be configured in Kubernetes or other orchestrators as “oversubscription”; see the example after this list.
- No hardware partitioning or specialized vGPU software mediates access; the OS or container runtime simply sees /dev/nvidia0 as a shared device.
- No Additional Cost: This approach is free (beyond owning the GPU itself). No specialized vGPU licensing needed.
- Weak Isolation: Memory is not strongly partitioned. One process can allocate most VRAM, leaving little for others. A crash may reset the entire GPU.
- Unpredictable Performance: Processes share GPU time “cooperatively.” Under heavy concurrent workloads, you can see highly variable performance.
- Recommended for Trusted or Development Environments: Great for small teams or test clusters with moderate GPU demands. Less suitable for multi-tenant production with SLAs.
- Best for: containerized ML or inference tasks in Kubernetes, where multiple pods share one GPU in a controlled environment.
- Development or CI/CD pipelines needing occasional GPU bursts without needing formal partitioning or licensing.
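As one common example, the NVIDIA Kubernetes device plugin can advertise a single physical GPU as several schedulable replicas. The sketch below assumes the plugin is deployed via its Helm chart and pointed at this ConfigMap; the ConfigMap name and the "any" config key are assumptions for illustration.

```bash
# Advertise each physical GPU as 4 schedulable nvidia.com/gpu replicas
cat <<'EOF' | kubectl apply -n kube-system -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
EOF

# Pods still request nvidia.com/gpu: 1, but up to four can now land on one
# physical GPU; note there is no memory or fault isolation between them
```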
Comparing Key Factors

| Factor | Passthrough | MIG | NVIDIA vGPU | Time-Slicing (no vGPU) |
|---|---|---|---|---|
| Performance | Near-native | Predictable, hardware-guaranteed | Shared; overhead under concurrent load | Variable under load |
| Isolation | Excellent (VM owns the GPU) | Strong, hardware-level | Moderate (fixed VRAM, shared compute) | Weak |
| Sharing granularity | None: one VM per GPU | Up to 7 static partitions | Multiple VMs per GPU, dynamic | Multiple processes/pods per GPU |
| Extra NVIDIA license | No | No | Yes | No |
| Typical fit | Dedicated HPC/AI VMs | Multi-tenant AI with guarantees | VDI, enterprise virtualization | Dev/test, trusted containers |
Licensing Deep-Dive
- Passthrough and MIG (without vGPU) generally do not require additional NVIDIA licenses—just the standard data center GPU and driver.
- NVIDIA vGPU (in any mode, including MIG-backed vGPU) does require a subscription or license for each GPU or VM. For AI workloads, that’s typically NVIDIA Virtual Compute Server (vCS).
- Software Time-Slicing (no vGPU) similarly doesn’t need a vGPU license, but is officially supported only in limited “best effort” form.
Practical Recommendations
- If you just need a single GPU per VM at near-native performance: Passthrough is simplest. No licenses needed, easy to set up.
- If you have an A100 or A30 and want hardware-based multi-tenant isolation: MIG is your friend. You get minimal overhead, robust isolation, and guaranteed slices.
- If you want multiple VMs or VDI sessions with dynamic GPU usage: NVIDIA vGPU is the standard approach, particularly in VMware/Citrix environments. You’ll pay for vGPU licenses, but gain robust features (monitoring, scheduling, integration with enterprise virtualization tools).
- If you have containers or dev workloads and are comfortable with limited isolation: Time-Slicing without vGPU can be a great fit, with no extra licensing or overhead. But concurrency and memory usage can become unpredictable.
Final Thoughts
Modern GPUs are exceptionally powerful but can be underutilized if dedicated to a single workload. GPU virtualization allows you to maximize ROI on your GPU investments and scale up AI, rendering, or virtual desktops more efficiently.
- Passthrough (no license, near-native speed) is ideal for dedicated usage scenarios.
- MIG offers hardware-based partitioning with strong isolation, specifically on A100/A30/H100.
- vGPU unlocks time-sharing plus enterprise-level orchestration (VMware, Citrix, KVM) but requires paid licenses.
- Software Time-Slicing is a no-cost, minimal-setup method for container-based GPU sharing—just keep in mind the lack of guaranteed isolation and performance predictability.
Choosing the right method comes down to performance requirements, security/isolation needs, budget for licenses, and the nature of your workloads. If your environment demands high concurrency with minimal overhead, MIG or vGPU might be perfect. If you only need a single GPU per VM, Passthrough may suffice. And if you’re containerizing everything in a trusted environment, simple driver-level time-slicing might work fine.
Have you tried any of these GPU virtualization methods? Which approach best suits your workloads? Feel free to share your experiences and challenges in the comments below!