Threads, Tensor Cores, and Beyond - Unveiling the Dynamics of GPU Memory in HPC
In my quest to understand the intricacies of GPU architecture, I unearthed some profound insights into how it influences the design of distributed High-Performance Computing (HPC) systems. After demystifying the GPU's inner workings, it became evident that the implications for HPC are substantial. Before we explore those implications, let's start with a critical factor: compute intensity...
Compute Intensity
Programs require data to generate valuable output. That data consists of the values of the program's variables. For instance, a task adding A to B fetches the values of A and B from memory before executing the operation, then stores the result back into memory.
Accessing memory is a process that has a cost in terms of time. Diverse memory types exist, each with different bandwidth and response time. Common intuition dictates that the more time it takes to retrieve data from memory, the more careful we need to be, as it can significantly slow down the overall execution of a task. When dealing with data that sits in high-latency memory, it becomes crucial to assess whether the retrieval time is justified, a question captured by a metric called compute intensity.
Compute intensity measures how effective a computational task can be on a given machine: it is the ratio of the floating-point operations a system can execute to the volume of data it must transfer to and from memory. The calculation is straightforward: divide the peak FLOP rate by the memory bandwidth, which gives FLOPs per byte.
A higher value indicates that data is expensive to obtain relative to the available compute, prompting the question of whether enough operations are performed on each value to justify the effort of fetching it.
Consider the comparison between NVIDIA A100 and Intel Xeon 8280:
For the A100, that works out to about 100 operations for every value loaded from memory just to break even. If it executes fewer, the processor sits idle, wasting resources.
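To make that concrete, here is a rough sketch of the break-even calculation in Python. The peak FP64 rates (19.5 TFLOP/s for the A100, 2.19 TFLOP/s for the Xeon 8280) and the A100's roughly 1,555 GB/s bandwidth are approximate published figures I am adding for illustration; the 131 GB/s figure for the Xeon appears again in the latency discussion below.

```python
# Rough break-even ("compute intensity") calculation.
# The peak FLOP rates and the A100 bandwidth are approximate published FP64 figures (assumptions).
def ops_per_value(peak_flops, mem_bandwidth_bytes_per_s, bytes_per_value=8):
    """Operations needed per FP64 value loaded to keep the chip from going idle."""
    return peak_flops / mem_bandwidth_bytes_per_s * bytes_per_value

a100 = ops_per_value(19.5e12, 1555e9)      # ~100 ops per value loaded
xeon_8280 = ops_per_value(2.19e12, 131e9)  # ~134 ops per value loaded

print(f"A100 break-even:      ~{a100:.0f} FLOPs per value")
print(f"Xeon 8280 break-even: ~{xeon_8280:.0f} FLOPs per value")
```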
Latency
In addition to the memory bandwidth, we want to consider how long the memory takes to service a request. It's like going to buy milk at the supermarket: measuring how fast you can get there won't tell you how long you'll spend inside the store.
With a memory bandwidth of 131 GB/s and a memory latency of 89 ns, the Intel Xeon 8280 can move 11,659 bytes in those 89 ns. In contrast, our basic operation A + B only moves 16 bytes per 89 ns. This implies that if no additional memory requests are issued within those 89 ns, the memory system is used at only 0.14% of its capacity. The situation is even worse for a GPU, as the latency of its main memory (HBM) is 404 ns.
What is happening here? Are we being sold more power than we can actually use? No. By running a large number of threads concurrently, and therefore keeping many memory requests in flight, GPUs and CPUs can take full advantage of their memory bandwidth.
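How many threads is that in practice? A minimal sketch using Little's law (bytes in flight = bandwidth × latency); the A100 bandwidth of roughly 1,555 GB/s is my assumption, while the other figures come from the text above.

```python
# Little's law: bytes that must be in flight = bandwidth * latency.
# 131 GB/s, 89 ns and 404 ns come from the article; ~1555 GB/s for the A100 is an assumed spec.
def concurrency_needed(bandwidth_bytes_per_s, latency_s, bytes_per_request=16):
    """How many concurrent 16-byte 'A + B' loads keep the memory bus fully busy."""
    bytes_in_flight = bandwidth_bytes_per_s * latency_s
    return bytes_in_flight, bytes_in_flight / bytes_per_request

xeon_bytes, xeon_loads = concurrency_needed(131e9, 89e-9)
a100_bytes, a100_loads = concurrency_needed(1555e9, 404e-9)

print(f"Xeon 8280: {xeon_bytes:,.0f} bytes in flight -> ~{xeon_loads:,.0f} concurrent A+B loads")
print(f"A100:      {a100_bytes:,.0f} bytes in flight -> ~{a100_loads:,.0f} concurrent A+B loads")
```

In other words, the A100 needs tens of thousands of memory requests outstanding at any moment, which is exactly what a sea of threads provides.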
Threads and memory on a GPU
Within the GPU architecture, there are a few key components worth noting. Streaming Multiprocessors (SMs) are the fundamental computational units of NVIDIA GPUs; each one contains many CUDA cores and Tensor Cores and is responsible for executing parallel processing tasks. These units excel at handling numerous simple tasks simultaneously, making them suitable for parallel workloads like graphics rendering, scientific simulations, and machine learning; this is where threads are executed.
An SM enables parallel processing through SIMT (Single Instruction, Multiple Threads) execution, NVIDIA's SIMD-style model in which groups of 32 threads, called warps, perform the same operation on different pieces of data concurrently. Moreover, an SM keeps many warps resident at once and switches between them, so thousands of threads can be in flight on a single SM.
Tensor Cores, specialized hardware units within NVIDIA GPUs, rely on several types of memory for efficient matrix operations. Global memory (HBM) is the largest and slowest memory in the GPU hierarchy, used for data accessible by all threads and all parts of the GPU. Shared memory, a small on-chip space that sits alongside the L1 cache on each SM, is much faster and is shared among the threads of a thread block, facilitating efficient communication between them. Registers, the small and fast storage located right next to the processing units (CUDA cores and Tensor Cores), hold local variables with very low access latency, ideal for frequently accessed data.
It's crucial to understand that Tensor Cores don't have dedicated memory but leverage the GPU's memory hierarchy, incorporating global memory, shared memory, and registers to efficiently perform matrix operations.
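To make threads, registers, and global memory concrete, here is a minimal sketch using Numba's CUDA support (my choice of tool for illustration; the kernel, sizes, and names are not from the article). Each thread handles one element: its scalars live in registers, while the arrays it reads and writes live in global memory (HBM).

```python
# Minimal GPU kernel sketch with Numba (assumed to be installed, with a CUDA-capable GPU).
import numpy as np
from numba import cuda

@cuda.jit
def scaled_add(a, b, out, alpha):
    i = cuda.grid(1)               # this thread's global index across all blocks
    if i < out.shape[0]:
        x = a[i]                   # load from global memory (HBM)
        y = b[i]                   # load from global memory (HBM)
        out[i] = alpha * x + y     # x, y, alpha live in registers; result stored back to HBM

n = 1 << 20
a = np.random.rand(n)
b = np.random.rand(n)
out = np.zeros(n)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block   # about a million threads in total
scaled_add[blocks, threads_per_block](a, b, out, 2.0)        # Numba handles host<->device copies
```

Launching a million threads for a million elements is not overkill; as the latency discussion showed, the hardware needs that oversubscription to keep its memory busy.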
GPU vendors balance their systems to make sure there are enough threads running to take full advantage of the available memories, considering both bandwidth and latency, as we have seen earlier.
Large models and multi-GPUs
Up to this point, we've highlighted the need to size the number of threads appropriately relative to the available memory. But what happens when a model becomes so large that it exceeds the capacity of a single GPU? The parameter count of a modern Large Language Model (LLM) reaches hundreds of billions, exceeding the GPU memory of an individual device or host.
For instance, the OPT-175B model demands 350 GB of GPU memory solely to accommodate its model parameters, not to mention additional GPU memory required for gradients and optimizer states during training, potentially pushing memory needs beyond 1 TB.
As a very rough rule of thumb, you need about twice as much GPU memory as the model's parameters occupy. Consequently, GPUs need to be interconnected.
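A quick back-of-the-envelope check of those numbers (the FP16 weights and the 80 GB per GPU are my assumptions, the latter matching an A100-80GB card):

```python
# Back-of-the-envelope GPU count for OPT-175B. FP16 weights and 80 GB per GPU are assumptions.
import math

params = 175e9                                # OPT-175B parameter count
bytes_per_param = 2                           # FP16 weights
weights_gb = params * bytes_per_param / 1e9   # ~350 GB, matching the figure above
needed_gb = 2 * weights_gb                    # the rough "twice the parameters" rule of thumb
gpu_memory_gb = 80                            # e.g., one A100-80GB

print(f"Weights alone: {weights_gb:.0f} GB, rule-of-thumb requirement: {needed_gb:.0f} GB")
print(f"GPUs needed:   {math.ceil(needed_gb / gpu_memory_gb)}")   # ~9 GPUs just for capacity
```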
As previously mentioned, Tensor Cores exhibit remarkable speed. Yet, in practical scenarios, administrators overseeing large distributed High-Performance Computing (HPC) environments note that Tensor Cores are often idle, awaiting data from global memory or remote GPUs. It's common to observe Tensor Core utilization ranging from 45-65%, indicating that even for substantial neural networks, Tensor Cores remain idle about 50% of the time. This is because when data moves to another GPU, the total round trip time (link plus memory latency) is high.
However, having a large bandwidth alone is not sufficient.
Parallel Processing and East-West Traffic
The algorithms designed to take advantage of multiple GPUs are as important as the infrastructure they run on. They create the need for any-to-any GPU communication, which is complex to realize at very large scale. Let's see why this happens.
Distributed learning is a highly intricate subject, with numerous strategies currently under active research for distributing computations across GPUs. A common approach is to slice the computational graph by layer, known as pipeline parallelism, where each GPU is assigned the execution of one or more consecutive Deep Learning (DL) layers. Alternatively, tensor parallelism splits the work inside each layer, so that every GPU holds and computes only a slice of a layer's weights.
Once the data is distributed, two collective operations take place periodically and place very high demands on the interconnect fabric (a small code sketch of both follows their descriptions below):
All-Gather: The primary purpose of the all-gather operation is to collect data from multiple nodes and distribute it to all nodes. Each node contributes its data, and the result is that all nodes receive a copy of the combined data.
All-Reduce: The primary purpose of the all-reduce operation is to perform a reduction operation (e.g., sum, average, maximum) on data distributed across multiple nodes and then distribute the result to all nodes. It combines data from multiple nodes into a single result that is shared by all nodes.
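Here is a minimal, CPU-only sketch of both collectives using PyTorch's torch.distributed. The gloo backend, four processes, and the tiny tensors are illustrative assumptions so the example runs on a laptop; in a real cluster the same calls would run over NCCL across GPUs.

```python
# Minimal sketch of all-gather and all-reduce with torch.distributed (gloo backend, CPU only).
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Each rank owns a different shard of data (think: its slice of gradients or activations).
    shard = torch.full((4,), float(rank))

    # All-gather: every rank ends up with a copy of every other rank's shard.
    gathered = [torch.zeros(4) for _ in range(world_size)]
    dist.all_gather(gathered, shard)

    # All-reduce: shards are summed element-wise and every rank receives the same result.
    reduced = shard.clone()
    dist.all_reduce(reduced, op=dist.ReduceOp.SUM)

    if rank == 0:
        print("gathered:", [t.tolist() for t in gathered])
        print("all-reduced:", reduced.tolist())
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

Both collectives touch every rank, which is why any-to-any bandwidth and latency across the fabric matter so much.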
Here we see the concept of disaggregated computing elements: the memory, CPUs, GPUs, and storage of different servers are pooled together, which increases the need for East-West communication. Any-to-any traffic patterns become prominent, and they are very challenging to accommodate in a non-blocking fashion in very large fabrics.
Performance is a combination of effective algorithms and well-designed infrastructure.
The HPC fabric
Currently, NVIDIA stands as the exclusive provider of InfiniBand solutions, having acquired Mellanox, the last independent supplier of InfiniBand products, in 2019. On the other side, the Ethernet community comprises nearly everyone else. This year, they established the Ultra Ethernet Consortium (UEC) and are actively working to present a robust alternative to InfiniBand. Within the UEC's working groups, particularly the Physical, Link, and Transport Layer groups, there is a clear emphasis on their objective: to minimize latency and enhance bandwidth. The significance of these objectives is now well understood.
Going back to our compute intensity yardstick:
If the memory bandwidth or the fabric latency is variable, the memory efficiency of every GPU suffers, and the system delivers lower than expected performance. This typically happens because of over-subscription or poor quality of service in the core interconnect of large fabrics. Core fabrics need determinism so that optimized distributed algorithms can count on stable latency between any two GPUs.
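Here is a small sketch of why that determinism matters, reusing Little's law from earlier (the latency values other than 404 ns are illustrative assumptions): if the amount of data kept in flight is sized for the expected latency and congestion pushes the observed latency up, the delivered bandwidth drops proportionally.

```python
# Illustrative only: delivered bandwidth = bytes in flight / observed latency (Little's law).
bytes_in_flight = 1555e9 * 404e-9         # sized to saturate ~1555 GB/s at the expected 404 ns
for latency_ns in (404, 800, 2000):       # expected, mildly congested, badly congested (assumed)
    delivered_gb_s = bytes_in_flight / (latency_ns * 1e-9) / 1e9
    print(f"observed latency {latency_ns:>4} ns -> delivered ~{delivered_gb_s:,.0f} GB/s")
```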
Conclusion
We have seen how compute intensity and memory efficiency affect the performance of GPUs, and we have extended that to distributed systems to understand how significantly they impact HPC designs. We have also discussed a few algorithms that cause any-to-any communication to take place. For fabric links and switching, the choice between Ethernet and InfiniBand for handling East-West traffic remains an open question, with developments in the Ultra Ethernet Consortium aiming to provide a robust alternative. The evolving landscape of HPC promises an intriguing future.
Sources:
NVIDIA: How GPU Computing Works