Congestion & Rate control for AI/ML traffic
RDMA can be carried over InfiniBand or Ethernet, and there is plenty of debate about why Ethernet should be used at all. In a nutshell, carrying RDMA traffic over Ethernet runs into two challenges. The first is elephant flows, which appear while training AI workloads across GPUs connected to an Ethernet cluster. The second is congestion during the synchronization phase of the AI workload, which is mitigated with ECN / PFC.
During the distribution phase of the workload, ECMP distributes packets across all spine-facing uplink ports. Because these flows have limited entropy, the hash may pick the same physical port for most of the communication, leaving the other uplinks under-utilized; ECMP hashing alone cannot spray packets evenly across all those ports (a small sketch of this limitation follows below). If packets are instead sprayed randomly across the links, it is very likely that the packet sequence is broken as packets arrive over different paths, giving birth to out-of-order packets. Sequencing logic then has to be implemented on the RoCE NICs or at the egress port of the switch, as explained below.
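To make the entropy problem concrete, here is a minimal, illustrative Python sketch (the hash function, uplink count and flow tuples are all assumptions, not any switch's actual algorithm) showing how a handful of RoCEv2 elephant flows, each pinned to one queue pair and therefore one UDP source port, can collide on the same uplink under a 5-tuple ECMP hash:

```python
# Minimal sketch (hypothetical values) of why limited entropy in ECMP hashing
# can leave uplinks idle: RoCEv2 fixes the UDP destination port at 4791, so
# if each GPU pair uses a single QP (one UDP source port), a handful of
# elephant flows may all hash onto the same uplink.
import hashlib

UPLINKS = 4  # number of spine-facing ports on the leaf switch (assumption)

def ecmp_pick(src_ip, dst_ip, src_port, dst_port=4791, proto=17):
    """Toy ECMP hash: real switches use a vendor-specific CRC/XOR hash."""
    key = f"{src_ip}-{dst_ip}-{src_port}-{dst_port}-{proto}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % UPLINKS

# Four elephant flows, each pinned to one QP (one UDP source port each).
flows = [("10.0.0.1", "10.0.1.1", 49152),
         ("10.0.0.2", "10.0.1.2", 49152),
         ("10.0.0.3", "10.0.1.3", 49152),
         ("10.0.0.4", "10.0.1.4", 49152)]

used = {ecmp_pick(*f) for f in flows}
print(f"{len(flows)} flows landed on {len(used)} of {UPLINKS} uplinks")
# With so few flows there is a good chance two or more collide on one uplink,
# while per-packet spraying would use all four but break packet ordering.
```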
For example, RDMA UC (Unreliable Connection) offers no such guarantee, while RDMA RC (Reliable Connection) guarantees in-order delivery and is likely to be used most of the time while distributing packets. Below is one scenario that allows us to measure the limit of out-of-order packets.
Figure 1 – calculating max packet reorder gap.
In Figure 1, we have a simple topology with an RDMA packet generator plus an RDMA packet re-order box that can drop, reorder, and corrupt packets, followed by a test port that can generate RoCEv2 traffic at line rate. The goal is to measure the point at which packet sequencing breaks on a switch port: switch port 1 sprays packets over ports 2 and 3, the RDMA packet re-order box breaks the packet sequence, and port 6 reassembles the packets. The ability to reassemble depends on the algorithm implemented (most implementations are proprietary) and, importantly, on the buffer size available to keep packets in order and to track whether all packets have arrived before the reassembly logic gives up.
The same setup can be extended to the NICs, to find the point at which their reassembly logic breaks.
In summary, the case above with the RDMA packet re-order box lets us determine at what point packet tracking degrades on a network port or NIC port before it gives up.
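As a rough illustration of what that reassembly logic has to do, here is a minimal sketch of a PSN-based reorder buffer; the class, its depth and its give-up behaviour are assumptions for illustration, not any vendor's NIC or switch implementation:

```python
# A minimal sketch (not any vendor's implementation) of the reassembly logic
# described above: out-of-order packets are held in a fixed-size buffer keyed
# by PSN (packet sequence number) and released in order; once the buffer is
# full and the gap still has not closed, the logic gives up - which is the
# reorder limit the setup in Figure 1 tries to measure.
class ReorderBuffer:
    def __init__(self, depth=64):
        self.depth = depth      # how many out-of-order packets we can hold
        self.expected = 0       # next PSN we want to deliver
        self.held = {}          # psn -> packet, waiting for the gap to close

    def receive(self, psn, packet):
        if psn < self.expected:
            return []           # duplicate or already-delivered packet
        if psn == self.expected:
            delivered = [packet]
            self.expected += 1
            # Drain any consecutive packets that were already buffered.
            while self.expected in self.held:
                delivered.append(self.held.pop(self.expected))
                self.expected += 1
            return delivered
        if len(self.held) < self.depth:
            self.held[psn] = packet     # hold until the missing PSN arrives
            return []
        raise RuntimeError(f"reorder gap too large at PSN {psn}, giving up")

# Example: PSN 1 arrives before PSN 0; both are delivered once PSN 0 shows up.
rb = ReorderBuffer(depth=4)
print(rb.receive(1, "pkt1"))    # [] - buffered
print(rb.receive(0, "pkt0"))    # ['pkt0', 'pkt1']
```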
During the synchronization phase of an AI workload, the receiver will eventually hit congestion. Before the NIC experiences it, the network port will detect it, and here we can use PFC to slow down the transmitter; however, congestion spreading is a challenge, as explained below.
Figure 2 – xOFF congestion spreading.
In Figure 2, you can see that when port 9 on switch S2 is congested, an xOFF message is triggered and propagates towards switch S1, which effectively slows down traffic between the servers connected to switch ports 1 and 5. This is PFC congestion spreading across the cluster. Hence it is important to use ECN mechanics to slow the sender down before PFC is triggered, so that congestion spreading is avoided.
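To get a feel for why xOFF thresholds matter, here is a rough, back-of-the-envelope sketch of the per-priority headroom a port needs so that bytes still in flight after xOFF is sent do not overflow the buffer; all constants are illustrative assumptions, not vendor datasheet values:

```python
# Rough sketch of PFC (xOFF) headroom sizing. All values are illustrative
# assumptions; real switches publish their own per-port/per-priority numbers.
LINK_SPEED_BPS     = 400e9      # 400 GbE port
CABLE_LENGTH_M     = 100        # leaf-to-spine cable length (assumption)
PROP_DELAY_S_PER_M = 5e-9       # ~5 ns per metre in fibre
MTU_BYTES          = 4096       # typical RoCEv2 MTU
PFC_FRAME_BYTES    = 64
RESPONSE_DELAY_S   = 1e-6       # peer's pause-reaction latency (assumption)

# Bytes still arriving after xOFF is sent: round-trip propagation plus the
# peer's processing delay before it actually stops, plus packets already
# serialized on the wire in each direction and the PFC frame itself.
rtt = 2 * CABLE_LENGTH_M * PROP_DELAY_S_PER_M
in_flight = (rtt + RESPONSE_DELAY_S) * LINK_SPEED_BPS / 8
headroom = in_flight + 2 * MTU_BYTES + PFC_FRAME_BYTES

print(f"headroom per priority ~ {headroom / 1024:.1f} KiB")
```

If the xOFF threshold is set too late relative to this headroom, the port drops packets anyway; if it is set too early, xOFF fires often and spreads congestion upstream as in Figure 2.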
Once ECN-marked packets are emitted by the switches, the RoCE NICs on the servers run algorithms to process them and slow down the traffic. As seen in Figure 3 below, congestion at port 3 is first detected by the switch, which starts generating ECN marks; the switch threshold plays an important role in deciding at what point the first ECN mark is generated, and we come back to that tuning below.
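A common way switches implement this threshold is WRED-style probabilistic ECN marking between a minimum and a maximum queue depth; the sketch below uses made-up Kmin/Kmax/Pmax values purely to illustrate the shape of the knob:

```python
# Minimal sketch of WRED-style ECN marking on a switch queue.
# Kmin, Kmax and Pmax are illustrative values; real switches expose
# equivalent per-queue knobs with their own units (often cells).
import random

K_MIN_CELLS = 100     # below this queue depth, never mark
K_MAX_CELLS = 400     # above this queue depth, always mark
P_MAX       = 0.2     # marking probability reached at K_MAX

def should_mark_ecn(queue_depth_cells):
    if queue_depth_cells <= K_MIN_CELLS:
        return False
    if queue_depth_cells >= K_MAX_CELLS:
        return True
    # Marking probability rises linearly between Kmin and Kmax.
    p = P_MAX * (queue_depth_cells - K_MIN_CELLS) / (K_MAX_CELLS - K_MIN_CELLS)
    return random.random() < p

print(should_mark_ecn(50), should_mark_ecn(250), should_mark_ecn(500))
```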
Figure 3. ECN vs CNP vs RP
ECN-marked frames are processed by the receiving RoCE NICs, so the rate at which CNPs (Congestion Notification Packets) are generated is very important. Here we can experiment with CNP generation at 100 µs, 50 µs and 10 µs intervals to push those messages towards NIC-B/NIC-C so that they slow down their rate. The switch threshold for generating ECN versus the corresponding CNP generation on the NIC is a crucial tuning point to make sure PFC is not triggered and packets are not lost, and the exact math depends on the underlying hardware. We try to avoid PFC as much as possible through this ECN vs CNP adjustment; since switches come with different buffer sizes, and R-NICs from different vendors likewise have limited buffers, this becomes an experimentation point and lets many vendors show off their own hardware.

Finally, the sender NICs B/C process those CNPs to slow down their rate; this is called the reaction point algorithm. Once CNPs are received, the current rate is roughly cut in half; once CNPs stop arriving, the rate is increased again, either by additive increase or by hyper increase. In contrast to DCTCP, which uses ACKs to adjust the window size up or down, we do not have that liberty in this implementation; we depend on the rate of CNP packets to make the R-NICs slow down, with various proprietary implementations (a simplified sketch follows below).
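To tie the reaction point behaviour together, here is a deliberately simplified, DCQCN-style sketch of a sender-side rate controller; the constants and the recovery schedule are illustrative assumptions, not any R-NIC's proprietary implementation:

```python
# Simplified DCQCN-style reaction point: cut the rate when CNPs arrive,
# then recover with additive / hyper increase when they stop.
# All constants are illustrative; real R-NICs tune these per hardware.
LINE_RATE_GBPS = 400.0
ALPHA_G        = 1 / 16          # gain used to update alpha
R_AI_GBPS      = 5.0             # additive increase step
R_HAI_GBPS     = 50.0            # hyper increase step after many quiet periods

class ReactionPoint:
    def __init__(self):
        self.current = LINE_RATE_GBPS   # current sending rate
        self.target = LINE_RATE_GBPS    # rate we are recovering towards
        self.alpha = 1.0                # estimate of congestion severity
        self.quiet_periods = 0          # timer periods without a CNP

    def on_cnp(self):
        """CNP received: remember the old rate and cut the current rate."""
        self.target = self.current
        self.current *= (1 - self.alpha / 2)   # ~half the rate when alpha ~ 1
        self.alpha = (1 - ALPHA_G) * self.alpha + ALPHA_G
        self.quiet_periods = 0

    def on_timer(self):
        """No CNP in this period: recover towards (and then beyond) target."""
        self.alpha = (1 - ALPHA_G) * self.alpha
        self.quiet_periods += 1
        if self.quiet_periods > 5:
            self.target += R_HAI_GBPS          # hyper increase
        else:
            self.target += R_AI_GBPS           # additive increase
        self.target = min(self.target, LINE_RATE_GBPS)
        self.current = (self.current + self.target) / 2   # fast recovery step

rp = ReactionPoint()
rp.on_cnp()
print(f"after CNP:  {rp.current:.0f} Gbps")
rp.on_timer(); rp.on_timer()
print(f"recovering: {rp.current:.0f} Gbps")
```

How aggressively the rate is cut and how quickly it recovers is exactly the trade-off between throughput and keeping queues below the ECN and PFC thresholds discussed above.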
In summary, it is crucial to experiment with the various knobs at the CP (congestion point), NP (notification point) and RP (reaction point), together with the PFC and CNP settings, to get maximum performance out of the hardware, which in turn opens the door for test tools.
Sales & Business Development Leader | MDI Alumnus | Regional sales head leading sales for Reliance Jio, Bharti Airtel, Tata Group, Government, System Integrators & OEM' | Channel management
1yExcellent Post Ravi Patil
Senior Product manager for Automotive test solutions.
1yExcellent post highlighting the complexities in this rapidly evolving technology landscape.
Engineering/PO/PM -Technical Lead at Cisco Systems
1yexcellent article,keep up the good work.