Low Latency High-speed Systems - The Kernel Bypass Approach

Ijaz - our system currently has a latency of under 100 microseconds, and we're eager to reduce it even further!


As an engineer with a long history in software development, I had to ask, "How do you achieve it?" Mr. Someone replied, "I'm not on the engineering team, but we use a technique called kernel bypass."

Kernel bypass... oh gosh, you guys are so smart. How did I not know about this? Am I missing out on something important? 😢 It almost sounds like hacking or patching the Linux kernel! I had to dig into this kernel bypass thing.

End of story, back to the topic...

What are Low Latency Systems?

Low latency systems are designed to minimize the delay (latency) between an input or request and the corresponding output or response. These systems are crucial in applications where timely processing and rapid response are essential.

When discussing low-latency systems, we refer to systems connected to an external data source that can be geographically distant. The goal is not only to quickly acquire this data but also to promptly make it available to the target application, potentially yielding an output. In such systems, the key variables contributing to latency include:

Total latency = data travel time + data acquisition time + data processing time.

[Figure: General low-latency systems interconnect]

Some of these variables can be mitigated; others cannot. For instance, data travel time can be minimized by relocating the receiver geographically close to the data producers and adding high-speed data links. (Ever wondered why L1 caches are faster? They are located closest to the processor cores and leverage static RAM technology.)

The data receiver itself is obviously just another computer, which may or may not have specialized software or hardware components. Typically, however, it includes standard components: a network card, an operating system, and the target application.

Are we limiting this discussion to Ethernet as the communication medium? Yes, for now, but as we'll see later, the concept is more general and not limited to Ethernet.

Speaking of quick data acquisition, with the latest technological advancements, do you know what bandwidth Network Interface Cards (NICs) currently support? Any guess? 10GbE? 25GbE?... They have already surpassed 200GbE! Yes, that's true.

Let me give you a few examples.

  • Marvell Nova 2 Optical DSP: This optical DSP enables 1.6 Tbps of bandwidth with 200 Gbps electrical and optical lanes, doubling the bandwidth of current 800 Gbps optical modules.
  • Lumentum's 200G per Lane Components: Lumentum has introduced 200G indium phosphide (InP) transceiver components that enable 800G and 1.6T optical transceivers.
  • Broadcom's 1.6T Ethernet Switches: These switches are designed to handle 1.6 Tbps bandwidth, supporting 200 Gbps interfaces.

That's a tremendous amount of speed capability, isn't it? Yes, very high-speed networks use specialized NICs (a few were mentioned above). So what's the problem then? The faster the NIC gets, the less time is available to process each packet, resulting in a tighter time budget. If your processing unit (CPU) cannot keep up with the incoming packets, it will inevitably fall behind. Below is the time budget for standard Ethernet speeds:

[Figure: Time budget for standard Ethernet speeds, assuming the standard Ethernet packet size of 1,500 bytes (12,000 bits)]
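For intuition, the budget is simply time per packet = packet size ÷ line rate. With 12,000-bit packets, that works out to roughly:

  • 10 GbE: 12,000 bits ÷ 10 Gbps = 1.2 µs per packet
  • 25 GbE: 480 ns per packet
  • 40 GbE: 300 ns per packet
  • 100 GbE: 120 ns per packet
  • 200 GbE: 60 ns per packet
  • 400 GbE: 30 ns per packet

At 200GbE and beyond, the budget is on the order of tens of nanoseconds, comparable to just a handful of DRAM accesses.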

So the first thing that enhances a low-latency system is the NIC itself. Now that our system can capture network traffic at extremely high speeds (via the NIC), the next challenge is faster processing. As mentioned before, the receiver is, after all, a general-purpose computer, most likely running a General Purpose Operating System (GPOS).

The general flow of packet processing in an operating system involves several steps. First, raw packets are received at the hardware level, typically by the network interface card (NIC). They are then processed by the kernel's network stack, which handles the various network protocols and ensures the data is correctly formatted and routed. Finally, the processed packets are handed to the end application, which consumes them according to its specific needs.
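To make this conventional path concrete, here is a minimal sketch in C: a plain UDP receiver using the standard sockets API (port 9000 is an arbitrary choice for illustration). Every packet traverses the full kernel stack (interrupt handling, protocol processing, socket buffering) before recvfrom() copies it into the process's buffer:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    /* Conventional kernel path: open a UDP socket and bind it.
     * The kernel's network stack owns every packet until
     * recvfrom() copies it into this process's buffer. */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return 1;

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9000); /* arbitrary example port */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return 1;

    char buf[2048];
    for (;;) {
        /* Each iteration is a syscall: a user/kernel transition
         * plus a copy out of kernel socket buffers -- exactly the
         * overhead that kernel bypass aims to eliminate. */
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
        if (n < 0)
            break;
        /* ... process n bytes ... */
    }
    close(fd);
    return 0;
}
```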

[Figure: Packet processing by a GPOS]


Wait, what? A GPOS?! Don't you know:

A universal solution cannot optimally solve every problem it is applied to.

A GPOS like Windows or Linux offers flexibility, but it isn't optimized for the real-time processing required at such high data rates! Its components (e.g., the network stack) are built to meet the broadest set of requirements, which does not necessarily serve any single need optimally.

Don't you agree? - No.

Okay, then why do you think there are many Linux flavors such as Kali Linux, Ubuntu, Linux Mint, CentOS, and RedHat? Each must serve specific needs that others don't. It's not merely a random choice.

The built-in OS network stack is not optimized for handling exceptionally high-speed data. The question is: can we replace the built-in stack with something more capable? What if we don't want the raw packets processed and formatted by the OS, but instead want the application to handle them directly? This shift of packet processing from the network stack to the end application requires a mechanism that gives the application direct access to the underlying hardware, effectively bypassing the kernel.

Kernel bypass... hmmm, getting somewhere.

[Figure: Bypassing the conventional kernel flow]


Kernel Bypass Approach:

Kernel-bypass networking reduces the overhead of in-kernel network stacks by shifting packet processing to userspace. Depending on the architecture of the kernel-bypass solution, packet I/O is managed by the hardware, the operating system, or directly in userspace. In a typical setup, packets flow directly from the Network Interface Card (NIC) to userspace with minimal intervention from the operating system. Instead of the OS, userspace takes on the responsibility of packet I/O and the remaining aspects of the network stack.

Now that we understand why kernel bypass is necessary, let's explore how it's achieved and examine some off-the-shelf tools. Kernel bypass means acquiring packets directly from the hardware without intervention from the kernel or operating system. However, it's not as straightforward as it may seem. Simply declaring a pointer to the target memory and dereferencing it isn't feasible: the MMU will fault on unauthorized memory access. Additionally, NICs are PCIe devices; their registers become accessible to a process only after the device's PCIe BARs have been mapped into its address space. Therefore, a kernel-privileged driver is still essential to set up this direct path and route traffic straight to the end application.
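To make this concrete, below is a minimal sketch of one common mechanism: Linux's UIO (Userspace I/O) framework, where a small kernel driver exposes a device's registers so a userspace process can mmap() them. The device node /dev/uio0 and the 4 KiB mapping size are assumptions for illustration; real code would read the mapping size from sysfs.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* A UIO driver bound to the device exposes its first BAR as
     * mapping 0 of /dev/uio0 (device path is illustrative). */
    int fd = open("/dev/uio0", O_RDWR);
    if (fd < 0) {
        perror("open /dev/uio0");
        return 1;
    }

    /* mmap() hands userspace a pointer to the device's registers.
     * The MMU now permits these accesses -- exactly the setup that
     * a bare pointer to physical memory could never achieve. */
    size_t bar_size = 4096; /* assumed; the real size is in sysfs */
    volatile uint32_t *regs = mmap(NULL, bar_size,
                                   PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Hypothetical register read at offset 0. */
    printf("reg[0] = 0x%08x\n", regs[0]);

    munmap((void *)regs, bar_size);
    close(fd);
    return 0;
}
```

Frameworks such as DPDK build on these same primitives (UIO or VFIO), adding memory pools, pollers, and userspace NIC drivers on top.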

Typically, hardware vendors (e.g., Xilinx) provide custom software, including drivers and minimal stacks, which establish a direct interface or pipe between the underlying hardware and the target application. Below are some common technologies used for kernel bypass:

  • DPDK (Data Plane Development Kit): Provides a set of libraries and drivers for fast packet processing in user space. It bypasses the kernel network stack to achieve high performance (a minimal receive-loop sketch follows this list).
  • RDMA (Remote Direct Memory Access): Enables direct memory access from one computer to another without involving the operating system. RDMA technologies include InfiniBand, RoCE (RDMA over Converged Ethernet), and iWARP.
  • SPDK (Storage Performance Development Kit): Offers a set of tools and libraries for high-performance, user-space storage applications. It bypasses the kernel to interact directly with NVMe devices.
  • VMA (Voltaire Messaging Accelerator): Optimizes messaging middleware for high-frequency trading applications by bypassing the kernel for network communications.
  • Solarflare/Xilinx OpenOnload: Provides user-level network stack acceleration, allowing applications to bypass the kernel and directly access network interfaces.
  • Netmap: A framework for fast packet I/O in user space. It can bypass the kernel to achieve high-speed network packet processing.
  • XDP (eXpress Data Path): Allows the execution of high-performance packet processing programs at the lowest point in the Linux kernel networking stack, but still involves some kernel interaction.
  • io_uring: Provides asynchronous I/O operations in Linux, aiming to reduce system-call overhead by giving userspace efficient, batched access to I/O operations.
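To give a flavor of what "userspace takes on packet I/O" looks like in practice, here is a minimal, hedged sketch of a DPDK receive loop. Error handling is trimmed, and it assumes a NIC already bound to a DPDK-compatible driver (e.g., vfio-pci), hugepages configured, and EAL arguments passed on the command line:

```c
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define RX_RING_SIZE 1024
#define NUM_MBUFS 8191
#define MBUF_CACHE_SIZE 250
#define BURST_SIZE 32

int main(int argc, char **argv)
{
    /* Initialize DPDK's Environment Abstraction Layer (hugepages,
     * PCI scanning, userspace NIC drivers). */
    if (rte_eal_init(argc, argv) < 0)
        return 1;

    /* A pool of packet buffers shared directly with the NIC. */
    struct rte_mempool *pool = rte_pktmbuf_pool_create(
        "MBUF_POOL", NUM_MBUFS, MBUF_CACHE_SIZE, 0,
        RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    if (pool == NULL)
        return 1;

    uint16_t port = 0; /* first NIC bound to a DPDK driver */
    struct rte_eth_conf conf = {0};
    if (rte_eth_dev_configure(port, 1 /* rx queues */, 0, &conf) != 0 ||
        rte_eth_rx_queue_setup(port, 0, RX_RING_SIZE,
                               rte_eth_dev_socket_id(port),
                               NULL, pool) != 0 ||
        rte_eth_dev_start(port) != 0)
        return 1;

    /* Busy-poll the NIC: no interrupts, no syscalls, no kernel
     * network stack -- packets land straight in userspace memory. */
    for (;;) {
        struct rte_mbuf *bufs[BURST_SIZE];
        uint16_t n = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
        for (uint16_t i = 0; i < n; i++) {
            /* The payload is directly addressable here, e.g. via
             * rte_pktmbuf_mtod(bufs[i], void *). */
            rte_pktmbuf_free(bufs[i]);
        }
    }
    return 0;
}
```

Note the design choice: the core busy-polls rte_eth_rx_burst() instead of sleeping on interrupts, trading a fully occupied CPU core for predictable, nanosecond-scale packet pickup.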


Kernel Bypass: Beyond Networking

Kernel bypass is not limited solely to network traffic or packet processing; it can extend to other types of I/O operations and system resources as well. While the term "kernel bypass" is often associated with networking technologies like RDMA (Remote Direct Memory Access) and DPDK (Data Plane Development Kit), the underlying concept involves bypassing the operating system kernel to access hardware directly from user space. This approach can significantly reduce latency and improve performance in various scenarios beyond networking. Here are some examples where kernel bypass can be applied:

  1. Storage Acceleration: Technologies like SPDK (Storage Performance Development Kit) allow user-level applications to directly access storage devices (such as NVMe SSDs) without going through the kernel's storage stack. This bypass can reduce overhead and improve I/O performance for storage-intensive applications (a lighter-weight io_uring sketch follows this list).
  2. High-Performance Computing (HPC): In addition to networking, RDMA technologies like InfiniBand can be used for fast inter-process communication and shared memory access in HPC clusters, bypassing traditional kernel networking stacks for lower latency and higher throughput.
  3. GPU Computing: Modern GPU frameworks and libraries often allow applications to bypass the CPU and kernel for certain operations, leveraging direct access to GPU resources for computation and data processing.
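Short of a full SPDK setup, a hedged sketch in the same spirit on the storage side uses Linux's io_uring via liburing: requests are queued in rings shared with the kernel and submitted in batches, cutting per-operation syscall overhead. The file path below is illustrative:

```c
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    /* Submission/completion rings shared with the kernel: requests
     * are queued in userspace memory instead of one syscall each. */
    if (io_uring_queue_init(8, &ring, 0) < 0)
        return 1;

    int fd = open("/tmp/example.dat", O_RDONLY); /* illustrative path */
    if (fd < 0)
        return 1;

    char buf[4096];
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_submit(&ring); /* one syscall submits the whole batch */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("read %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```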

Overall, while kernel bypass is frequently discussed in the context of networking for its performance benefits, the concept applies broadly to any scenario where direct, efficient access to hardware resources from user space is advantageous.

Summary:

The default in-kernel middleware services are often inadequate for the needs of extremely low-latency systems and call for more specialized approaches. In many general use cases, there is no need to bypass the default OS path for hardware access. In specialized scenarios such as high-frequency trading (HFT), however, waiting for the OS kernel to schedule work to handle incoming hardware events (e.g., new packets on an Ethernet interface) can introduce unacceptable delays. Such cases require both specialized hardware and specialized software that fully leverages it. Kernel bypass is a well-known approach to this challenge, although it is not yet standardized by industry norms, which can limit its widespread adoption.
