SPSC Queue Part 1: Ditch the Lock, Speed the Queue!
Efficient inter-thread communication is a crucial aspect of high-performance systems, particularly in low-latency environments such as trading systems. One of the most fundamental and widely used concurrency primitives for such communication is the single producer single consumer (SPSC) queue. Despite its simplicity, the design choices made when implementing an SPSC queue can have a profound impact on performance, particularly in terms of latency.
In this article, I compare various implementations of a bounded SPSC queue (also known as a ring buffer), with a primary focus on latency characteristics. This serves as an introduction to lockless queue techniques. If you're already familiar with std::atomic, I recommend checking out Part 2 (coming soon), where I optimize and benchmark a lockless queue against boost::lockfree::spsc_queue.
Instead of discussing general-purpose implementations, I will analyze queues specifically designed for a high-performance environment, where the producer and consumer run on separate physical cores, continuously spinning while waiting for data. This setup eliminates context-switching overhead and ensures that performance differences stem from the queue's design rather than OS-level scheduling artifacts.
API and Benchmark Setup
Before diving into different implementations, let's define the requirements for the queue and the benchmarking methodology.
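As a sketch, the interface might look like the following; the exact signatures (push returning bool, consume taking a callback) are my assumptions, based on the operations discussed below.

```cpp
#include <cstddef>

// Hypothetical interface sketch: the exact signatures are assumptions
// based on the push/consume operations discussed in this article.
template <typename T>
class SpscQueue {
public:
    explicit SpscQueue(std::size_t capacity);  // bounded; capacity fixed up front

    // Producer side: returns false instead of blocking when the queue is full.
    bool push(const T& value);

    // Consumer side: invokes callback on the front element and removes it;
    // returns false when the queue is empty.
    template <typename Callback>
    bool consume(Callback&& callback);
};
```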
SPSC Queue v1: Good ol’ mutex
A simple way to implement a bounded queue is to store the data in a vector and use a mutex for synchronized access to the shared data. The implementation below uses 'head' and 'tail' indices to point to the positions in the vector to be consumed from or pushed to. Since the queue is bounded by a given capacity, the vector can be sized at construction, avoiding allocations during push and consume.
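A minimal sketch of this design, assuming the push/consume interface above (the convention of keeping one slot free to distinguish full from empty is my choice):

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Sketch of the mutex-based queue described above; details such as the
// full/empty convention are assumptions, not the article's exact code.
template <typename T>
class SpscQueueV1 {
public:
    explicit SpscQueueV1(std::size_t capacity)
        : buffer_(capacity + 1) {}  // one slot stays empty to tell full from empty

    bool push(const T& value) {
        std::lock_guard<std::mutex> lock(mutex_);
        const std::size_t next = (tail_ + 1) % buffer_.size();
        if (next == head_) return false;  // full
        buffer_[tail_] = value;
        tail_ = next;
        return true;
    }

    template <typename Callback>
    bool consume(Callback&& callback) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (head_ == tail_) return false;  // empty
        callback(buffer_[head_]);
        head_ = (head_ + 1) % buffer_.size();
        return true;
    }

private:
    std::vector<T> buffer_;  // requires T to be default-constructible
    std::size_t head_ = 0;   // next position to consume from
    std::size_t tail_ = 0;   // next position to push to
    std::mutex mutex_;
};
```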
This queue works pretty well for most systems, but it has a few limitations:
1. T must be default-constructible, since the vector is filled with default-constructed elements when it is sized at construction.
2. The mutex is held for the entire push or consume, so the producer and consumer are fully serialized even though they mostly work on different parts of the buffer.
3. A mutex is blocking: a thread that fails to acquire the lock can be suspended by the OS, which rules out a non-blocking design.
SPSC Queue v1.1: Byte storage
Instead of using a vector to store the elements, we can allocate a byte array of the required size, effectively creating a memory pool. Using placement new, we can construct objects of type T at the appropriate locations within this storage. To implement this, we define an Element class that contains a byte array large enough to hold an instance of T. We then allocate an array of these Element objects at runtime (the buffer). New elements are constructed in place at buffer[tail], and when an element is consumed, it is explicitly destroyed to ensure proper cleanup. I have used std::construct_at and std::destroy_at for placement new/delete.
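A minimal sketch of this design; the Element layout follows the description above, while member names and other details are assumptions:

```cpp
#include <cstddef>
#include <memory>
#include <mutex>

// Sketch of the byte-storage variant described above (requires C++20 for
// std::construct_at); member names and details are assumptions.
template <typename T>
class SpscQueueV1_1 {
    struct Element {
        alignas(T) std::byte storage[sizeof(T)];  // aligned raw storage for one T
    };

public:
    explicit SpscQueueV1_1(std::size_t capacity)
        : capacity_(capacity + 1),
          buffer_(std::make_unique<Element[]>(capacity_)) {}

    ~SpscQueueV1_1() {
        // Destroy any elements still in the queue.
        while (head_ != tail_) {
            std::destroy_at(ptr(head_));
            head_ = (head_ + 1) % capacity_;
        }
    }

    bool push(const T& value) {
        std::lock_guard<std::mutex> lock(mutex_);
        const std::size_t next = (tail_ + 1) % capacity_;
        if (next == head_) return false;  // full
        std::construct_at(ptr(tail_), value);  // placement-new into the pool
        tail_ = next;
        return true;
    }

    template <typename Callback>
    bool consume(Callback&& callback) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (head_ == tail_) return false;  // empty
        callback(*ptr(head_));
        std::destroy_at(ptr(head_));  // explicit cleanup
        head_ = (head_ + 1) % capacity_;
        return true;
    }

private:
    T* ptr(std::size_t i) { return reinterpret_cast<T*>(buffer_[i].storage); }

    std::size_t capacity_;
    std::unique_ptr<Element[]> buffer_;
    std::size_t head_ = 0;
    std::size_t tail_ = 0;
    std::mutex mutex_;
};
```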
In the implementation above, notice the alignas specifier on the Element's byte storage. This is critical when working with a memory pool because the default alignment of byte storage is 1. The storage must instead match T's alignment requirements so that objects of type T can be constructed in it.
By using byte storage, we have eliminated the first limitation: T no longer needs to be default-constructible. It still must be copy-constructible, however, because of the std::construct_at call. We haven't done much about the second and third limitations. Notice that the producer and consumer access different parts of the buffer: the producer always writes at the tail, while the consumer reads from the head. Both threads can therefore access the buffer simultaneously, and a lock is strictly needed only when reading or updating the head and tail indices.
To mitigate the second limitation, we can release the lock after performing the full/empty check, process the element without holding the lock, and then reacquire it to update the head/tail. While this theoretically allows more parallel execution, it incurs the overhead of locking and unlocking twice. In my benchmarks, this approach resulted in worse performance due to mutex locking being a significant source of latency. However, it may perform better in scenarios where the consumer’s callback function is nontrivial and time-consuming.
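Reusing the members of the v1.1 sketch above, the split-lock consume might look like this (a sketch of the idea, not the article's exact code):

```cpp
// Sketch: drop the lock while the callback runs, then reacquire it to
// advance the head (member names follow the v1.1 sketch above).
template <typename Callback>
bool consume(Callback&& callback) {
    std::size_t index;
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (head_ == tail_) return false;  // empty check under the lock
        index = head_;
    }
    // Safe without the lock in SPSC: the producer cannot overwrite this
    // slot until head_ is advanced below.
    callback(*ptr(index));
    std::destroy_at(ptr(index));
    {
        std::lock_guard<std::mutex> lock(mutex_);  // second lock: extra overhead
        head_ = (head_ + 1) % capacity_;
    }
    return true;
}
```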
The third limitation cannot be resolved with mutexes alone. To achieve a non-blocking design, we need alternative synchronization mechanisms — this is where std::atomic comes in. Introduced in C++11, std::atomic provides lock-free atomic variables, ensuring that no other thread observes an intermediate state. If you're unfamiliar with std::atomic, I highly recommend watching Fedor Pikus’ talk from CppCon.
SPSC Queue v2: std::atomic
Atomic operations eliminate the need for locks: threads busy-wait on atomic variables instead of suspending execution. (For an SPSC queue, plain atomic loads and stores of the head and tail indices suffice; more general lock-free structures also rely on compare-and-swap.) A mutex, by contrast, can suspend a blocked thread, introducing context-switching overhead. Given that both threads are actively spinning on dedicated cores, suspending execution is unnecessary. Below is an implementation using std::atomic:
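A minimal sketch, keeping the byte-storage design and making only the head and tail indices atomic; it uses the default sequentially consistent ordering, which Part 2 revisits. Member names are assumptions:

```cpp
#include <atomic>
#include <cstddef>
#include <memory>

// Sketch of the lockless queue described above: head/tail are std::atomic
// with default (sequentially consistent) ordering. Destructor draining
// leftover elements is omitted for brevity.
template <typename T>
class SpscQueueV2 {
    struct Element {
        alignas(T) std::byte storage[sizeof(T)];
    };

public:
    explicit SpscQueueV2(std::size_t capacity)
        : capacity_(capacity + 1),
          buffer_(std::make_unique<Element[]>(capacity_)) {}

    bool push(const T& value) {  // called only by the producer thread
        const std::size_t tail = tail_.load();
        const std::size_t next = (tail + 1) % capacity_;
        if (next == head_.load()) return false;  // full
        std::construct_at(ptr(tail), value);
        tail_.store(next);  // publish the new element
        return true;
    }

    template <typename Callback>
    bool consume(Callback&& callback) {  // called only by the consumer thread
        const std::size_t head = head_.load();
        if (head == tail_.load()) return false;  // empty
        callback(*ptr(head));
        std::destroy_at(ptr(head));
        head_.store((head + 1) % capacity_);  // release the slot to the producer
        return true;
    }

private:
    T* ptr(std::size_t i) { return reinterpret_cast<T*>(buffer_[i].storage); }

    std::size_t capacity_;
    std::unique_ptr<Element[]> buffer_;
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};
```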
Profiling Results
I used nanobench to profile different implementations, running each one 5 times with 100 iterations per run. The code can be found on GitHub.
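For illustration, a simplified single-threaded harness might be driven like this; the actual benchmark (with producer and consumer on separate cores) lives in the repository, and the queue type and payload here are placeholders:

```cpp
#include <nanobench.h>
// Assumes SpscQueueV2 from the sketch above is available.

int main() {
    SpscQueueV2<int> queue(1024);
    ankerl::nanobench::Bench()
        .epochs(5)                // 5 runs
        .minEpochIterations(100)  // 100 iterations per run
        .run("atomic push+consume", [&] {
            queue.push(42);
            queue.consume([](int v) { ankerl::nanobench::doNotOptimizeAway(v); });
        });
}
```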
The results show that the atomic SPSC bounded queue is more than twice as fast as the mutex implementation. I also benchmarked the implementations against Boost's SPSC queue (boost::lockfree::spsc_queue), which likewise uses atomics under the hood. Despite relying on the same lock-free mechanism, the Boost queue significantly outperforms my implementation.
In Part 2, I will show how the SPSC queue can be further optimized by leveraging the memory-ordering options (barriers) of std::atomic, along with a few other improvements. Spoiler alert: in one of my previous posts I briefly discussed false sharing; we'll see it in practice in Part 2!
Thanks for reading! I’d love to hear your thoughts in the comments!