Do not guess. Measure!
Intro
"The bitterness of poor quality remains long after the sweetness of low price is forgotten". Benjamin Franklin
In this article, I walk through the process of testing and fine-tuning two well-known .NET (C#) serialization libraries for an imaginary project inspired by real software solutions I've encountered throughout my career at various companies. While the article covers a range of topics, the main idea is captured in the title, and I hope the key insight will be clear by the end.
The context
This section is somewhat theoretical and aims to set the context and problem, so the motivation behind the following parts will be clearer. Feel free to skip it and jump to the next part if you prefer.
If you've worked with Kafka, you might have encountered topics with retention and topics with compaction. A topic with retention represents a continuously growing stream of events, where each event is self-sufficient and is removed only after a specified retention period. Examples include application diagnostics logs or transaction logs in a payment processing system. Such topics are sometimes called streams. Another type is the topic with compaction, also known as a table. Unlike retention topics, these topics store identifiable objects (entities), and only the latest state of each object matters: all previous states (with the same ID) are eventually compacted away and completely removed. Reading such a topic from start to finish and storing all records in memory gives a result similar to storing all data in a table and executing a "SELECT * FROM {TABLE}" command in SQL.
Creating such "table" topics is very practical when the number of unique possible objects is either strictly limited or stays within a predictable range, allowing them to be fully loaded into application memory with a manageable memory footprint. This schema offers good performance (because all data is stored in app memory and can be accessed with minimal I/O delays) and great scalability (thanks to Kafka and cheap RAM). Examples of such topics include sensor data (IoT), fleet tracking systems (logistics), leaderboards (gaming), configuration management, matches (sports), etc.
However, the common downside to this approach is that when the application starts, it must read the entire topic from the beginning, which will not be immediate. For many applications, this delay is not a problem and can be ignored. For others, there are possible solutions. The application could periodically create snapshots of the objects in memory, serialize them, and store them somewhere so that a new application that just started could load the snapshot and continue working from the most recent checkpoint.
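To make the pattern described above concrete, here is a minimal sketch of loading a compacted topic into an in-memory "table", using the Confluent.Kafka client. The topic name, config values, and the simplistic end-of-topic check are illustrative, not taken from any real project:

using System;
using System.Collections.Generic;
using Confluent.Kafka;

var config = new ConsumerConfig
{
    BootstrapServers = "localhost:9092",        // illustrative
    GroupId = "sport-events-loader",            // illustrative
    AutoOffsetReset = AutoOffsetReset.Earliest  // read the topic from the very beginning
};

var table = new Dictionary<string, byte[]>();   // latest state per entity ID
using var consumer = new ConsumerBuilder<string, byte[]>(config).Build();
consumer.Subscribe("sport-events");             // a compacted topic

while (true)
{
    var result = consumer.Consume(TimeSpan.FromSeconds(5));
    if (result is null) break;                  // simplistic "caught up" check
    if (result.Message.Value is null)
        table.Remove(result.Message.Key);       // tombstone: the entity was deleted
    else
        table[result.Message.Key] = result.Message.Value;
}
// "table" now resembles the result of SELECT * FROM {TABLE} for this topic.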
If you decide to go in this direction, you will need to choose a serialization method so that your data set can be transformed into bytes efficiently for storage. You may already know the capabilities, specifics, and benchmarks of different serialization libraries; in that case, you can confidently choose the exact library to use. If you are not proficient with serializers, the best approach is to test them before building them into your architecture. The worst option, of course, is to pick a library blindly just because it is popular and everyone talks about it. In the next parts, we will test the MsgPack and Protobuf serialization libraries.
Non-functional requirements
I aim to choose an effective serialization method for serializing 50,000 SportEvent objects generated by my faker class from the article Faking real data in .NET with Bogus. But first, let's clarify what I mean by an effective serialization method:
I've decided to compare two well-known libraries: MsgPack and Protobuf. Based on my research, both should be very fast and efficient: the authors of these libraries promise CPU efficiency, low latency, and small output size right out of the box. Memory efficiency in our particular scenario, however, requires additional attention; we will return to this topic in the next part.
How will I measure effectiveness? For latency measurements, I'll use BenchmarkDotNet. For checking CPU and memory efficiency, I'll use the IDE profiling tools. For measuring output size, I'll rely on custom code.
Here are my system specifications:
Let's begin...
Serializers Demo
The code base for this article is in the repo here. Open the solution file BinarySerializersBenchmark.sln.
I'll start with simple demos for both serializers, where I will serialize/deserialize a test dataset using the most basic implementation. These demos ensure that the serializers can work with streams and can serialize my dataset without exceptions. My test dataset consists of 50,000 SportEvent objects.
Reason for using streams:
The simplest imaginable interface for a serializer, probably byte[] Serializer.Serialize<T>(T @object), would not be memory efficient in our case, because the entire serialized output would have to be materialized in RAM before we could store it anywhere. Depending on the output size, we would see bigger or smaller spikes in RAM consumption. Later in this article, we will see that MsgPack “compresses” an initial data set of 50,000 objects into roughly 400 MB of output, which means that each time serialization runs, a ~400 MB spike occurs. Although this is not exactly terrible, I would prefer to avoid it, since issues tend to arise at the least convenient moments in software, and such a spike could cause our app to fail with an OutOfMemoryException.
To address this issue, I want to use an interface that utilizes streams so that my call looks like serializer.Serialize<T>(T @object, Stream output), where I'll use a file stream as the second parameter at runtime. The serializer will write the output to the file in small batches, minimizing RAM usage spikes. Therefore, our chosen libraries need to support streams.
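Sketched as code, the shape we are after is something like this (a hypothetical interface, just to pin down the contract; it is not taken from either library):

using System.IO;

public interface ISnapshotSerializer
{
    // Writes the serialized form of @object directly into the stream in small portions,
    // so the full output never has to exist in memory as one byte[].
    void Serialize<T>(T @object, Stream output);

    T Deserialize<T>(Stream input);
}

At runtime the second argument is a FileStream, e.g. serializer.Serialize(dataSet, File.Create("snapshot.bin")).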
MsgPack
For MsgPack, I'm using the well-known MessagePack package v2.5.192. I'll demonstrate the default serializer along with two embedded compression modes (LZ4Block and LZ4BlockArray) to see if they offer any benefits, such as smaller serialized output. See the demo file here.
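Stripped of the console plumbing, the demo boils down to calls of the following shape (assuming SportEvent is annotated for MessagePack or a contractless resolver is used; variable names are mine, not the repo's):

using System.IO;
using MessagePack;

// Default serializer, writing straight to a file stream.
using (var file = File.Create("events.msgpack"))
{
    MessagePackSerializer.Serialize(file, sportEvents);
}

// The same call with one of the embedded LZ4 compression modes enabled.
var lz4Options = MessagePackSerializerOptions.Standard
    .WithCompression(MessagePackCompression.Lz4BlockArray);
using (var file = File.Create("events.lz4.msgpack"))
{
    MessagePackSerializer.Serialize(file, sportEvents, lz4Options);
}

// Deserialization mirrors it.
using (var file = File.OpenRead("events.lz4.msgpack"))
{
    var restored = MessagePackSerializer.Deserialize<SportEvent[]>(file, lz4Options);
}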
Protobuf
For Protobuf, I'm using protobuf-net v3.2.45. See the file here.
Initially, I started with another library, Google.Protobuf, as suggested in the Protocol Buffers Documentation tutorial. It suggests creating a .proto file with type definitions written in a platform- and language-neutral syntax. Such a file can later be compiled into platform/language-specific type definitions. Simply put, you can compile a .proto file into C# (or another supported language) types using the provided command-line tools. The benefits of this approach are obvious, as it allows the same types to be used across different platforms. However, I found the generated types quite restrictive. This approach feels somewhat outdated, reminding me of the 2010s. I want the flexibility to design types in my application, whether using class, struct, or record, and to decide whether to give my classes methods and logic or keep them as POCOs (Plain Old CLR Objects). With Google.Protobuf, the protocol dictates how types are designed, and to be honest, I find them quite ugly. That's why I decided to use protobuf-net, which can serialize your own types as long as they are properly marked with attributes such as DataContract.
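For illustration, here is roughly what the protobuf-net flavor looks like. The SportEvent below is a simplified stand-in with made-up fields (the real model in the repo is richer), and protobuf-net accepts its own ProtoContract/ProtoMember attributes as well as the standard DataContract ones:

using System;
using System.IO;
using ProtoBuf;

[ProtoContract]
public class SportEvent   // hypothetical, simplified shape
{
    [ProtoMember(1)] public Guid Id { get; set; }
    [ProtoMember(2)] public string? Name { get; set; }
    [ProtoMember(3)] public DateTime StartTimeUtc { get; set; }
}

public static class ProtobufDemo
{
    public static void Run(SportEvent[] sportEvents)
    {
        using var file = File.Create("events.proto.bin");
        Serializer.Serialize(file, sportEvents);   // streams directly into the file
        file.Position = 0;
        var restored = Serializer.Deserialize<SportEvent[]>(file);
    }
}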
Analyzing demo results
In Program.cs, uncomment line 68 and run the demo. You should see output similar to the following:
Test data set of 50000 objects created. Approximate size of test data set in memory as objects is 1300714 kB
------------------------------------
Protobuf.net serializer demo.. Press any key to continue...
Protobuf serialized data size is 381026 kB
Press any key to continue with deserialization...
------------------------------------
MsgPack serializer demo.. Press any key to continue...
MsgPack serialized data size is 437676 kB
Press any key to continue with deserialization...
------------------------------------
MsgPack(with Lz4BlockArray compression) serializer demo.. Press any key to continue...
MsgPack serialized data size is 405235 kB
Press any key to continue with deserialization...
------------------------------------
MsgPack(with Lz4Block compression) serializer demo.. Press any key to continue...
MsgPack serialized data size is 414443 kB
Press any key to continue with deserialization...
We successfully serialized and deserialized the data to and from a file stream without errors. For both serializers, the serialized data is roughly three times smaller than the same objects in memory (about 1,300,714 kB of objects versus roughly 400,000 kB of output). Protobuf performs slightly better than MsgPack: its 381,026 kB output is about 13% smaller than plain MsgPack's 437,676 kB and about 6% smaller than the 405,235 kB produced by MsgPack with LZ4BlockArray.
At this point, we can also observe CPU and memory utilization while the serializers do their job. I’m using Rider’s Monitor for this purpose, but you could also use the Memory Performance Profiler in Visual Studio Community (or higher).
As we can see, Protobuf performs very well. There is almost no memory spike during serialization, and a reasonable memory usage spike during deserialization. CPU usage remains around 10%, which is acceptable.
However, something unexpected happens with MsgPack. Despite serializing into a file stream, I observed a memory spike of about 438 MB! Additionally, it allocates a 419 MB block in the Large Object Heap during deserialization. This is not what I would expect from a serialization library that supports operations with streams. After researching this issue, I found others raising similar concerns and came across an explanation from one of the authors:
Also just FYI you're writing to a stream each time, and while that's perfectly fine, it's not messagepack's most efficient mode (which is to write to an IBufferWriter<T> that recycles memory such as Sequence<T> from Nerdbank.Streams).
MsgPack with compression (LZ4BlockArray or LZ4Block) presents an even worse picture because it allocates more memory and encroaches upon the Large Object Heap area. See the pictures:
Serialization Benchmark
Now that we've demonstrated our serializers, let's proceed with the benchmarks. The benchmarks are performed on a test data set of 50,000 test objects. You can find the benchmark source here; to run it, just uncomment the appropriate line in Program.cs.
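If you don't want to open the repo, the benchmark has roughly this shape (class and method names, the hypothetical SportEventFaker call, and the Stream.Null trick are illustrative, not the repo's exact code):

using System.IO;
using BenchmarkDotNet.Attributes;

[MemoryDiagnoser]
public class SerializeBenchmark
{
    private SportEvent[] _data = null!;

    [GlobalSetup]
    public void Setup() => _data = SportEventFaker.Generate(50_000); // hypothetical faker call

    [Benchmark(Baseline = true)]
    public void ProtobufNet()
    {
        // Writing to Stream.Null measures serialization work, not disk I/O.
        ProtoBuf.Serializer.Serialize(Stream.Null, _data);
    }

    [Benchmark]
    public void MsgPackPlain()
    {
        MessagePack.MessagePackSerializer.Serialize(Stream.Null, _data);
    }
}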
I can see the following result:
Protobuf is slightly faster than MsgPack without compression, while MsgPack's compression adds significant delay without providing a good compression rate compared to plain MsgPack. There is also a significant difference in memory allocation between the two libraries, which confirms our previous findings: the MessagePack library we are using is not memory efficient compared to Protobuf, given the specifics of our test scenario.
Fine-tuning MsgPack
It is already clear that Protobuf-net is a better choice for our scenario. Although it is slightly slower than MsgPack in terms of latency, it excels in other areas, such as smaller output size, better memory management, and at least the same CPU utilization. However, we will not stop here and will attempt to improve our use of MsgPack in hopes of making it competitive with Protobuf.
Naive Implementation
How are we going to achieve this? We will serialize our large data set in smaller parts, which we will call chunks. If the data set consists of 50,000 objects and each chunk contains, say, 250 objects, then it will be serialized in 50,000 / 250 = 200 smaller parts. The tricky part is that each chunk will contain a different number of bytes, so we also need to store this information in the resulting output. The easiest way to achieve this is to write the length of each chunk before the chunk itself. Here is the visualization:
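In text form, the layout looks like this:

[chunk 0 length: 4 bytes][chunk 0 bytes][chunk 1 length: 4 bytes][chunk 1 bytes]...[chunk N length: 4 bytes][chunk N bytes]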
The chunk length is written into a reserved 4-byte (int) segment before its content. The maximum value of an int is 2,147,483,647, so no chunk can exceed about 2,147 MB (roughly 2 GB), which is more than sufficient.
The implementation might look like this (see this file, line 8):
public static void Serialize_Slow<T>(T[] dataArray,
    Stream output,
    int chunkSize = 1_000,
    MessagePackSerializerOptions? serializerOptions = null)
{
    if (dataArray.Length < chunkSize) chunkSize = dataArray.Length;
    serializerOptions ??= MessagePackSerializerOptions.Standard;

    for (var i = 0; i < dataArray.Length; i += chunkSize)
    {
        var remaining = dataArray.Length - i;
        var currentChunkSize = remaining < chunkSize ? remaining : chunkSize;
        var chunk = dataArray.Skip(i).Take(currentChunkSize);
        // Allocates a fresh byte[] for every chunk.
        var chunkBytes = MessagePackSerializer.Serialize(chunk, serializerOptions);
        // BitConverter.GetBytes allocates another small array for the 4-byte length prefix.
        output.Write(BitConverter.GetBytes(chunkBytes.Length));
        output.Write(chunkBytes);
    }
    output.Flush();
}
The implementation is straightforward and simple. We serialize the data set in smaller chunks, whose size can be set via the chunkSize parameter with a default value of 1,000. However, this implementation is not optimized, and the main reason lies in lines 15 and 17 of the file (the MessagePackSerializer.Serialize and BitConverter.GetBytes calls). Each time these lines execute, they allocate memory for a new array to hold the serialized bytes. Apart from the constant memory allocations, each of these arrays must be garbage-collected shortly after being used, which puts additional, unnecessary pressure on the garbage collector. But how can we improve this?
Improved implementation
The common solution to this problem is to use the Span and Memory types.
For those who don't know, Span<T> and Memory<T> represent contiguous regions of memory with a set of useful methods, the most famous of which is Slice. This method lets you work with arbitrary regions inside the span as if those regions were completely independent. These types are at the core of various optimizations within .NET itself and in many company code bases, because they avoid unnecessary memory allocations: instead of allocating a new memory region in the heap each time, they allow reusing already allocated memory by slicing off a piece of the needed length.
The difference between the two is that Span<T> is a ref struct that can live only on the stack (although it can point to memory on the stack, the heap, or even native memory), while Memory<T> is an ordinary struct that can itself be stored on the heap.
This is my definition of Span and Memory. For more precise and exhaustive information, I recommend checking this article.
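A tiny illustration of the idea: one rented buffer, sliced into independent regions instead of allocating a new array for each of them (the sizes here are arbitrary):

using System;
using System.Buffers;

var buffer = ArrayPool<byte>.Shared.Rent(1024);
try
{
    Span<byte> span = buffer.AsSpan();
    Span<byte> lengthPrefix = span.Slice(0, 4);  // the first 4 bytes
    Span<byte> payload = span.Slice(4, 100);     // the next 100 bytes; no copy is made
    BitConverter.TryWriteBytes(lengthPrefix, payload.Length);
}
finally
{
    ArrayPool<byte>.Shared.Return(buffer);
}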
Although MsgPack doesn't provide overloads that accept Span or Memory directly, it does offer an overload that takes an IBufferWriter<byte>. This is acceptable for our purposes because, internally, MessagePack can obtain a Span or Memory from the IBufferWriter while serializing data.
By the way, the first point in the Performance section of the MsgPack README states: “The serializer uses IBufferWriter<byte> rather than System.IO.Stream to reduce memory overhead.” See this link for more details. So, it looks like we are on the right path.
Our improved implementation is listed below; you can find it in this file, line 31:
public static void Serialize<T>(T[] dataArray,
    Stream output,
    int chunkSize = 1_000,
    MessagePackSerializerOptions? serializerOptions = null)
{
    if (dataArray.Length < chunkSize) chunkSize = dataArray.Length;
    serializerOptions ??= MessagePackSerializerOptions.Standard;

    // One reusable buffer for all chunks; it grows as needed and is reset between chunks.
    var chunkBuffer = new ArrayBufferWriter<byte>(256_000);
    // A stack-allocated 4-byte scratch area for the length prefix; no heap allocation at all.
    Span<byte> chunkLengthBuffer = stackalloc byte[sizeof(int)];
    var dataArrayAsMemory = dataArray.AsMemory();

    for (var i = 0; i < dataArrayAsMemory.Length; i += chunkSize)
    {
        var remaining = dataArrayAsMemory.Length - i;
        var currentChunkSize = remaining < chunkSize ? remaining : chunkSize;
        // Slice produces a view over the original array; nothing is copied.
        var chunk = dataArrayAsMemory.Slice(i, currentChunkSize);
        MessagePackSerializer.Serialize(chunkBuffer, chunk, serializerOptions);
        // TODO: handle negative case
        BitConverter.TryWriteBytes(chunkLengthBuffer, chunkBuffer.WrittenCount);
        output.Write(chunkLengthBuffer);
        output.Write(chunkBuffer.WrittenSpan);
        chunkBuffer.ResetWrittenCount();
    }
    output.Flush();
}
There are several important details about the new implementation worth mentioning:
The improved method becomes more efficient as more chunks are prepared during serialization. I’ve performed a quick benchmark on a test data size of 10,000 items with a chunk size of 250 items. Under these conditions, the improved method allocates about 40% less memory and finishes with 30% less latency. The more chunks there are, the more significant the difference between the two methods becomes.
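For reference, calling the improved method is no different from calling the naive one; something like this (the class name, file path, and options here are illustrative):

var lz4Options = MessagePackSerializerOptions.Standard
    .WithCompression(MessagePackCompression.Lz4BlockArray);

using var file = File.Create("snapshot.msgpack");
ChunkedMsgPackSerializer.Serialize(sportEvents, file, chunkSize: 250, lz4Options);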
Improved MsgPack Serializer Demo
Now, let's run the demos for the new serialization methods; to do this, uncomment line 69 in Program.cs. The memory usage charts now look fine for both the plain MsgPack serializer and the versions with compression:
I hope you can identify the points where serialization and deserialization start on the charts this time (hint: check the CPU utilization). As we can see, memory consumption during serialization has dramatically improved for both the plain MsgPack serializer and the versions with compression. We no longer observe the significant spikes seen previously.
Serialization Benchmark with improved MsgPack
Let's also run the benchmark with the improved MsgPack serializer, see this file:
As we can see, this is quite an improvement for MsgPack. We have significantly enhanced the memory usage of MsgPack serialization and even improved its latency: it is now almost 30% faster than Protobuf-net.
Deserialization Benchmark
Let's also run the deserialization benchmark, see this file. You can enable it by uncommenting the appropriate line in Program.cs.
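As a reminder of what the chunked reader has to do: read a 4-byte length prefix, then read exactly that many bytes, and repeat until the stream ends. Here is my sketch of that idea (the repo file contains the actual implementation; the class name is made up, and ReadExactly requires .NET 7+):

using System;
using System.Collections.Generic;
using System.IO;
using MessagePack;

public static class ChunkedReaderSketch
{
    public static List<T> Deserialize<T>(Stream input,
        MessagePackSerializerOptions? serializerOptions = null)
    {
        serializerOptions ??= MessagePackSerializerOptions.Standard;
        var result = new List<T>();
        var lengthBuffer = new byte[sizeof(int)];

        // Simplification: assumes each 4-byte prefix arrives in a single Read call.
        while (input.Read(lengthBuffer, 0, lengthBuffer.Length) == lengthBuffer.Length)
        {
            var chunkLength = BitConverter.ToInt32(lengthBuffer, 0);
            var chunkBytes = new byte[chunkLength];
            input.ReadExactly(chunkBytes); // loop over Read on older runtimes
            result.AddRange(MessagePackSerializer.Deserialize<T[]>(chunkBytes, serializerOptions));
        }
        return result;
    }
}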
We can see that both serializers handle this task well, with more or less similar latency. What is interesting, however, is that our improved MsgPack serializer now beats Protobuf-net in both memory allocation and latency.
Conclusion
If this were a real decision-making process and I were free to choose between MsgPack and Protobuf, I would definitely choose Protobuf-net. It meets my non-functional requirements well and is ready to use with minimal tweaking right out of the box. MessagePack-CSharp can also be used, but it delivers slightly worse results and requires proper tweaking, as we saw in our example.
Another point to consider is that we are very close to not meeting the constraints set in the non-functional requirements section. Our CPU utilization during serialization/deserialization is around 10% on average, and the deserialization latency is nearly 5 seconds. If I had more time for this investigation, I would also examine FlatBuffers and ZeroFormatter. They promise exceptionally fast deserialization, which could potentially address our latency concerns. Perhaps I will test them in one of my future posts.
Regarding MsgPack, I have used it extensively in successful real-world software. It has proven to be effective, and I am confident I will use it in the future. My use cases simply indicate that, for my particular scenario, there are better alternatives. Without testing, I might have assumed MsgPack was the best and most efficient serialization library, especially after observing the well-known performance charts in the MsgPack README. While it excels in certain scenarios, there are better alternatives in others.
The main takeaway from this article is the importance of testing libraries and tools in close-to-real-world scenarios before including them in your codebase. It is easy to choose a technology based on promotional articles highlighting all its benefits. However, we all know the silver bullet is a myth, and switching from one technology to another after realizing it is not optimal for your specific case is challenging. I understand we are often constrained by deadlines, but the outcome of such investigation and prototyping can be invaluable. Whenever possible, it is better to conduct thorough testing. The more such tests you perform, the more proficient you will become in understanding use cases, enabling you to make more informed decisions in the future.