Reed-Solomon (RS) vs. Fountain Codes in Distributed Storage: A High-Level Overview.
It is quite typical in the literature to compare replication and erasure coding in terms of read/write latency, throughput, and data durability. It is less common, and more challenging, to compare different erasure coding schemes along the same dimensions. This is usually because the construction techniques of the various codes differ noticeably, which makes them hard to compare; in addition, some correction codes are too close in structure for any performance difference to show up in a distributed system. This brief lays out the pros and cons of two coding schemes with respect to the resources deemed valuable in a typical distributed storage setting. The schemes compared, Reed-Solomon and Fountain-type codes, differ in construction complexity, code properties, and implementation style, and thus constitute an interesting subject to investigate, particularly in a distributed storage setting. Both codes are linear, yet each has distinct features that make it stand out, which makes a comparison based on application requirements possible to a large extent.
Linear codes typically provide erasure correction and fault tolerance by dividing the source data block into k distinct, equal-size partitions and creating extra redundant partitions of the same size, for a total of n partitions (or by generating all n partitions from the data partitions directly), where n is called the block length of the code. This operation is known as “encoding”; although this is a very crude definition, it will serve the purpose of this article. Each partition is allocated to a separate cluster node, and only a subset of the partitions is sufficient to reproduce the whole data block through a computation called “decoding”. Mapping data partitions to encoded partitions does not necessarily make the data block appear verbatim at the encoder output; a code without this property is called non-systematic. Using the simple pigeonhole principle, it is easy to verify that the maximum number of erased partitions that can be corrected is at most n − k; when equality holds, the code is called optimal in terms of storage capacity.
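The encode/decode idea above can be sketched with the simplest possible linear code: k data partitions plus a single XOR parity (n = k + 1). This is not Reed-Solomon, and the function names are illustrative only; it merely demonstrates that a subset of partitions suffices, here tolerating n − k = 1 erasure.

```python
# Minimal sketch of linear erasure coding with one XOR parity (n = k + 1).
# Illustrative only -- not Reed-Solomon, and tolerates just one erasure.

def encode(data: bytes, k: int):
    """Split data into k equal partitions and append one XOR parity."""
    assert len(data) % k == 0, "pad the data to a multiple of k first"
    size = len(data) // k
    parts = [bytearray(data[i * size:(i + 1) * size]) for i in range(k)]
    parity = bytearray(size)
    for p in parts:
        for i, b in enumerate(p):
            parity[i] ^= b
    return [bytes(p) for p in parts] + [bytes(parity)]

def decode(parts, k):
    """Recover the data block from n = k + 1 partitions with at most one
    erasure (an erased partition is represented by None)."""
    missing = [i for i, p in enumerate(parts) if p is None]
    assert len(missing) <= 1, "a single-parity code corrects one erasure"
    parts = list(parts)
    if missing:
        size = len(next(p for p in parts if p is not None))
        rec = bytearray(size)
        for p in parts:
            if p is not None:
                for i, b in enumerate(p):
                    rec[i] ^= b
        parts[missing[0]] = bytes(rec)
    return b"".join(parts[:k])

# Any single lost partition -- data or parity -- can be regenerated:
parts = encode(b"abcdefgh", k=4)
parts[2] = None                      # simulate one node failure
assert decode(parts, k=4) == b"abcdefgh"
```

A real space-optimal code such as RS generalizes this idea with Galois Field arithmetic so that any k of the n partitions suffice, rather than requiring all but one.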
In a distributed storage scenario, space efficiency is not the only objective data center managers try to optimize. In fact, space-optimal codes usually lead to large latencies due to heavy computation and long background jobs for system maintenance. There are three main workloads the nodes of a storage network have to manage. The most trivial one is serving direct read/write requests. The second is degraded reads/writes, which happen when, for instance, a direct read/write cannot complete due to node failure or unavailability; in that case, requests are forwarded to appropriate locations in the network and served with increased response latency. The last workload consists of system-level autonomous read/write requests generated by repair processes. To handle the last two workloads, coding schemes should be designed to minimize the data transferred between the cluster nodes that communicate to recreate repair-induced data at appropriate network locations, so that the required reliability is maintained throughout the lifespan of the storage system. It is the last two workloads that usually keep storage nodes busy and hence contribute adversely to the end-to-end user performance experience.
Reed-Solomon (RS) codes are among the best-known space-optimal codes, heavily investigated and extensively used in industry. However, RS codes come with complex encoding and decoding algorithms that rely on complicated Galois Field operations. Many studies and open-source implementations have aimed to improve RS coding performance. The most notable is the Jerasure 2.0 implementation, which exploits modern parallel CPU architectures at the Galois Field arithmetic level to improve the execution speed of RS encoding/decoding, with many construction methods and parameter selections. OpenMP, pthreads, and GPU implementations of Jerasure followed after its inception in 2007. Jerasure is currently supported by Red Hat and has been successfully integrated into open-source distributed storage systems such as Ceph and OpenStack Swift. Its main attractions are impressive encoding/decoding speeds and optimal storage space consumption. Another well-known implementation, supported by Intel, is the ISA-L library; although it does not offer the same design space as Jerasure, its support for and optimization around modern Intel SIMD instructions make it quite an attractive option.
One of the parameters of an RS code is the Galois Field size. Implementations with reasonable execution speeds use field sizes of w = 4 bits (Microsoft Azure) and w = 8 bits (Cleversafe, Facebook). However, in RS codes, the choice of field size constrains the maximum attainable block length; for example, w = 4 limits the block length to 17. With a short block length, the coded user data can span more than one codeword, and under uniformly random error scenarios the system will operate at lower reliability than one that encodes the data into a single long codeword, assuming both codes are optimal in terms of space efficiency. This is because in a multi-codeword scenario one particular codeword can receive more errors than the rest and fail even while the other codewords decode successfully. To compensate for the reduced reliability in such cases, extra redundancy is typically added; this is called the interleaving overhead.
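Some back-of-the-envelope arithmetic shows how the field-size cap forces interleaving. The sketch below assumes, per the limit stated above, that a w-bit field caps the RS block length at 2^w + 1 (17 for w = 4); the object and symbol sizes are illustrative, not taken from any specific product.

```python
# How field size caps RS block length and forces interleaving.
# Assumes max block length = 2**w + 1, as stated in the text (17 for w = 4).

def num_codewords(object_bytes: int, symbol_bytes: int, w: int, k: int) -> int:
    """How many separate RS codewords are needed to cover one object when
    each codeword carries k data symbols of symbol_bytes each."""
    max_n = 2 ** w + 1               # field-size cap on the block length
    assert k < max_n, "k must leave room for at least one parity symbol"
    return -(-object_bytes // (k * symbol_bytes))   # ceiling division

# A 1 MiB object with 4 KiB symbols over GF(2^4) and k = 12 data symbols
# must be split across many codewords, each of which fails independently:
print(num_codewords(1 << 20, 4096, w=4, k=12))   # -> 22
```

Each of those 22 codewords is an independent failure unit, which is exactly why extra redundancy (interleaving overhead) is needed to match the reliability of a single long codeword.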
In distributed systems, partitions are sent to different failure domains, i.e., to storage nodes whose failure probabilities are fairly independent of each other. This way, the system maintains the required reliability through controlled redundancy: the more independent nodes we geo-spread the encoded data across, the more reliably information is stored. If the block length is small, the data cannot be distributed widely enough to cover multi-failure cases efficiently. In addition, nodes contain disks (HDDs, SSDs, virtual partitions, etc.) that can be victims of various partial failures, such as latent sector errors, silent data corruptions, and retention errors that appear unexpectedly. Placing multiple partitions on different sectors/sections of a disk can save the system from whole-disk repairs and replacements. Such small-scale error events are observed to dominate disk-based data centers and cause a lot of investment to go to waste. Therefore, having a long block length and placing multiple partitions on the same disk device can help deal with partial problems that might otherwise lead to whole-disk failures and repairs, and thereby to inefficient read/write operations and increased wear. Another disadvantage of a short block length is that data is written to and read from only a small number of nodes. Since these operations are asynchronous and parallel, the final I/O and throughput of reads/writes are constrained by the performance of the drives and by the block length, because it is the aggregate I/O and throughput performance that affects the storage cluster the most; a small block length can cause these to drop by a large margin and lead to increased latencies within the storage system.
Figure 1. Comparison of computation connections in the bipartite representation of RS and Fountain Codes.
On the other hand, short block lengths keep RS complexity manageable, because the complexity grows roughly quadratically with block length (somewhat less with the most advanced techniques) and exponentially with field size. This partially explains why major industry players chose small field sizes in their products; moreover, for these small sizes there are clever implementation tricks that boost overall performance. Therefore, although it is desirable to minimize interleaving overhead by using long block lengths, implementing such codes over large finite fields might be prohibitive, leading to unacceptable latencies for data migration and retrieval requests. For instance, suppose we have data that requires 2048 codewords when the code block length is 16. We could have used a block length of 32768 instead to drive the interleaving overhead to zero; however, the price would be increased operational complexity and a potential retrieval delay.
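The arithmetic behind that example can be made concrete. Assuming, as a rough ballpark, that decoding cost grows quadratically in block length (classic RS behavior, per the paragraph above), the monolithic codeword costs orders of magnitude more:

```python
# Rough cost comparison, assuming decoding cost ~ n^2 per codeword
# (the quadratic growth mentioned above; exact constants vary by decoder).

short_n, num_short = 16, 2048
long_n = short_n * num_short            # 2048 codewords of length 16 = 32768

cost_short = num_short * short_n ** 2   # total cost of the interleaved scheme
cost_long = long_n ** 2                 # cost of one monolithic codeword

print(long_n)                           # -> 32768
print(cost_long // cost_short)          # -> 2048 (times more work)
```

So zero interleaving overhead is bought at roughly a 2048x increase in decoding work under this quadratic model, which is the complexity/overhead trade-off the paragraph describes.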
Fountain codes were invented many years ago, yet they became practical only recently with the introduction of LT, online, and Raptor codes. Their operating principles are based solely on simple XOR logic applied over large data regions, well suited to parallel hardware. Fountain codes are good exactly where RS codes fall short. For instance, the complexity problem of RS codes mentioned earlier does not apply to fountain codes, because the complexity of fountain coding operations can be made to scale linearly with block length (as in Raptor codes). Hence, fountain codes make large block lengths possible (theoretically without bound) and address the disadvantages of small block lengths mentioned earlier, such as reduced I/O and aggregate throughput.
The working principles of fountain codes are much simpler, and unlike RS codes, each encoded output typically depends on only a subset of the data partitions. This is illustrated in Figure 1 for a systematic RS code and a non-systematic fountain code. The edges in the figure establish the computational relationship between data and parity partitions: each pink (parity) partition is computed by XOR-ing the data partitions connected to it. This partial data dependency is desirable because it leads to local groups within the code, so a single data partition can be regenerated within its local group, which reduces the bandwidth required for the occasional repair operation. Of course, giving the code such desirable properties for a distributed storage setting has a side effect: the coding is non-optimal in terms of space efficiency. In other words, fountain codes consume more storage space than an optimal code (RS) would to store information at a given reliability level; this is known as the coding overhead. For instance, in the example of Figure 1, the RS code can tolerate any three partition erasures (failures), whereas the fountain code cannot. For the fountain code to have the same failure resilience, it needs more coded (parity) symbols or partitions. Thus, RS codes have zero coding overhead, whereas fountain codes have non-zero coding overhead. Although at finite block lengths the coding overhead of fountain codes is non-zero, it shrinks as the block length grows, so the space efficiency approaches the optimum. Trading space efficiency for I/O and throughput efficiency and reduced interleaving overhead, fountain codes might be quite beneficial for distributed storage architectures. A closely related paper on the use of fountain codes for large-scale cloud storage is Liquid Cloud Storage.
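The subset-XOR construction can be sketched as a toy LT-style encoder. The uniform degree distribution below is an assumption made for brevity; practical LT/Raptor codes use carefully designed distributions (e.g., the robust soliton), and the function names are hypothetical.

```python
import random

# Toy LT-style fountain encoding: each coded partition XORs a randomly
# chosen subset ("neighbors") of the k data partitions, as in Figure 1.
# Uniform degree choice is a simplification; real codes use designed
# distributions such as the robust soliton.

def lt_encode_symbol(parts, rng):
    k = len(parts)
    degree = rng.randint(1, k)                     # toy degree choice
    neighbors = sorted(rng.sample(range(k), degree))
    out = bytearray(len(parts[0]))
    for j in neighbors:
        for i, b in enumerate(parts[j]):
            out[i] ^= b
    return neighbors, bytes(out)

rng = random.Random(1)
data_parts = [b"ab", b"cd", b"ef", b"gh"]
neighbors, coded = lt_encode_symbol(data_parts, rng)
# 'coded' depends only on its 'neighbors' subset -- the local-group
# property that makes single-partition repair cheaper than with an
# all-partitions RS parity.
```

Because each coded symbol records which neighbors it XORed, regenerating one lost data partition only touches the partitions in that symbol's local group.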
It is quite common for virtualization environments and certain applications, such as bank transactions, to use a high-performance storage backend designed around SSDs as a large cache. It is also known that when SSDs serve write requests, especially as they approach capacity, write-induced operations such as garbage collection cause incoming requests to experience long waiting times and hence reduced write rates. The same is observed with mixed workloads, such as repeated reads after writes. Although reads do not trigger such write-induced latencies themselves, when they arrive interleaved with writes their completion is also affected (due to the write queue). One workaround is to separate reads and writes by allowing a subset of SSDs on different nodes to serve only reads or only writes and periodically switch roles. Since some reads and writes must be delayed by this scheduling, a faster caching device such as DRAM must be used to hold delayed writes. In such a distributed setting, redundancy can be used to separate reads and writes, and the more redundancy there is, the better the read performance of the SSD cache layer becomes. However, the computational workload of the coding must remain reasonably low, as SSDs are meant to be the non-volatile, fast-responding cache layer of modern distributed storage architectures. Here fountain codes excel, allowing excellent read performance for applications that demand it, such as the HPC market, at the expense of more redundancy and lower storage efficiency.
In today’s storage clusters, nodes are composed of various types of storage devices such as disks, SSDs, and sometimes even tapes (yes, they are still around and will be around; that requires yet another article if you are not convinced). As these devices operate under heavy workloads, they wear out, and the data within them becomes increasingly vulnerable to loss over time. One way to address this is to replace disks periodically, even before they fail, as a precaution. However, it is hard (and usually time-costly) to adapt the hardware to changing environmental conditions, aging, and loss of reliability. An alternative is to change the erasure code policy and generate more redundancy, adapting to the decreased reliability. The redundancy generated by RS codes cannot easily be extended with additional redundancy: to increase the strength of protection with RS, we need to re-encode the whole data block and replace the parity partitions with new ones, which leads to unacceptable network traffic and I/O due to erase and rewrite cycles. Fountain-like codes, on the other hand, allow simple modes of operation that generate only the additional redundancy without changing the already existing redundancy. This saves time and network resources, lets the system adapt to changing hardware conditions instantly, improves maintenance, and helps reach maximum durability with the existing infrastructure in time. Such an adaptive nature reduces the so-called update complexity of the error correction code.
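The update-complexity difference can be sketched as follows. Because a fountain code is rateless, extra redundancy is just more coded symbols drawn from the same random process, and nothing already stored is touched; the helper names below are hypothetical, not from any particular library.

```python
import random

# Sketch of rateless redundancy adaptation: new parity partitions are
# simply appended; existing ones are never re-encoded or rewritten.

def xor_subset(parts, idxs):
    """XOR the data partitions at the given indices into one coded symbol."""
    out = bytearray(len(parts[0]))
    for j in idxs:
        for i, b in enumerate(parts[j]):
            out[i] ^= b
    return bytes(out)

def more_parity(data_parts, extra, rng):
    """Generate `extra` new coded partitions, each XOR-ing a random subset."""
    k = len(data_parts)
    return [xor_subset(data_parts, rng.sample(range(k), rng.randint(1, k)))
            for _ in range(extra)]

rng = random.Random(7)
data = [b"ab", b"cd", b"ef", b"gh"]
stored = more_parity(data, 3, rng)     # initial parity partitions
stored += more_parity(data, 2, rng)    # later adaptation: append two more;
assert len(stored) == 5                # the first three are reused unchanged
```

With RS, by contrast, strengthening protection means picking a new (n, k) and re-encoding the whole block, incurring the erase/rewrite traffic described above.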
Finally, it is worth mentioning the value of an erasure code’s mathematical structure in allowing distributed computation (this touches on coded computation as well). This is becoming more serious as today’s cloud infrastructures heavily use heterogeneous storage devices for virtualized environments, and applications (particularly machine learning algorithms, data analytics, etc.) require more stable and consistent latencies when they access a source. In RS codes, since every single data partition must play a part in each redundant (parity) partition computation, a distributed system either has one node compute the parity and send it, or has the nodes collaborate by sending their data to where the parity is computed. The latter is quite costly because it involves sending the whole user data over the network, so the former is typically used to reduce bandwidth consumption. However, that approach creates an unfair distribution of the erasure coding computation across the network. Since it is customary to have computational power at different network nodes, it is desirable to distribute the computation as well. Fountain codes may excel here, because coded partitions do not in general depend on all the data partitions, only a subset; thus, with a small increase in bandwidth, the computation can be distributed across the network. Our example case is encoding, but the same idea extends to decoding, system repairs, and update operations as well.
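A quick bandwidth sketch illustrates the point. Computing one parity partition of size S bytes with RS requires all k data partitions to reach the computing node, whereas a degree-d fountain parity needs only its d neighbors; the numbers below are illustrative assumptions, not measurements.

```python
# Bandwidth needed to compute one parity partition at a remote node.
# RS: all k data partitions travel. Fountain: only the d neighbors travel.

def parity_bandwidth(k: int, d: int, part_size: int):
    rs = k * part_size        # every data partition sent to one node
    fountain = d * part_size  # only the degree-d neighbor subset sent
    return rs, fountain

print(parity_bandwidth(k=16, d=4, part_size=4096))   # -> (65536, 16384)
```

Because each parity's neighbor set is small and different, these computations can also be assigned to different nodes, spreading the encoding work across the network instead of concentrating it at one place.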
There has been a recent attempt to create an open-source fountain-code-based software library named Founsure 1.0. This library provides decent correction against node losses or unavailability as well as good repair and update features, at the expense of increased coding overhead (inefficient space consumption). On the other hand, Qualcomm Inc. owns a space-efficient (very close to optimal) fountain code named RaptorQ, optimized for multimedia transmission in broadcast and multicast scenarios. Although its complexity is linear, the proprietary RaptorQ employs many techniques to reduce coding overhead. This makes the implementation quite complex (linear time, but with a large constant), and its performance is observed to be lower than Jerasure’s with appropriate selections of coding parameters. In addition, minimizing coding overhead might not be the only goal of a distributed storage application, which may require critical changes to the existing implementation. In my humble opinion, Jerasure being open source and accessible to research groups has played a large part in the performance gap between these two libraries, since various researchers have long been working on improvements to the Jerasure library (despite the legal issues raised by StreamScale Inc.). The other proprietary fountain code implementation is used in the Lattus product of Quantum Corp., in which online codes are used, as claimed by Amplidata’s patent documents. It is projected that this implementation uses more coding overhead to trade off against the other important resources of a distributed storage system, as suggested in this brief.
All in all, RS and fountain codes have their own pros and cons and should be thought of as complementary rather than competing in many applications. They each have their own use cases and can find unique places in distributed system applications, even in a hybrid fashion where the system adaptively switches from one to the other depending on the circumstances. Above all, when these codes are applied to distributed storage, they have to be carefully integrated into the existing framework so that the storage performance fully benefits from the advantages of each coding scheme proven useful in research papers. One still needs to keep in mind that ideal models (stationarity, etc.) and environmental conditions may hold in papers; in my humble opinion, the real world and its implementations are a whole other realm to tackle.
Director of Advance Development, LTO at Quantum
An excellent overview of two different erasure codes with totally different architectures. Fountain codes offer simplicity yet powerful adaptability, where one can add additional parities without affecting the original codeword data, at the expense of extra storage overhead, whereas RS is fixed and one must recompute the entire codeword if the application demands additional parity later on. Another potential advantage of fountain codes is that they can be used to compute very long codewords, whereas RS is typically limited to sub-20 symbols per codeword when implemented in current software, even with Intel's ISA-L library, due to execution speeds. For instance, a simple 20/4 policy with RS has 4 parities in total. A similar policy with fountain codes, assuming 10% extra coding overhead, can be built using a 22000/16000 code, and this can run at speeds comparable to RS with a 20/4 policy. Disregarding the 2000 symbols of extra coding overhead, the fountain code offers 4000 parities, which is proportionally equal to the RS code's 4 parities. Yes, fountain codes require extra overhead, but this is where code research can be applied to reduce the cost. Some of the potential benefits of very long codewords are reduced metadata, the ability to represent files of different sizes using a simpler and smaller number of codewords, and the ability to correct more random errors than a shorter RS architecture. Using the same 20/4 example: if one loses a disk, the RS code loses one symbol, whereas the fountain-type code with the 22000/16000 policy loses 1100 symbols to the same HDD. The RS code then has only 1 symbol per remaining HDD to use against errors such as torn writes, silent errors, BER errors, misdirected writes, etc., but the comparable fountain code has 1100 symbols per HDD after losing an entire HDD.
Therefore, if errors are localized to each disk, the robustness of fountain codes due to very long codewords is very intriguing, even at higher coding overhead. The ability to use very long codewords provides additional robustness and improved data durability with fountain-based codes, especially for hyperscale data center applications. For instance, with a 20/4 policy RS is forced to spread its symbols across 20 disks, but with a comparable 22000/16000 fountain code, as discussed in this example, one can use more disks, and data durability increases substantially due to the availability of more parities without increasing the storage overhead. Having said all this, unlike Reed-Solomon (RS) codes, fountain-based code designs are complex, and one needs to determine whether a particular implementation has any potential error-rate floors.