Cloud Native ABC - C as in CAP Theorem
Cloud native software is highly distributed, must operate in a constantly changing environment and is itself constantly changing.
Cornelia Davis. Cloud Native Patterns. Manning Publications Co.
In this series, I will be looking at various concepts and tools relating to the term "cloud native". Each article is planned as a small mosaic piece; together, they gradually form a colorful overall picture. In the last part, I described the relevance of deployment strategies within the context of cloud native system landscapes, focusing on Blue-Green deployments (to do justice to the alphabetical nature of the series' title 😉). In this third part, I would like to look at the CAP theorem and its significance for distributed systems.
What is the CAP theorem and what does it say?
The CAP theorem (which isn't related to SAP CAP) was formulated in 2000 as a conjecture by computer scientist Eric Brewer and attained the status of a theorem in 2002, when Seth Gilbert and Nancy Lynch published a formal proof. More exciting than the theorem's historical origins, however, is its core statement:
A distributed system can fulfill at most two of the following three properties, but never all three at the same time:

- Consistency: every read returns the most recent write (or an error).
- Availability: every request receives a (non-error) response, even if it does not reflect the most recent write.
- Partition tolerance: the system continues to operate even when network communication between its nodes is interrupted.
What is the significance of the CAP theorem for cloud native architectures?
As we already saw in the first part of this series on asynchronous communication, network-based communication between the distributed nodes of a system plays an important role, as does the need for individual components to be robust against failures. Partition tolerance is therefore a property that a distributed system in the cloud must fulfill in any case. According to the CAP theorem, the consequence is that such a system can guarantee either high availability or strong consistency, but not both, so one of the two properties must be chosen when designing the system. In the second part of the series, on deployment strategies, we also saw that high availability is an extremely important characteristic and that even comparatively short downtimes can result in significant financial losses. Since partition tolerance and high availability are thus indispensable properties of a cloud-based application, it inevitably follows that consistency in the strict sense must be "sacrificed".
Eventual Consistency
How can this be implemented in practice? Let's look at a simple example. An online webshop is implemented as a distributed architecture consisting of two components: the webshop itself, which serves as the customer-facing interface, and a stock service that manages articles and their stock levels, backed by a database. This system undoubtedly fulfills the consistency requirement: whenever a user browses articles in the webshop, a request is sent to the stock service asking which articles are available and in what quantity. If no error occurs, the response always reflects the current state of the stock service's database. But what happens if communication between the webshop and the stock service is temporarily interrupted? The system cannot cope well with this interruption, i.e. its partition tolerance is weak. The webshop may still be highly available in the sense that the user can reach the UI, but it will likely display a message that article information is currently unavailable and no purchases can be made, which is hardly pleasant for customers.
Redundant data storage is the method of choice for increasing partition tolerance in this case. The webshop also gets its own database, into which stock data from the stock service's database is synchronized. As a rule, it does not make sense to keep a complete copy of the entire data set; instead, only a subset of the most important data is synchronized, enough to guarantee at least partial, most relevant functionality of the system during a temporary communication outage, even if not full functionality. In our example, the webshop retrieves current article and stock data from the stock service whenever users access it and stores this snapshot in its own database. If the stock service is unavailable, the webshop falls back to the data in its own database. The user's shopping experience is not impaired and the system appears fully functional. This is referred to as "eventual consistency": if a data item stops changing, all nodes that persist this item will eventually converge to the same state. In the meantime, however, different nodes may represent different states for the same data item.
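The fallback mechanism described above can be sketched in a few lines of Python. All class and method names here are illustrative assumptions, not a real API; the toy stock service simulates a network partition with a simple flag:

```python
class StockServiceUnavailable(Exception):
    """Raised when the stock service cannot be reached (network partition)."""

class StockService:
    """Toy stand-in for the stock service; `fail` simulates a partition."""
    def __init__(self, stock):
        self.stock = dict(stock)  # item_id -> stock level
        self.fail = False
    def get_stock(self, item_id):
        if self.fail:
            raise StockServiceUnavailable(item_id)
        return self.stock[item_id]

class Webshop:
    """Webshop that keeps a local replica of the stock data it has seen."""
    def __init__(self, stock_service):
        self.stock_service = stock_service
        self.replica = {}  # item_id -> last known stock level
    def get_stock(self, item_id):
        try:
            level = self.stock_service.get_stock(item_id)
            self.replica[item_id] = level  # refresh the local replica
            return level
        except StockServiceUnavailable:
            # Partition: serve the last known, possibly stale, value.
            # Availability is preserved; consistency becomes "eventual".
            return self.replica.get(item_id, 0)
```

During a partition the webshop answers from its replica, so the value it returns may lag behind the stock service; once the connection is restored, the next successful read brings the replica back in sync.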
Of course, this involves certain risks. Suppose the webshop database records an item as still being in stock. Meanwhile, a change in the stock service removes that item from stock, but the change is not immediately propagated to the webshop. A customer now accesses the webshop at a moment when the connection between the webshop and the stock service is interrupted, so the webshop displays the data from its own database. The customer is pleased to see the item available and buys it. Once the network connection is stable again, this conflict becomes apparent and must be handled as an exception: an email is probably sent to the customer apologizing for offering an item that is no longer in stock, any payments made are reversed, and perhaps an alternative item is suggested or a discount granted for a future purchase. In the end, this solution is probably better for the webshop than the alternative of halting all transactions for the duration of the outage. It is important to understand, however, that eventual consistency also brings challenges and complex exception cases that must be considered and handled correctly. A well-known pattern for performing a rollback in distributed, asynchronously communicating systems in the event of an error, ensuring that a consistent and correct data state is reached in the end, is the Saga pattern, which I would like to introduce under the letter "S".
In conclusion, it cannot be said across the board that high availability is always better or more important than consistency. Take, for example, a banking system that manages account balances and payment transactions. Displaying incorrect account balances, or failing to execute (or executing multiple times) a transaction, must be avoided at all costs in order to maintain customer trust and acceptance. A distributed online banking system would sooner be taken completely offline than offer incorrect information or functionality.
Summary / TL;DR
The CAP theorem states that a distributed system can guarantee at most two of the three properties consistency, availability, and partition tolerance. Since cloud native systems must tolerate network partitions, they effectively have to choose between high availability and strong consistency. In many scenarios, such as the webshop example, availability wins and eventual consistency is accepted, with conflicts handled after the fact; in others, such as banking, consistency is paramount and downtime is the lesser evil.