Have you ever come across an advertisement for a service like graphic design or house painting that begins with a catchy headline: “Cheap, Fast, and Good: Pick Two”? Cheap represents the cost of the project.
Fast represents the speed at which the tradesperson completes the project, and good signifies the level of craftsmanship and excellence.
The tradesperson is simply saying it’s unreasonable for a client to expect to get all three. There is a trade-off and the client should choose what to forgo:
* Fast and good = expensive
* Fast and cheap = low quality
* Good and cheap = takes time
What is the CAP theorem?
The CAP theorem applies the same kind of trade-off to distributed systems.
A distributed system is made up of multiple nodes storing data, and it can guarantee at most two of three desired characteristics: Consistency (C), Availability (A), and Partition tolerance (P).
Consistency
Consistency means every read request from clients receives the most recent write. All clients see the same data no matter which node they are connected to.
For this to happen, whenever a write reaches one database node, the data must be replicated to all other database nodes. In other words, an update is reflected on every node before any client can read the old value.
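The replication idea above can be sketched with a toy in-memory cluster. This is only an illustration under simplifying assumptions: the `Node` class, the `replicated_write` function, and the node names are hypothetical, and the "network" is just a Python loop that never fails.

```python
class Node:
    """A toy in-memory database node (hypothetical, for illustration only)."""

    def __init__(self, name):
        self.name = name
        self.data = {}


def replicated_write(nodes, key, value):
    """Apply the write to every node before acknowledging it to the client."""
    for node in nodes:
        node.data[key] = value
    return True  # acknowledged only after all replicas are updated


nodes = [Node("n1"), Node("n2"), Node("n3")]
replicated_write(nodes, "balance", 100)

# A read from any node now returns the same, most recent value.
reads = [node.data["balance"] for node in nodes]
print(reads)  # [100, 100, 100]
```

Because the write is acknowledged only after every replica holds the new value, a client gets the same answer no matter which node it reads from.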
Availability
Availability means that any client making a request should always get a valid response even when some nodes are down or unavailable.
Partition Tolerance
A partition is a communication break, which happens when the connection between two or more nodes in a distributed system is lost or delayed. Partition tolerance means that the cluster should continue functioning despite any number of communication breakdowns between nodes.
To make it easier to understand, let’s look at a concrete real-world example.
Real World Example
Let’s say you have developed a system and it gets quite popular. You decide to scale horizontally by adding two database nodes (n2 and n3) to the original node (n1) in order to serve clients’ requests more effectively.
Ideal situation
When data is written to n1, it is replicated to the other nodes (n2 and n3). Both data consistency and system availability are achieved. This is an example of a CA (consistency and availability) system.
A CA system cannot exist in real-world applications since network failure is unavoidable. Network partitioning generally has to be tolerated.
Normal situation
In the real world, network partitions cannot be avoided, and when one occurs we must choose either consistency (CP) or availability (AP). Suppose n3 is cut off and cannot communicate with n1 and n2. If clients write to n1 and n2, the data cannot be propagated to n3. This means n3 will return stale data when a user makes a request via this node.
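A minimal sketch of how the stale read arises, assuming a toy in-memory model: the `Node` class, its `reachable` flag, and the `write` function are hypothetical names used only to simulate the partition.

```python
class Node:
    """A toy in-memory database node (hypothetical, for illustration only)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.reachable = True  # False while the node is cut off by a partition


def write(nodes, key, value):
    """Replicate a write to every node that can currently be reached."""
    for node in nodes:
        if node.reachable:
            node.data[key] = value


n1, n2, n3 = Node("n1"), Node("n2"), Node("n3")
cluster = [n1, n2, n3]

write(cluster, "status", "v1")  # all three nodes hold "v1"
n3.reachable = False            # simulated partition: n3 is cut off
write(cluster, "status", "v2")  # only n1 and n2 receive "v2"

# A client connected to n3 still sees the old value.
print(n1.data["status"], n3.data["status"])  # v2 v1
```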
A CP (consistency and partition tolerance) system like a bank will choose consistency over availability. It must block all write operations to nodes n1 and n2 to avoid data inconsistencies among the three nodes, which makes the system unavailable.
Bank systems have extremely high consistency requirements. For instance, it’s crucial for a bank to display a client’s latest balance to avoid overdrafts.
If data inconsistency occurs due to a network partition, the system should become unavailable and return an error until the inconsistencies are resolved.
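The CP behavior described here can be sketched as a write path that refuses requests while any node is unreachable. This is a simplified illustration, not how a real bank is built: `Node`, `Unavailable`, and `cp_write` are hypothetical names, and the all-nodes-reachable rule follows the simple three-node example in this article.

```python
class Unavailable(Exception):
    """Raised when the cluster rejects a request to protect consistency."""


class Node:
    """A toy in-memory database node (hypothetical, for illustration only)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.reachable = True  # False while the node is cut off by a partition


def cp_write(nodes, key, value):
    """Accept a write only if it can be applied to every node."""
    if not all(node.reachable for node in nodes):
        raise Unavailable("partition detected; write rejected")
    for node in nodes:
        node.data[key] = value


n1, n2, n3 = Node("n1"), Node("n2"), Node("n3")
cluster = [n1, n2, n3]

cp_write(cluster, "balance", 100)  # healthy cluster: write succeeds
n3.reachable = False               # simulated partition: n3 is cut off

try:
    cp_write(cluster, "balance", 50)
    rejected = False
except Unavailable:
    rejected = True  # the client gets an error instead of a success

balances = [node.data["balance"] for node in cluster]
print(rejected, balances)  # True [100, 100, 100]
```

The rejected write costs availability, but all three nodes still agree on the balance.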
An AP (availability and partition tolerance) system will keep serving client reads even when the data is stale. Nodes n1 and n2 will keep accepting writes, and the data will be synced to n3 once the network partition is resolved.
This architecture is popular in systems like social media platforms where users’ uninterrupted access and interaction with the system are prioritized. Social media users can continue browsing through the content even if the data is stale.
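As a rough sketch of the AP behavior, the toy model below keeps accepting writes during a partition and replays the missed ones when the partition heals. The names `Node`, `ap_write`, and `heal` are hypothetical, and the sketch assumes writes only arrive on the reachable side, so it does not deal with conflicting updates.

```python
class Node:
    """A toy in-memory database node (hypothetical, for illustration only)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.reachable = True  # False while the node is cut off by a partition
        self.missed = []       # writes this node did not receive while cut off


def ap_write(nodes, key, value):
    """Always accept the write; remember it for nodes that cannot be reached."""
    for node in nodes:
        if node.reachable:
            node.data[key] = value
        else:
            node.missed.append((key, value))


def heal(node):
    """End the partition and replay missed writes in their original order."""
    node.reachable = True
    for key, value in node.missed:
        node.data[key] = value
    node.missed.clear()


n1, n2, n3 = Node("n1"), Node("n2"), Node("n3")
cluster = [n1, n2, n3]

ap_write(cluster, "post", "draft")
n3.reachable = False                    # simulated partition: n3 is cut off
ap_write(cluster, "post", "published")  # still accepted by n1 and n2

stale = n3.data["post"]   # n3 keeps serving the old value
heal(n3)
synced = n3.data["post"]  # n3 catches up after the partition heals
print(stale, synced)  # draft published
```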
Conclusion
Distributed systems make a group of computer nodes work together to achieve a level of computing power and system availability that was not attainable in the past.
Such a system can offer lower latency, higher throughput, and close to 100% uptime.
Distributed systems offer many benefits. However, according to the CAP theorem, you can only guarantee two out of the three properties:
Consistency
Availability
Partition tolerance.
This simply means, in the event of a network partition, you have to choose whether to prioritize system availability and sacrifice consistency or vice versa.
It’s important to note that the CAP theorem does not provide a prescriptive answer but rather highlights the trade-offs that need to be considered in distributed system design. Different systems may make different choices based on their specific requirements and constraints.