2. Consistency, Availability and low Latency in Distributed Systems
(working around the CAP/PACELC theorems)
Introduction: "1. Cache and Data Consistency in Distributed systems (CAP/PACELC/CRDT)"
While eventual Consistency is acceptable most of the time, there are still cases when we want strong read-after-write Consistency for certain flows. There is an optimal design that assures strong Consistency inside read-after-write flows. Arguably, this is the highest Consistency level that can be assured without a prohibitive impact on Availability and Latency.
The trick begins with fully asynchronous replication between geographical regions. This assures eventual consistency for the types of reads that can tolerate it, with minimum latency (local reads). Such replication normally completes in under one second. Reads that can tolerate eventual consistency will always use the local data directly.
For flows that need read-after-write (strong) consistency, we design a mechanism to detect the rare cases when the local value is older than data previously written in that flow. For this, the flow needs to carry a small piece of consistency information.
I would argue that any solution that does not carry consistency information in the flow cannot achieve lower latency than this one, while it will incur higher latency in certain scenarios.
This solution to flow consistency requires assigning an increasing "version" to each updated value. When a flow writes data, it receives the (always increasing) version of the updated data. That version must be replicated asynchronously, together with the value, to all geographical nodes, overwriting the old value and the old version - this can be done atomically with minimal cost.
When a subsequent read is done in another geographical node, the request flow will present the version received after the previous write. If the replication was not fast enough, the node serving the read can detect that it has an older version and wait a few more milliseconds for the replication to catch up. Alternatively, the inquired node can make a request directly to the writing node.
If the writing node becomes unreachable in that short period, the read can fail or fall back to eventual consistency - depending on the business requirements. What is important is that the eventually consistent requests will continue to be served with minimum latency, assuring the highest possible Availability.
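To make the mechanism concrete, here is a minimal sketch of a per-region store that follows this design. All names (ReplicaStore, read_consistent, fetch_from_writer) are hypothetical, and the remote call to the writing region is only stubbed; it illustrates the version check and the fallback, not a production implementation.

import time
import threading

class ReplicaStore:
    """One geographical region: local data plus asynchronously replicated versions."""

    def __init__(self, fetch_from_writer=None):
        self._data = {}                              # key -> (value, version)
        self._lock = threading.Lock()
        self._fetch_from_writer = fetch_from_writer  # optional remote call to the writing region

    def apply_replication(self, key, value, version):
        # Called by the asynchronous replication stream; value and version overwrite the old pair atomically.
        with self._lock:
            current = self._data.get(key)
            if current is None or version > current[1]:
                self._data[key] = (value, version)

    def read_eventual(self, key):
        # Eventually consistent read: always local, minimum latency.
        entry = self._data.get(key)
        return entry[0] if entry else None

    def read_consistent(self, key, min_version, timeout_s=0.2, poll_s=0.01):
        # Read-after-write read: the flow presents the version it received when writing.
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            entry = self._data.get(key)
            if entry is not None and entry[1] >= min_version:
                return entry[0]                      # the local copy is recent enough
            time.sleep(poll_s)                       # wait for replication to catch up
        if self._fetch_from_writer is not None:
            return self._fetch_from_writer(key)      # homing to the writing region
        raise TimeoutError("replication lagging and the writing region is unreachable")

A flow that can tolerate stale data calls read_eventual and never blocks; only read-after-write flows pay the (rare) polling or homing cost.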
This solution assumes that it is OK to sometimes receive even newer data, coming from concurrent writes in other flows. Assuring isolation between flows is not covered here; however, it might be addressable with a similar design based on stricter version requirements, similar to detecting a merge conflict in git.
Implementation
If the write is always done in a single node (the "main region"), it is enough to store, for each value, the timestamp of its last update. The timestamp can occasionally be incremented artificially so that it always increases. The timestamp is returned to the writer. The writer presents this timestamp when reading from another region. The reading region decides whether the replicated data has at least that timestamp. If the replication has not finished yet, it polls the data until a timeout, or requests the value directly from the main region.
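Continuing the sketch above, the single-main-region variant could look like this. MainRegionStore and the wiring are hypothetical, and the replication call is shown synchronously only to keep the example short; in a real system it would be an asynchronous stream.

class MainRegionStore:
    """The single writing node: assigns an always-increasing timestamp version to every write."""

    def __init__(self):
        self._data = {}                              # key -> (value, version)
        self._last_version = 0
        self._lock = threading.Lock()
        self.replicas = []                           # ReplicaStore instances in other regions

    def write(self, key, value):
        with self._lock:
            version = max(int(time.time() * 1000), self._last_version + 1)
            self._last_version = version
            self._data[key] = (value, version)
        for replica in self.replicas:                # asynchronous in a real system
            replica.apply_replication(key, value, version)
        return version                               # the flow carries this version

# A read-after-write flow across regions:
main = MainRegionStore()
eu = ReplicaStore(fetch_from_writer=lambda key: main._data[key][0])
main.replicas.append(eu)

v = main.write("player:42:rank", 17)                        # write in the main region
rank = eu.read_consistent("player:42:rank", min_version=v)  # read elsewhere, presenting the version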
There are solutions that allow local writes in any region. The design is similar; however, the "version" handling gets a little more complex.
And now the details
Real-life problem
When reading from another region, the flow will provide the version it received when writing. If the local data has a lower version, the node needs to poll until a timeout for the data to be replicated. For the cases when the replication is lagging, the node can do a homing request to the main region to retrieve the data out of order. In many cases it is acceptable to receive an even newer version than the one written in the flow, and you can have separate handling for such events - based on the version.
Using a timestamp as version
Instead of incrementing a counter, we can assign the current timestamp to the new value. If the current timestamp is smaller than or equal to the previous one, we may need to increment it - so that the version always increases. Of course, the timestamp should be updated atomically to prevent reusing the same value.
If we use milliseconds, we can have a maximum of 1000 updates per second, which is usually enough for a single key. We can use the more expensive nanosecond time for higher update rates.
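The version rule above, factored out as a standalone sketch (millisecond resolution, artificially incremented when the clock does not move forward; the class name is hypothetical):

import time
import threading

class MonotonicVersion:
    """Issues millisecond timestamps that are guaranteed to strictly increase."""

    def __init__(self):
        self._last = 0
        self._lock = threading.Lock()

    def next(self):
        with self._lock:                             # atomic: the same value is never reused
            now_ms = int(time.time() * 1000)
            self._last = max(now_ms, self._last + 1)
            return self._last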
Having the version as a timestamp brings some advantages. First, we always know how old the data is. When we add additional caching on top of the data, the value can lag in the cache far longer than the replication latency. Having the exact update time allows more granular decisions for cached data. Even without caching, the replication can sometimes lag, and we can use the timestamp to decide how old is too old.
When we want to enforce strong flow consistency, no older data is good enough. When we settle for eventual consistency, we favor Availability by allowing stale data, as long as that data is not too old. Different business flows might have different age requirements for the same data.
Note that you cannot have synchronized clocks across regions. The clock is only consistent with the update order inside one region - when taken from a single server. When you write in multiple regions, you cannot reliably compare timestamps when the distance between them is small (milliseconds).
Example of age trade-offs
Unlike plain counters, you can use timestamps more creatively, based on the "age" of the cached data. For example, even when the game player's rank updates every 5 seconds, you can read it with a maximum age of 3 minutes during the game - to allow aggressive caching and low latency.
However, after each game you can require a maximum age of 6 seconds to have the best chance of receiving the last updated value. If the local data is older than this, you can just show "data is not available". The important thing is that you will not block that request for too long, even if there are network issues between nodes.
Some data, like age or name, can be cached for days - maximizing Availability - and only the flows that update them should require a very fresh value - enforcing Consistency.
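As a sketch of how such age trade-offs could be expressed on top of timestamp versions (read_with_max_age is a hypothetical helper, and eu is the replica from the earlier sketch); coarse ages of seconds or minutes also tolerate the cross-region clock skew mentioned above:

def read_with_max_age(store, key, max_age_ms):
    """Return the local value only if it was updated recently enough, otherwise report it as unavailable."""
    entry = store._data.get(key)
    if entry is None:
        return None
    value, version = entry                           # the version is a millisecond timestamp
    age_ms = int(time.time() * 1000) - version
    return value if age_ms <= max_age_ms else None

# Different flows, different freshness requirements for the same key:
rank_in_game   = read_with_max_age(eu, "player:42:rank", max_age_ms=3 * 60 * 1000)  # during the game
rank_post_game = read_with_max_age(eu, "player:42:rank", max_age_ms=6 * 1000)       # after the game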
Global counter versus local counter
Having a monotonic version that is global to the data repository opens the possibility of additional consistency requirements. For example, some flows might want to never read data that is older than any data that was written or read before in the flow. This can assure consistency over multiple chunks of data even if the replication order is not guaranteed. The flow must simply present the maximum of the versions obtained for its previous reads and writes.
However, the global counter can become a bottleneck for the system. The more scalable approach is to have a separate counter for each value, which only increases when that value is updated. In this case you cannot compare versions between different data items, but the system scales better.
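With a global monotonic version, the flow-side bookkeeping can be as small as this (FlowContext is a hypothetical name):

class FlowContext:
    """Carries the highest version observed by this flow across all its reads and writes."""

    def __init__(self):
        self.min_version = 0

    def observe(self, version):
        # Call after every read or write; subsequent reads present min_version,
        # so the flow never sees data older than anything it already saw or wrote.
        self.min_version = max(self.min_version, version)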
Bigger consistency domains
Note that you can assign the version at whatever granularity your case requires. If your data needs to be consistent across multiple keys, you can assign one version to all those keys. You only need to assure that, locally in each region, the version is updated atomically together with all the data it guards.
Just remember that using bigger strong-consistency domains results in higher latency - because there is a bigger chance that another flow updates the same data and raises the version even higher than what this flow would need.
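A sketch of a bigger consistency domain, where one version guards a group of keys and is updated atomically together with them (again purely illustrative):

import time
import threading

class ConsistencyDomain:
    """A group of keys guarded by a single monotonic version, updated atomically."""

    def __init__(self):
        self._values = {}                            # all keys belonging to the domain
        self._version = 0
        self._lock = threading.Lock()

    def write(self, updates):
        # 'updates' maps keys to new values; the whole domain moves to one new version.
        with self._lock:
            self._version = max(int(time.time() * 1000), self._version + 1)
            self._values.update(updates)
            return self._version

    def read(self, keys, min_version):
        with self._lock:
            if self._version < min_version:
                return None                          # replication has not caught up for this domain yet
            return {key: self._values.get(key) for key in keys}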
You can theoretically create very long consistent flows. The drawback is that the latency increases with the number of dependencies, and it becomes harder to deal with concurrent writes from other flows.
At the limit, if you try to make all your data a single consistency domain... then you end up with the very CAP/PACELC limitations that you wanted to avoid. The trick is to isolate small consistency domains for each flow that has a business need for read-your-writes consistency.
A similar strategy can be applied for writing in multiple regions. The difficulty is to create monotonic versions across regions without coordination, for example using vector clocks. This should be the subject of a future article.
Read also:
3. Data Consistency and the "Theory of relativity"
Please share this article if you find it interesting.
Subscribe by email or RSS feed if you want to receive future articles.
Thank you.