1. Cache and Data Consistency in Distributed Systems (CAP/PACELC/CRDT)


Abstract

There is always a tension between data Consistency and system Availability when Partitioning a system across datacenters (think CAP). Data caching in particular poses interesting challenges. This tension becomes much more acute as soon as you have two datacenters separated by more than 10ms of latency. I present below some of the problems along with possible solutions.

In the end I will present an elegant solution that maximizes Availability while providing the needed Consistency level for read-after-write flows. The solution requires the client to carry a monotonic id along the flow. I would postulate that any solution where the client doesn't carry some consistency info will have a higher latency than the presented solution (see the chapter "Flow consistency").

The examples below are simplified to be intuitive and easy to understand; however, these learnings also apply to N datacenters.


How it starts

Suppose you started with one single datacenter called Central. It has all your data, and all clients send their requests there. You have some in-memory caching (like Memcached or Redis), and you manage to keep it consistent by making sure that each write to the database also updates the in-memory cache.


Let's say this Central datacenter is in the US, but you start having clients in Europe as well. You want to reduce the latency for European clients, which is around 200ms for a simple ping; it gets multiplied by the HTTPS handshake round trips, making a simple request take over 700ms. You decide to create a second datacenter in Europe; let's call it Satellite.

The Satellite can do the HTTPS handshake with clients much faster and redirect all requests to Central. Each request still takes around 200ms, but not 700ms as before. It's much better, but not ideal. You decide to cache some often-used data in Europe, so you don't need to go to the US each time you need it.


Now the hell begins

Suppose you decide to keep writing only in Central, and serve reads from both Central and Satellite. Allowing writes only on Central makes the explanation easier to follow, while the problem remains almost as complicated as having writes in both datacenters.

If writes are done only in Central, data in Central will always be correct. However, from time to time, data that is cached on Satellite is updated in Central. For example, a client buys a new service, so a local US sub-system updates the client's data. This leaves Satellite with an outdated version of that data in its cache. You could call Satellite each time to invalidate or update its cache, but this would dramatically increase the latency of each write.


Deciding to pay the latency only on writes

A simple solution is to let Satellite always use its cached data, but ensure that each write updates Satellite before the request finishes. Reads from Satellite can be served with low latency - using the cache - while paying the latency cost only on writes. This is worth evaluating if writes are few and reads are many. On the rare occasions when writes in Central cannot update the cache in Satellite, you can decide that it's business-acceptable to serve some stale data on Satellite. Just to be safe, you should set a reasonable expiration time for the cache on Satellite. This ensures that even after a problem that produces inconsistencies, the inconsistency will be fixed by the cache expiration. For example, some clients might tolerate waiting 1h to have full consistency restored.

Even if it looks simple, updating the cache on Satellite when Central has a write has its pitfalls. It is not safe to just copy the new data from Central to Satellite when a write comes to Central. On rare occasions, two updates from Central might get processed in reverse order, leaving Satellite with the older update. It's safer to only invalidate the cache on Satellite, so Satellite will fetch the latest data on the next read. Even if invalidates are processed in reverse order, after the last invalidate Satellite will read the latest version from Central.
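
Below is a minimal, runnable sketch of this invalidate-on-write approach (an illustration, not a production design). Plain dicts stand in for Central's database and the two caches; in a real system the caches would be Memcached/Redis and the invalidate would be a cross-datacenter call:

    central_db = {}        # source of truth, lives in Central
    central_cache = {}     # in-memory cache in Central
    satellite_cache = {}   # in-memory cache in Satellite (normally with a TTL)

    def central_write(key, value):
        central_db[key] = value
        central_cache[key] = value
        # Do NOT push the new value to Satellite: two pushes can arrive out of
        # order and leave the older value there. An invalidate is safe even when
        # reordered, because the next read re-fetches the latest data.
        satellite_cache.pop(key, None)   # stands in for the cross-DC "invalidate" call

    def satellite_read(key):
        if key in satellite_cache:       # fast path: served locally in Europe
            return satellite_cache[key]
        value = central_db[key]          # pay the inter-datacenter latency once
        satellite_cache[key] = value     # cached again until the next invalidate/expiry
        return value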

The above solution is technically correct, but it dramatically reduces the effectiveness of the cache, as data needs to be fetched again from Central. You actually pay the latency twice: on the invalidate and on the first read after it. On top of that, until the next read, the cache entry might expire. If you want, you can force a data refresh after each invalidate. Still, you don't want to keep the cache timeout too high because, on rare network outages, some cached data might remain outdated for that entire expiry period.

You might imagine an additional system that asynchronously propagates data updates, so the system can recover after an error. This is not such a good idea, because having two update channels makes it possible for an older update to be processed after a newer one, leaving the data inconsistent. It is safe, however, to have a second channel of invalidates. After an invalidate, the next read forces Satellite to read the latest data from Central.

There is a stronger result than the CAP theorem, proving that, in order to ensure (strong) Consistency, you need to pay the latency between Satellite and Central even when there is no network partitioning: PACELC.


Trying to reduce the latency on writes

One idea is not to call Satellite when Central knows that the updated data is not currently cached on Satellite. This can be achieved by storing, for each piece of data in Central, a flag telling whether it is currently cached on Satellite. The problem with this approach is that the more you cache on Satellite, the more writes need to pay the latency. It improves writes only by reducing the caching on Satellite - impacting Availability. Still, let's see how it goes.
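
A small sketch of this idea, with the same in-memory stand-ins as before. The 'cached_on_satellite' set plays the role of the per-item flag and is maintained by Central (how Satellite reports its cache fills back to Central is left out):

    central_db = {}
    satellite_cache = {}
    cached_on_satellite = set()   # Central's flag: which keys Satellite currently caches

    def invalidate_on_satellite(key):
        # Stand-in for the real (expensive) cross-datacenter call.
        satellite_cache.pop(key, None)

    def satellite_cache_fill(key, value):
        satellite_cache[key] = value
        cached_on_satellite.add(key)      # Satellite tells Central the key is now cached

    def central_write(key, value):
        central_db[key] = value
        if key in cached_on_satellite:    # only these writes pay the overseas call
            invalidate_on_satellite(key)
            cached_on_satellite.discard(key)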

In this case, only updates to data that was cached on Satellite require an overseas call. Still, if something goes wrong with the connection between continents, Central would have to block those writes while waiting for Satellite to respond. If you are not careful, you can run out of resources very fast (threads, file descriptors, memory). This compromises the Availability of the service not only for Satellite/Europe, but also for Central/US. Think of the A in the CAP theorem.

You can try to protect the writes with a timeout, but what should happen on such a timeout? If you let the write proceed in Central, Satellite will keep its outdated cache until it expires. You could set a very low cache expiration on Satellite, but doing that makes the cache ineffective, as entries will expire before the data is reused. Creating an additional system to replicate data when the connection is restored is a nightmare that you should avoid as much as you can.
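
One way to sketch such a timeout (again an illustration, not the author's implementation): submit the cross-datacenter invalidate to a bounded thread pool and decide what the write does when the call does not answer in time. The policy below accepts the write and relies on Satellite's cache expiration to eventually fix the staleness:

    import concurrent.futures

    # Bounded pool so slow overseas calls cannot exhaust threads in Central.
    rpc_pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)

    def central_write(key, value, db, invalidate_on_satellite, timeout_s=0.5):
        db[key] = value
        future = rpc_pool.submit(invalidate_on_satellite, key)
        try:
            future.result(timeout=timeout_s)
            return "write ok, Satellite cache invalidated"
        except concurrent.futures.TimeoutError:
            # Business decision: accept the write anyway; Satellite may serve
            # stale data until its cache entry expires (e.g. after 1 hour).
            return "write ok, Satellite may be stale until cache expiry"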

A solution that is sometimes applied is to shard the data close to the region where it is used most often. European clients' data stays in Europe, US clients' data stays in the US. This is an acceptable solution if you don't have many writes from the other datacenters. Still, it does not permit failing over from Satellite to Central for disaster recovery. Maybe Satellite is not such a mature datacenter and you may want to redirect its requests to Central sometimes; however, in this setup, Europe's data is not available in Central. You might need this approach for data sovereignty, but for resiliency it is not very appealing.

Alternatively, you can keep all data in both Central and Satellite and assign the closest datacenter as the "master" for each specific piece of data. The master datacenter is in charge of writes, keeping a kind of "lock" on that data. The master should always know where that data is cached and invalidate it on each write. This reduces the average latency if data is usually consumed close to its "master" datacenter. However, the latency increases when the data is frequently accessed from multiple datacenters. Also, this solution is really hard to get right.

In order to be more adaptable, the "master" of the data should move between datacenters based on the access pattern. However, keeping multiple datacenters in agreement about which one is the current master is so complicated that it will almost surely create more problems than it solves.


Eventual consistency

Most of the time, reading data that is a few seconds behind is not a big deal. If the client can simply refresh and see the update, you can use eventual consistency. You can asynchronously replicate all data changes without blocking writes. Don't try to build it yourself - there are many edge cases, for example after a disconnect. One good option is to use a database system that provides asynchronous replication. In this way, you can have all data in all datacenters, which is a really good thing for resiliency (Availability).

Many storage systems can use multiple channels of replication between multiple systems while guaranteeing "strong eventual consistency": all systems converge to the same image of the data regardless of the order in which updates are received. See CRDTs for details. Some cases can still be problematic, such as updating the same entry concurrently. This works best when data is kept immutable: never update, only insert/delete (see CQRS).
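
As an illustration (not something taken from the systems above), here is a toy last-writer-wins register, one of the simplest CRDTs: every replica applies the same merge rule, so all replicas converge to the same value no matter in which order the updates arrive:

    from dataclasses import dataclass

    @dataclass
    class LWWRegister:
        value: object = None
        tag: tuple = (0, "")          # (timestamp, writer_id); writer_id breaks ties

        def merge(self, other):       # commutative, associative, idempotent
            if other.tag > self.tag:
                self.value, self.tag = other.value, other.tag

    central = LWWRegister("plan=silver", (1, "central"))
    central.merge(LWWRegister("plan=gold", (2, "central")))      # updates in order

    satellite = LWWRegister()
    satellite.merge(LWWRegister("plan=gold", (2, "central")))    # same updates, reversed
    satellite.merge(LWWRegister("plan=silver", (1, "central")))  # stale update is ignored

    assert central.value == satellite.value == "plan=gold"       # both replicas converged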

In case of a partition between Central and Satellite, replication will fall behind. While in normal conditions asynchronous replication can update Satellite in less than a few seconds, a communication problem can leave Satellite with data that is tens of minutes old, even hours. Even after communication is restored, it might take hours for the replication to catch up. You have to choose a strategy for this situation: how long you are willing to serve stale data. At some point you might decide to refuse serving traffic from the Satellite datacenter, or to redirect requests from Satellite to Central if possible.

If you think about it, in case of a network Partitioning between Central and Satellite, risking serving stale data is the best you can do. Without a connection to Central, Satellite has no means to know whether a certain piece of data was updated on Central after the partition started. The choice is to serve possibly stale data (sacrifice Consistency) or not serve data at all (sacrifice Availability). Eventual consistency sacrifices a bit of Consistency in order to improve Availability. Which is better is a business decision.

For something like a bank account, eventual consistency is usually not an option. If you decide not to serve data in case of replication lag, you need an additional health check to know when replication has stopped. No data from Central can mean either that there was no update or ... that Satellite has lost communication with Central. Even then, you can still have some milliseconds of lost updates.
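
One common trick for such a health check (an assumption here, not necessarily what your database offers out of the box) is a heartbeat row: Central writes it every second, it replicates like any other data, and Satellite raises an alarm when the last heartbeat it sees becomes too old:

    import time

    def central_heartbeat_tick(central_db):
        # Called e.g. every second in Central; replicated asynchronously like any row.
        central_db["replication_heartbeat"] = time.time()

    def satellite_replication_lag(satellite_db):
        last_seen = satellite_db.get("replication_heartbeat", 0.0)
        return time.time() - last_seen

    def satellite_can_serve(satellite_db, max_lag_s=30.0):
        # A stale heartbeat means replication (or the link) is broken, because
        # Central keeps writing it even when there is no real traffic.
        return satellite_replication_lag(satellite_db) <= max_lag_s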


The read-after-write problem

Having a couple of seconds of update delay might not even be noticeable for most usages. However, there is a case that is really painful: when you have a client flow that involves writing in Central and immediately reading from Satellite. For example, a client pays to have a new service configured in Central, then the client is redirected to Satellite to use the new service. In this case, the chance of reading stale data increases dramatically.

The problem is that, even if these cases are rare, you cannot easily decide when a read must be consistent with a write that happened just a moment ago. If you knew this, the right solution would be to call Central for those cases, to be sure that you serve the latest available data.

In order not to pay the latency on all reads, the client should provide an additional hint when doing a read-after-write: telling that this read must be fully consistent and no cache is allowed. When you write something to Central, the data replication needs to pay the inter-datacenter latency anyway. It's better to pay that latency on Satellite (at read time) than to block the write in Central.
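
A minimal sketch of such a hinted read on Satellite (the helper names are placeholders):

    def satellite_read(key, satellite_cache, fetch_from_central, no_cache=False):
        if not no_cache and key in satellite_cache:
            return satellite_cache[key]      # fast path, possibly slightly stale
        value = fetch_from_central(key)      # read-after-write: pay the inter-DC latency
        satellite_cache[key] = value
        return value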

Such reads after writes in a flow will have lower Availability than cached reads. On the other hand, if Satellite is disconnected from Central, there is no way for Satellite to see the latest write on Central anyway. The important thing is that we can continue to serve many other reads that are fine with a "close to realtime" view of the data (eventual consistency).



Flow consistency

Most of the consistency issues happen in a "read after write" flow, as above. A simple solution is for the client flow to pass a "no cache" option when reading from Satellite after a write in that flow. However, this option might force a read from Central even when the data has already propagated to Satellite through near-realtime asynchronous replication. You can improve this by asking the client to carry some consistency data.

When a write is done in a flow, the client receives from Central a monotonic id - a version of the written data. This id must be asynchronously replicated to Satellite along with the data. When reading from Satellite, the client presents the id of the written data. Satellite should ignore cached data that has a lower id, in that case making a forced request to Central.

Sometimes Satellite might serve even newer data (a higher id) - if another update came shortly after. Still, clients are guaranteed to never receive stale data - data that was actually updated by the previous write in the client's flow. As the updated data should arrive in less than 1 second, you can also poll at Satellite for the updated data to arrive (with a timeout).
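
Here is a sketch of that Satellite read path (the polling interval, timeout, and helper names are assumptions). The client passes the id it received from Central as min_version; reads that are not part of a read-after-write flow can pass 0:

    import time

    def satellite_read(key, min_version, satellite_store, fetch_from_central,
                       poll_timeout_s=1.0, poll_interval_s=0.05):
        deadline = time.monotonic() + poll_timeout_s
        while True:
            entry = satellite_store.get(key)           # entry = (version, value) or None
            if entry is not None and entry[0] >= min_version:
                return entry[1]                        # replicated data is new enough
            if time.monotonic() >= deadline:
                break
            time.sleep(poll_interval_s)                # replication usually catches up < 1s
        # Replication did not catch up in time: pay the latency and ask Central.
        version, value = fetch_from_central(key)
        satellite_store[key] = (version, value)
        return value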

For requests that tolerate eventual consistency, the outdated cache can still be used for reads - while it should normally be updated in less than 1 second. For these reads, the cache can ensure a very low latency for Europe's clients. Also, if you detect a network problem between Satellite and Central, you can reconfigure the system to preserve Availability at the cost of serving some stale data (less Consistency). You can fine-tune how old is too old by setting an appropriate cache expiration.


Using timestamps

If you are unable to replicate a data version id from Central to Satellite, you can use the client's write timestamp instead of an id received from Central. This is also monotonic; however, it creates other difficulties, as there might be a small clock skew between the client, Central, and Satellite.

Even if Satellite has cached data with a timestamp a few milliseconds higher than the write timestamp (as seen by the client), that data might actually be a little older than the write, because the clocks are not perfectly in sync.

Because of this, Satellite still needs to pay the latency to Central when the presented timestamp is recent. However, when the cached data is newer than the presented timestamp by more than a realistic clock skew, Satellite can safely serve the cached data.
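
A sketch of that decision, with the maximum realistic clock skew as an explicit assumption:

    MAX_CLOCK_SKEW_S = 0.2   # assumption: clocks differ by at most 200 ms

    def satellite_read(key, write_ts, satellite_store, fetch_from_central):
        entry = satellite_store.get(key)         # entry = (cached_ts, value) or None
        if entry is not None and entry[0] - write_ts > MAX_CLOCK_SKEW_S:
            return entry[1]                      # clearly newer than the write, even with skew
        ts, value = fetch_from_central(key)      # too recent/uncertain: go to Central
        satellite_store[key] = (ts, value)
        return value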





Conclusion

The above solution solves many practical consistency requirements. The provided consistency is a little weaker than Causal Consistency, while being simpler to implement and allowing aggressive caching of data for the cases that tolerate eventual consistency. This solution can also be applied to a master-to-master system (writes in all datacenters).

I believe the solution of asking the client to carry the monotonic id is actually the optimal one for minimizing the Latency cost while ensuring maximal Availability. All other solutions require a rendezvous synchronization point that cannot always be in the proximity of the read/write. I am not aware of existing research on this particular solution; let me know if it already has a name.


See part 2: Flow consistency



Please share & subscribe if you find it interesting.
Thank you.
