Cache Is a Lie You Agree to Believe

Caches are among the most powerful tools in distributed systems, but they introduce modal behavior that can turn a performance optimization into an existential threat to system stability.

"What's interesting and important here is that these are both stable loops. Unless something changes, the system can run in either one of these modes forever. That's good in the case of the good loop, but bad in the case of the bad loop." (Marc Brooker)

A cache creates two distinct operating modes. In the happy loop, the cache is warm, database load is low, latency is good, and everything works. In the sad loop, the cache is cold, all requests hit the database, the database slows down, requests time out before the cache can be filled, and the cache stays cold. Both loops are self-reinforcing: the system is bistable. This is a textbook metastable failure waiting to happen, and it is arguably the most common one in production systems.
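
The two loops can be illustrated with a deliberately crude toy model. Everything in it is an assumption made for illustration rather than something taken from the quoted source: the names step and run, the request rate, the database capacity, and the fill and decay constants are all hypothetical. The only point is that the same system, under the same load, settles into a different stable state depending on where the cache starts.

```python
def step(hit_rate, request_rate, db_capacity,
         max_hit_rate=0.95, fill=0.2, decay=0.2):
    """Advance the toy model by one tick and return the new hit rate."""
    misses = request_rate * (1.0 - hit_rate)
    if misses <= db_capacity:
        # The database keeps up: misses complete and populate the cache,
        # so the hit rate climbs toward its ceiling.
        return min(max_hit_rate, hit_rate + fill * (max_hit_rate - hit_rate))
    # The database is overloaded: misses time out before they can fill
    # the cache, while existing entries keep expiring.
    return max(0.0, hit_rate * (1.0 - decay))


def run(initial_hit_rate, ticks=50, request_rate=1000.0, db_capacity=200.0):
    """Iterate the model and return the hit rate after the given ticks."""
    hit_rate = initial_hit_rate
    for _ in range(ticks):
        hit_rate = step(hit_rate, request_rate, db_capacity)
    return hit_rate


if __name__ == "__main__":
    print("started warm:", round(run(0.9), 3))  # settles near max_hit_rate
    print("started cold:", round(run(0.0), 3))  # never leaves 0.0
```

With these assumed numbers the database can absorb misses only while the hit rate is at least 0.8. A warm start stays above that line and converges toward the ceiling; a cold start, or a warm cache knocked below the line, decays to zero and stays there even though the offered load never changed.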

The problem is more insidious than it appears. Caches "extract cacheability": the traffic that misses the cache is inherently less cacheable than the traffic that hits it. Load testing usually does not reveal the bad loop because caches love high, predictable, well-behaved load. The real danger is load with a heavier-tailed key distribution than normal. And over time, as teams scale down databases because "the cache handles it," the system becomes addicted to its cache. What was a helpful optimization becomes load-bearing infrastructure that nobody remembers life without.

DynamoDB's approach is instructive: it does not allow caches to hide the work that would be performed in their absence, ensuring the system is always provisioned to handle the unexpected. If you run a look-aside cache, consider a read-through cache instead: it allows cache filling to continue even after the application gives up on a request, which steadily increases the hit rate during recovery. Use soft and hard TTLs, refreshing an entry once it passes the soft TTL but continuing to serve it until the hard TTL if the refresh fails, so that stale data can be served during downstream brownouts. And always test with caches disabled to verify your system survives.
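
The soft and hard TTL idea can be sketched as a small read-through wrapper. This is a minimal illustration under stated assumptions, not a production cache: the class name SoftHardTTLCache, the loader callback, and the injectable clock parameter are all hypothetical, and the sketch ignores eviction, concurrency, and request coalescing.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class SoftHardTTLCache:
    """Read-through cache with a soft TTL (refresh) and a hard TTL (expiry)."""

    def __init__(self, loader: Callable[[Hashable], Any],
                 soft_ttl: float, hard_ttl: float,
                 clock: Callable[[], float] = time.monotonic):
        if soft_ttl > hard_ttl:
            raise ValueError("soft_ttl must not exceed hard_ttl")
        self._loader = loader      # hypothetical callback that hits the backend
        self._soft_ttl = soft_ttl
        self._hard_ttl = hard_ttl
        self._clock = clock        # injectable so tests can control time
        self._entries: Dict[Hashable, Tuple[Any, float]] = {}

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        # Younger than the soft TTL: serve from cache without touching the backend.
        if entry is not None and now - entry[1] < self._soft_ttl:
            return entry[0]
        try:
            value = self._loader(key)
        except Exception:
            # Backend brownout: fall back to the stale value, but only
            # while it is still younger than the hard TTL.
            if entry is not None and now - entry[1] < self._hard_ttl:
                return entry[0]
            raise
        self._entries[key] = (value, now)
        return value
```

The design choice is that the soft TTL governs how fresh data is when the backend is healthy, while the hard TTL bounds how stale it may get when the backend is not. A failed refresh within that window costs the caller nothing, which keeps the hit rate from collapsing at the moment the database is least able to absorb the misses.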

Takeaway: Never let a cache hide the true cost of your workload. Provision for the world where the cache does not exist.


See also: Metastable Failures Are the Hardest to Prevent | Efficiency Is The Enemy of Resilience | Goodput Matters More Than Throughput