Fail Fast Is Failing… Fast!
Changes in our compute environments are placing pressure on our tried and true distributed systems solutions
Fail Fast, Fault Tolerance, and Gray Failures
For over 40 years, FAIL FAST  has been the dominant way we achieve fault tolerance. In this approach, some mechanism is responsible for ensuring that each component is up, functioning, and responding to work. As long as it’s still alive and healthy, we continue forward, when something goes awry with the component, it is removed from the system and the remaining components reorganize themselves to continue. For many (but not all) of our systems, this is how we ensure the system remains alive when pieces fail.
As we move to leverage cloud computing, this is getting more challenging. First of all, I love cloud computing and think it is an essential step forward for our industry. Still, the way we create robust solutions is under pressure as the individual components experience emerging challenges called GRAY FAILURES. In a gray failure, a server or part of the network doesn’t fail fast but instead starts running slow. Running slow is WAY worse than failing-fast. The slow component, sometimes running at less than 1% of its normal speed, may be healthy enough to say “I’m still here!” but slow enough to clog up all the work. This makes fail-fast schemes vulnerable.
Back Before Everything Turned Gray
Back in days of yore, hardware for your servers was in YOUR datacenter in tightly controlled environments. Some salesperson had convinced you that you needed the best, most expensive, and most resilient servers. You knew if you bought THOSE servers, you wouldn’t get fired. Connecting those servers, you had a local network supporting only your stuff. You could ensure that the traffic running in your local network wasn’t TOO crowded. Because of this, the individual servers could respond predictably to high priority messages like “How’re you doing, Good Buddy?” The messages flew through the over-provisioned network like a trip down Highway 101 through San Francisco at 4am on Sunday morning.
Leveraging these extremely probable answers to the inquiries about health, it was easy for a supervising server to impose rules for health or sickness. There was an expected time for an answer to this health check. If the response took too long, you’d try again. The longer you go without an answer, the more likely the lagging server was truly sick. After a reasonably small number of attempts, you could give up and declare the party pooper was excluded from the party. Yeah, the decision was probabilistic but the odds were really, really good and you could decide pretty fast.
Next, it became important to crisply evict the renegade node. Sending it a message to cease and desist may or may not make a difference. Frequently, this is called STONITH (Shoot The Other Node In The Head) . Sometimes, the other node has a hard head and STONITH doesn’t work. Another trick is to ensure the banished node cannot do any harm. If you can guarantee the wayward node can’t make any change OUTSIDE the node, the rest of the team can move on with our lives. Let it rot in solitary.
In this fashion, you could detect the node had failed and ensure it had completely failed. In the past, it didn’t take too long to do this.
It’s Hard to Avoid Being an Ass
There is a paradox called Buridan’s Ass  named after the 14th century philosopher Jean Buridan. Buridan’s Ass highlights an apparent contradiction inherent to the concept of determinism or the belief that every event is derived from previous events. The paradox proposes that a very hungry donkey is placed exactly halfway between two bales of hay. Assuming the donkey will go to the closest resource, it will starve to death.
In electronics, there is a phenomenon called METASTABILITY. It means the system may be in an unstable state for an unbounded time. To have a valid signal as an output to the circuit, it must reside within a certain voltage or current level. If the output lands in the middle gray area, wonky things happen to the next circuit. It, too, can do weird things.
This metastability is inherent within asynchronous circuits. Since you don’t know the timing of the input signals, some of the time the various inputs arrive simultaneously and the circuit can’t decide between the two bales of hay. Asynchronous digital systems usually add arbiters to the input signals to ensure the signals are ordered and, hence, avoid metastability. When working within a clock domain on a synchronous device, the clock is used to ensure the timing of the inputs provided to the logic circuit and they avoid metastability problems. As synchronous circuits receive incoming asynchronous signals, special synchronizers work to make the probability of metastability vanishingly small but still possible.
If we do a lift-and-shift of a distributed system onto a complex cloud computing environment, there’s a bunch of new problems posed to our distributed systems design. Running a server in a virtual machine gives you a LOT of valuable computing for a good price. However, it may do so according to its own timeframe. The noisy neighbor problem refers to varying capacity for YOUR virtual machine as it competes for resources with OTHER virtual machines on the same physical servers. Multicore servers add to the fun as their coordination may or may not stall messages you hope to send. Trips through the cloud networking infrastructure may experience congestion causing communication emulating the US Post Office. That makes timer-based fail-fast probabilistic. It always was probabilistic but now the odds have shifted away from miniscule probabilities to simply very-hard-to-debug rare probabilities.
Before, when we composed distributed systems with very predictable servers and networks, you had a darned good idea how quickly your cohort would answer a health check. Using that expectation, you could remove sick nodes quickly while only RARELY removing a healthy node. It was a probability game where the odds were EXTREMELY high that you’d make the right decision. This is much like circuits working within a clock domain to avoid metastable behavior.
Servers and/or the messages to and from them don’t always work at a predictable pace. It’s much the same as removing the clocking within a synchronous circuit so it is an asynchronous digital system. The probability we can dampen and remove metastability has become WAY lower.
Consider a team a team of jugglers handling dozens of balls on stage. If some of them are sent into slow motion in a non-correlated way, it can be a problem. Each juggler has his own timeframe and sees balls arrive when they arrive. For a short while this may make sense. It becomes virtually impossible to juggle time-based interactions across the team as each juggler’s time speeds up and slows down unbeknownst to them.
When the old deployment of a distributed system is lifted-and-shifted into the cloud, it’s like a trip to the fun zone at the carnival. Time and distance get distorted and fail-fast decisions become unpredictable. When servers don’t approximately march to the beat of the same drummer, our distributed systems algorithms can become metastable.
Even worse, most of our systems depend on other systems in the datacenter. HDFS depends on Zookeeper. Zookeeper and other services depend on DNS . Each of these systems has timeouts and retries. This can lead to cascading delays, timeouts, and failures. This is another form of metastability that can accentuate problems rather than damper them.
When this metastability interferes with a primary or leader in your distributed system, you don’t have rapid access to the “perfect truth” or linearizability that many of our systems require. Availability can suffer quite a bit.
Distributed Algorithms: “Pretty Good” Is Better than Perfection
Now, let’s turn our attention to the importance of algorithms that don’t mandate the perfect latest answer. When a request to a user is based on a pool of replicas containing stale state, any one of the replicas will suffice. The background work to update the replicas with new state might be quite a bit behind and that’s OK. An excellent example of “pretty good” can be found when we read a replica of the product catalog or of product reviews in online retail. Similarly, as we do a web search, we access all the shards for the search terms in the query. It is not essential that each request returns the absolutely most recent. “Pretty good” is pretty darned good.
When sending work to a pool of servers with replicas of these caches, the client can time out and retry to bound the latency. It doesn’t really care too much about the health of an individual server as the client can get a good answer from another by retrying using a technique called HEDGING . The state of the unresponsive server is not too important for a while. Letting it be metastable while it lives in an uncertain state is OK. Frankly, we don’t need to have a perfect answer about the server’s membership in the cluster. Over time, it will start to behave better or we’ll toss it out of the cluster. Demanding a perfect answer in a hurry is not needed.
We see similar emerging tolerant algorithms for logging. Apache Bookkeeper  is an open-source logging subsystem in which writes to append new records to the log do not need to get to all log servers containing replicas of the log. Getting to a required subset is enough. Similarly, AWS Aurora ,  runs a database in a centralized server but sends its updates to a pool of storage servers. Because not all the storage servers need to respond in a timely fashion, the approach used by Bookkeeper and Aurora is dramatically more resilient to delays in servers. An individual replica can live in its own time-warp and our broader distributed algorithm proceeds happily.
I think of this as just like water flowing around a rock in a river. Getting ENOUGH WORK through is dramatically more resilient than insisting ALL THE WORK GETS THROUGH.
Without Fail Fast, Primary/Backup Algorithms Are Doomed
So, when our traditional algorithms depend on perfect knowledge of membership in a cluster, what can we do? Many systems are built expecting perfect knowledge to give perfect answers. For example, HDFS (Hadoop Distributed File System)  has a Name Node that is responsible for the allocation of data pages. Unless HDFS knows that we have either zero or one primary NameNode at a time, the filesystem may be corrupted.
When one server fails fast, it is replaced, data sloshes around the cluster, and pretty soon we’re back in business.
HDFS depends on fail fast as its model for liveness. I worry about our continued ability to use fail fast and still follow the motto of that old Timex watch commercial: “It takes a licking but keeps on ticking.” 
Have You Ever Met a Stable Distributed Algorithm?
Well… some of our algorithms are, indeed, stable when running on top of a surrealistic world. Some, however, are not. Any algorithm dependent on a fixed set of responsive servers conveying strongly consistent linearizable data will increasingly see challenges if the trends towards metastable clusters continues.
There are a number of reasons why parts of a cluster may perceive timing differences with other parts:
Servers or VMs not getting the expected amount of the computation for a period of time,
Multicore servers may be internally coordinating or stalling ahead of answering your question,
Networks not providing a pair of servers their fair share of the bandwidth causing delays,
Hardware in the datacenter behaving unpredictably, especially as the price of these components is driven down .
Increasing dependencies on other services (e.g. Zookeeper or DNS) that can cause cascading wonkiness due to timeouts.
All of these challenges compound in large, complex, and intertwined environments.
I want to emphasize that I see this as a great thing. While, in my dreams, it would be great to have a dedicated freeway lane for my commute home from the office, it’s not really practical. Instead, we share freeway lanes in spite of the inherent risk of delay in an open queuing network that allows lots of incoming traffic without any constraints. Not many of us can build a freeway dedicated to our personal use, hence we share. Cloud computing is sharing.
The same is true in real life. If you live in a small town with a walking commute home, you can predictably walk in the front door at 6:23pm every evening. As you move to the big city with elevators, parking garages, and freeways to traverse, it’s harder to precisely time your arrival.
Open Questions to Ponder
How can we compensate for these issues of metastability in cloud environments? Does adding resources help?
Socially, humans have a tendency to over-consume resources until there’s a problem. Will we always succumb to the tragedy of the commons  and suck up all slack until we are metastable?
What algorithms can give us both crisp and clear linearizable updates AND rapid latency 99.9% of the time? How about 99.999% of the time? How do these respond when the environment is under stress?
How can we evolve away from algorithms that are sensitive to metastability into a world where we tolerate and dampen most metastability?