Dialing In on Availability
Availability is a measure of how often you can ask for work and be happy with its completion.
When I was a kid in the 1960s, we had phones connected to the wall by wires. Fortunately, my parents had a long cord in the kitchen so you could sit at the table and talk over the wire to the wall. Furthermore, they always worked! When you picked up the phone, you got a dial tone [1], punched in a phone number, and chatted with your girlfriend. Rarely, you got disconnected. If you did, you immediately got another dial tone. Quickly inputting the phone number got you back to her.
I got my first cell phone in 1996 when I was 40 years old. Cell phones were simultaneously a miracle and a royal pain in the butt. While you could take them almost anywhere, they would drop calls and, worse, not let you place calls.
With a phone on the wall, the dial tone seemed to be always there. When the power to the house dropped, the phone company’s private power meant the phone had a dial tone. In my first 40 years before getting a cell phone, I can’t remember the phone on the wall not offering a dial tone.
I’ve seen various quotes about the availability of the phone network (connected to the wall). Perhaps, seven-9’s (99.99999% or 3 seconds per year outage) but that may be apocryphal. I don’t know. What I do know is that the definition of availability was the opportunity to dial another call.
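For what it's worth, the arithmetic behind "3 seconds per year" is easy to check. Here is a minimal sketch, assuming a 365-day year and treating availability as a simple fraction of wall-clock time; the function name is just for illustration.

```python
# Seconds in a 365-day year (leap years ignored for simplicity).
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000


def downtime_seconds_per_year(nines: int) -> float:
    """Allowed outage per year for an availability of `nines` nines.

    For example, 5 nines means 99.999% available, so 0.001% unavailable.
    """
    unavailability = 10.0 ** (-nines)
    return SECONDS_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 5, 7):
        print(f"{n} nines: {downtime_seconds_per_year(n):.2f} seconds per year")
```

Seven 9's comes out to a little over 3 seconds per year, and five 9's to a bit more than 5 minutes.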
In a transactional database system, it’s OK for transactions to spontaneously fail. Lock conflicts or other problems can result in the work being discarded. You don’t want this to be common but once in a while is OK. After hearing about an abort, the application can restart the work and hopefully succeed. In a distributed database, the system may reroute the restarted transaction to a different database server. That’s cool.
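The restart loop might look something like the sketch below. Everything named here is hypothetical: `TransactionAborted` stands in for whatever abort signal your database client raises, and `run_with_retry` and `make_flaky` are made up for illustration. The work passed in must be safe to run again from the top.

```python
import random
import time


class TransactionAborted(Exception):
    """Hypothetical signal that the system discarded the work (say, a lock conflict)."""


def run_with_retry(work, max_attempts=5, base_delay=0.01):
    """Call `work()` until it succeeds, restarting after each abort.

    Re-raises TransactionAborted once `max_attempts` tries have all aborted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return work()
        except TransactionAborted:
            if attempt == max_attempts:
                raise
            # Back off a little, with jitter, so competing retries spread out.
            time.sleep(base_delay * attempt * random.random())


def make_flaky(failures):
    """Build a pretend transaction that aborts `failures` times, then succeeds."""
    state = {"calls": 0}

    def work():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransactionAborted("lock conflict")
        return state["calls"]  # the attempt number that finally succeeded

    return work


if __name__ == "__main__":
    print(run_with_retry(make_flaky(2)))
```

From the caller's side, the two aborts are invisible. The work just took a little longer, much like redialing after a dropped call.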
It's important in our big complex environments to measure and improve our availability. We sometimes trade off availability for other features such as mobility and pretty dancing videos that we can access during a phone call. These usually work but sometimes fail. Still, we define availability as our ability to try again.
Transactional systems are much harder to make available than systems allowing you to fill a shopping cart, search the web, or connect two phones. In those systems, a pretty good answer is, well, pretty darned good. In the phone system, it is essential to reconnect you to the same girlfriend when redialing the phone number. You and your girlfriend needed to remember the state of the conversation and restart the business process of getting her to agree to a date on Saturday night. That is a higher-level form of availability.
Today, I carry a cell phone. The convenience and mobility are more important to me than availability. Sometimes, I have to retain more state in my brain pending my ability to dial again. This is getting more challenging as I get older.
I’m just trying to establish the correct tone for any discussion about transactional availability and its interaction with the business task at hand.
Pat
[1] Many readers may be too young to remember the dial tone. It was a noise over the phone that told you that, if you punched in a phone number, you had a good chance of connecting to the other phone.
OPEN QUESTIONS:
What does availability mean to YOUR solution? Is it OK to have a dial tone but get a garbled connection that you can’t understand? What requirements on the connection’s quality do you impose before you start counting the wall-clock time?
Does state within the service count as you track availability? Large online retail sites know very well that if you deny access to a user’s shopping cart, they will leave and do the dishes. If you give them an empty cart (due to an internal problem), they will be surprised and start filling the cart again. How is availability counted when these things happen?
Fundamentally, if availability measures the time until you can start again, what are the expectations behind “you can start again”?
Great framing of perspective. We've been leveraging the 5-9's definition from the ITU for decades, but most segments of technology bring it into their own backyard in slightly different shapes. Dialtone was, and is, the ITU definition of 5-9's (and even that has exceptions).
Some segments measure on the ability to make a request (dial-tone), some on request completion (concluding that conversation with your girlfriend, uninterrupted), yet others focus on the ability to complete the conversation within a set amount of time. There are many others, but each and every one of them has implied assumptions around the role of the service and the ecosystem it resides in (maturity, complexity, criticality, and so on).