How It Is

Forget 100%

Please don’t tell me about how your service is 99.999% reliable. Tell me all about how you’re prepared to handle the unexpected.

Can I tell you a secret?

Kids doing karateI hate the five nines. The Six Sigma Stigma has me wishing that everyone who tells me they’re a ‘black belt’ please die in a fire. It’s not that I don’t think that the process can work for some people, or that it’s useless as a whole, but that I think too many people treat it like an MBA. “I did this thing for a few months, I am now an expert.” I had a bunch of coworkers who did that. I hated them. I got to the point that if you said “We need five-nine reliability” I had a Pavlovian reaction that involved me rolling me eyes and tuning out.

Now this doesn’t mean than I don’t think 100% ‘uptime’ in anything is a nice goal, but I see it as a lofty goal. Look, you know this stuff already. I will not walk down the stairs successfully 100% of the time. I will not have the next key I strike on my keyboard function as I expect 100% of the time.

So moving this off and saying “I don’t need 100% uptime, I need 99.999% uptime.” doesn’t actually change anything. In fact, I’m willing to bet that a lot of people look at the five 9s and think that it’s so close to 100% that they should never see or notice an outage. Thanks, Six-Sigma people, you just made 99.999% synonymous with 100%.

That wasn’t the secret, though.

My secret is that I don’t care about 100% uptime in anything.

I worked for too long in deployment to understand that there is no such thing as 100% uptime for anything. There are ways to minimize and mitigate downtime, and there are ways to make sure it causes as little impact as humanly possible, but there’s no way to avoid it. Ever reboot your computer? Of course! Ever upgrade WordPress? You have then experienced downtime. It’s a nature of life. I expect it, I don’t sweat it.

So if I don’t care about 100% uptime, what do I care about?

Reliability, accountability, responsibility, and timeliness.

I was reaching for the words that end in “ibl”, but really the root ofable is my concern above all else. Are they able to handle it when things go pear shaped? Are they able to fix problems quickly, correctly, and efficiently? Are they able to prevent the exact same error from happening again? Are they able to own up to their mistakes?

Two wood models fighting over moneyI don’t expect anyone to do all that 100% of the time, but I expect them to care about the things that are important to them as an entity. My webhost should care about the severs not being on fire and serving up webpages. My bank should care that my money is safe and available. My government should care that it’s … Too soon? Anyway, the point is that you should care about what you do, and provide the best service you can. Now, if 50% uptime is your best, maybe I’ll look for someone else. I am reasonable about these things. If email goes down, how fast did you get it back up? But to me 50% isn’t reliable unless I’m looking for something that, intentionally, only works half the time.

All this said, I don’t actually look at the uptime numbers all that much, unless I feel that the reliability is sub-par. The actual numbers, the metrics, the absolute “This service is up 100% of the time or your money back” is not really what I count on. My friend Pippin said it wisely the other day:

I have far more faith in a company that encounters occasional problems but responds incredibly promptly than one that has fewer issues but doesn’t respond half as well

He happened to be talking about a brief (like 10 minute) outage on his site when all databases were inaccessible.

Don’t bank on the percentage, bank on the ability to react and come back.