Half-Elf on Tech

Thoughts From a Professional Lesbian

    Risk Theater and Open Source Testing

    [Image: Audits Are Fun!]

    We make multiple test environments and platforms, testing with hundreds of users.  We perform stress tests (how much traffic can my site take?) and have an obscene number of checks and balances to ensure that only good code makes it into the final product.  We have teams who question every change, asking “Do you need this?” or “What’s this function for?”  We audit every update process and ensure that our work is as good as we can make it.  This is all done, we say, to reduce our risk.  Our software, we insist, will be better if we do all these things.

    But the failure rate has not dropped.

    Initially, when a product is released, there’s a spike of failures.  We didn’t anticipate that, or this didn’t work like we expected it to.  Those are not classified as ‘failures’ but as kinks to be ironed out.  Six or seven months down the line, we’ve released another set of iterations to fix the worst offenders, and our failure rate drops to a comfortable level where, most of the time, everything’s fine.

    What if I told you that one in five IT projects was a success? (Source: Statistics over IT Failure Rate)

    What if I told you that all your myriad checks, balances, testing, forms and CYA dances didn’t make anything less risky?

    What if I told you it was all Risk Theater.

    Of course you can do things in a less risky way.  If given the choice between dismantling a bomb in a nice quiet room, where you have all the time in the world and a blast shield, or doing it in the back of a van while being shot at, with only 30 seconds on the clock, everyone would point at that room and say ‘Less risky!’  And they’d be right.  The problem with risk is that there are often, if not always, external forces that perpetuate it.

    We have to ask ourselves “What is risk?”  We can look at it mathematically.  Risk = {⟨sᵢ, λᵢ, xᵢ⟩} – a set of triplets: what can happen (sᵢ), how likely it is (λᵢ), and what the consequences are (xᵢ) – and most of us still have no idea what that means.  Risk is not a magical number that says “Defusing a bomb is this risky.”  Determining risk is how we discern how likely something is to happen and, from that, the likelihood of an unwelcome outcome.
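
    To make that set-of-triplets notation a little more concrete, here’s a minimal sketch in Python – the scenarios and all the numbers are invented for illustration.  Each entry answers three questions: what can happen (sᵢ), how likely is it (λᵢ), and what are the consequences (xᵢ)?  Notice that no single ‘risk number’ falls out of it:

        from dataclasses import dataclass

        @dataclass
        class RiskTriplet:
            scenario: str      # s_i: what can go wrong?
            likelihood: float  # lambda_i: how often we expect it (per year, say)
            consequence: str   # x_i: what happens if it does

        # A risk profile is the whole set of triplets, not one number.
        # All entries below are invented examples.
        risk_profile = [
            RiskTriplet("bad config pushed to every server", 0.5, "site down for an hour"),
            RiskTriplet("database corrupted during upgrade", 0.01, "days of data lost"),
            RiskTriplet("typo in a template", 5.0, "one ugly page"),
        ]

        for t in risk_profile:
            print(f"{t.scenario}: ~{t.likelihood}/year -> {t.consequence}")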

    Too often risk is defined as risk = likelihood × consequence, with safety = 1 − risk.

    This can misinform: acceptable risk is a consideration of likelihood AND consequence, not a simple multiplication with safety as the additive inverse of risk.  Acceptable risk and safety are normative notions, changing with situations and expectations, and must be assessed accordingly. (Source: Woody’s Perspective – by Steven A. “Woody” Epstein)
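
    A toy calculation (invented numbers, nothing more) shows how the multiplication misleads: a frequent nuisance and a rare catastrophe can land on exactly the same ‘risk score’, even though nobody who has to live with the outcomes would treat them as interchangeable:

        # Two very different risks with the same likelihood * consequence product.
        nuisance = {"likelihood": 1000.0, "consequence": 0.001}  # constant, trivial
        disaster = {"likelihood": 0.001, "consequence": 1000.0}  # rare, catastrophic

        for name, r in (("nuisance", nuisance), ("disaster", disaster)):
            score = r["likelihood"] * r["consequence"]
            print(f"{name}: risk score = {score}")  # both print 1.0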

    Risk analysis, for all that it’s a mathematical discipline, is just that.  A discipline.  That means the numbers matter far less than you think they do, and if all you do is look at the numbers and say “But we’ve predicted five nines of uptime!” then you’re ignorant of the truth.  (Five nines here refers to the claim people make of providing 99.99999% uptime.  The five 9s after the decimal point are feel-good numbers.)

    The trick to it all is that variation is something computers are phenomenally bad at handling.  Look at your computer.  It’s what can best be described as a ‘brittle’ system.  If you throw something it’s never seen before at it, it tends to react poorly, because unlike the human brain, it can’t adapt or improvise.  It can’t know “Oh, you meant ‘yes’ when you typed ‘yea’” unless some programmer has put in a catch for that.  On some systems, it may not even know the difference between an uppercase Y and a lowercase y.
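
    To see that brittleness in miniature, here’s a hypothetical confirmation prompt in Python.  The strict version only knows the one literal string a programmer anticipated; every ‘yea’ and uppercase Y the tolerant version survives is a catch somebody had to think of and write in by hand:

        def strict_confirm(answer: str) -> bool:
            # Brittle: anything but the literal string "yes" is a no,
            # including "Yes", "y", and "yea".
            return answer == "yes"

        def tolerant_confirm(answer: str) -> bool:
            # Only as adaptable as the catches a programmer thought to add.
            return answer.strip().lower() in {"yes", "y", "yea", "yeah", "yep"}

        print(strict_confirm("Yea"))    # False - the computer has no idea
        print(tolerant_confirm("Yea"))  # True - because someone predicted it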

    [Image: Variation in leaves]

    Variation is nature.  It’s reality.  It’ll never go away, either.  The point of risk analysis is not to come up with a number that says ‘By doing foo, we are x% less risky.’  The point is to look at the system and understand it better.  The point is to learn.  The act of explaining and defining the process, whatever it is, from changing a tire to pushing software to a few hundred servers, is what makes a process less risky.  You understand what it is you’re doing, and you can explain it to someone so they too can understand it, and now you know what you’re doing.  The numbers will come, but they’ll also change over time due to variation.

    We mitigate our risk by understanding, testing and documenting.  But just as you can never have 100% uptime on a system (you have to upgrade it at some point), you cannot excise risk entirely.  On the other hand, we cannot ignore the need for testing.
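
    The arithmetic behind those uptime claims is worth running yourself (a quick sketch; the percentages are the usual marketing tiers).  Even the 99.99999% claim leaves a budget of about three seconds of downtime a year – less than a single restart takes:

        # Downtime per year implied by an uptime percentage.
        MINUTES_PER_YEAR = 365 * 24 * 60

        for claim in ("99%", "99.9%", "99.999%", "99.99999%"):
            uptime = float(claim.rstrip("%")) / 100
            downtime = (1 - uptime) * MINUTES_PER_YEAR
            print(f"{claim:>9} uptime -> {downtime:9.3f} minutes of downtime per year")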

    A woman named Lisa Norris died due to a software error, caused by a lack of testing.  All the safety checks, the manual monitoring, and brainpower failed because the automated system wasn’t tested.  Prior to the automated system going online, the old way was for people to manually transcribe medical dosages.  This was felt to be ‘high risk’ because there was a chance of transcription error.  However, nowhere in the incident report were any ‘manual errors’ noted prior to the automated system being used.  We can assume, then, that any manual errors (i.e. transcription errors, the very risk the system was meant to eliminate) were caught in flight and corrected.  The automated system does not appear to have ever been tested with ‘real world’ scenarios (no documentation to that effect was found by anyone investigating the situation).  If they had run simulations, testing with data from the previous, manual system, they might have found the errors that led to a woman’s death. (Source: Lisa Norris’ Death by Software Changes – by Steven A. “Woody” Epstein)
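
    The kind of simulation that was never documented isn’t exotic.  Here’s a hedged sketch of a replay test – the calculate_dose function and the record format are invented for illustration, not anything from the real system: you feed the new automated path the same inputs the manual process already handled, and flag every disagreement before go-live:

        # Hypothetical replay test: run historical, manually-verified cases
        # through the new automated calculation and flag every mismatch.
        # calculate_dose() and the record format are invented for illustration.

        def calculate_dose(prescribed_total: float, fractions: int) -> float:
            return prescribed_total / fractions  # stand-in for the new automated logic

        historical_cases = [
            # (prescribed total, fractions, dose the manual process arrived at)
            (3500.0, 20, 175.0),
            (4000.0, 16, 250.0),
        ]

        mismatches = []
        for prescribed, fractions, manual_dose in historical_cases:
            automated = calculate_dose(prescribed, fractions)
            if abs(automated - manual_dose) > 1e-6:
                mismatches.append((prescribed, fractions, manual_dose, automated))

        assert not mismatches, f"automated system disagrees with history: {mismatches}"
        print(f"all {len(historical_cases)} historical cases replayed cleanly")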

    There remains a possibility, however, that even with all the testing in the world, the error that led to Ms. Norris’ death would have been missed.  So how do we make testing better?  As long as we’re only testing for the sake of testing (i.e. it’s expected, so we do it), or we follow the standard test plan, we miss the point of dry testing.  Even people who stick by their rigid test scripts are missing the point.

    Open Source software, however, gets the point.

    [Image: Monkeys sans keyboards]

    You see, we know we can’t test everything, and we know that we’re going to miss that one variation: code that works on ninety-nine servers will fail on the one with a tiny difference.  And yet, if a million monkeys banging on a million keyboards could write Hamlet, then why can’t they fix software?  They can help cure AIDS, we know.  Crowdsourcing knowledge means that you let the monkeys bang on your data and test it in ways you never imagined it being used.  No longer driven by a salary (and that really does lock your brain in weird ways), the monkeys (and I’m one of them) cheerfully set up rigs where we can roll back quickly if things break, and start just using the iterations of software, coming up with weird errors in peculiar situations.
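
    Inside a test suite, the closest thing to a million monkeys is property-based testing.  Here’s a minimal sketch using the Hypothesis library (the slugify function is an invented example): instead of scripting the inputs you expect, you state a property that must always hold and let the tool hammer the code with inputs you never imagined:

        import re
        from hypothesis import given, strategies as st

        def slugify(title: str) -> str:
            # Invented example: turn a post title into a URL slug.
            slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
            return slug.strip("-")

        @given(st.text())  # Hypothesis generates hundreds of weird, monkey-grade titles
        def test_slug_stays_url_safe(title):
            slug = slugify(title)
            assert re.fullmatch(r"[a-z0-9-]*", slug)
            assert not slug.startswith("-") and not slug.endswith("-")

        test_slug_stays_url_safe()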

    We always talk about how we want to lower the bar and make products more accessible to more people.  Make it easier for them to use.  In order to sustain that model, we need to embrace the inherent risk of software and teach the users how to correctly perform basic troubleshooting and report real errors.  Too often we write our code in a vacuum, test it in a limited fashion, and release it into the wild knowing there will be a second release to fix things.  As development matures, we push out more changes more often, small changes, so people are eased into the new thing.  We step out of isolation and embrace the variations of how our product will be used.

    Now we need to get our users to step out of their isolation and join the monkeys.  We can’t make things better for everyone unless everyone is a part of the improvement process.  We must ease these users into understanding that every software product is ‘in progress’, just like we taught them to accept that all webpages are perpetually ‘under construction.’  Our dry tests will never be complete until we can determine how to safely bring them in.  Maybe we make smaller changes every day, like Google does with Chrome, such that they never notice.  Or maybe we ask them to ‘check the box to join our test group and get cool features before everyone else!’  But we must do it, or we will fall behind in giving the users what they want, and giving them a solid, safe, secure product.
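
    Mechanically, that checkbox is tiny.  Here’s a minimal sketch (all the names are mine, not any particular product’s) of gating a new feature to beta volunteers plus a small random slice of everyone else, with an instant way to roll back:

        import hashlib

        ROLLOUT_PERCENT = 5  # dial up slowly; set to 0 to roll back instantly

        def sees_new_feature(user_id: str, opted_in: bool) -> bool:
            if opted_in:  # "check the box to join our test group!"
                return True
            # Deterministic bucketing: a user always lands in the same slice,
            # so the rollout grows smoothly as the percentage is raised.
            bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
            return bucket < ROLLOUT_PERCENT

        print(sees_new_feature("beta-volunteer", opted_in=True))   # True
        print(sees_new_feature("everyone-else", opted_in=False))   # in the slice or not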

    Until then, we’re not analyzing or assessing actual risk, we’re merely players in risk theater.