Categories
How It Is

Failure of Imagination

It’s an accepted fact that all code will, at some point in time, break. The Open Source community is plagued with this problem, but it’s not endemic just of them. Software aging and rare conditions cause failure in all software.

This is not an excuse.

People make mistakes, and we all accept that. But what I find astounding is that a lot of users look at the software developers and say things like “I find it unacceptable that you let this critical error slip through.” They seem to think that anything less than a perfect piece of software coded by perfect people with no errors is cause to demean the programmers.

The other day my father (a seasoned risk analyst – see http://woody.com) passed me an article by Herbert Hecht called Rare Conditions – An Important Cause of Failures, where in Hecht explains that “rarely executed code has a much higher failure rate (expressed in execution time) than frequently executed code during the early operational period.” The point, for those of you who felt your eyes glass over at the big words, is that the less often a piece of code is used, the more likely it is to break.

I read this article and immediately shouted “Yes! Exactly what I’ve been saying!” And it’s not because we don’t try to write the best code we can, either. It’s because most of us use conventional test case preparation. We say ‘This is what we want the code to do, we shall test it and see if it works. Great! Now what if I did this…’ The problem there is you need a skilled person coming up with the ‘what if.’

Back in my desktop-software support days, I was testing a piece of software I’d never used before, and I crashed the program. Hard. I had to reboot. And I found I could crash it repeatedly doing that. I called the vendor, who sent their best techs out to look at it. I showed him what I did and he started laughing. What I was doing was something no one familiar with the software, and it’s purpose, would do, because it was simply wrong. Like putting in your phone number in the field for your first name. They agreed that they should error trap it, however, and it certainly shouldn’t crash the system.

Why didn’t that problem ever show up in their tests? To quote Frank Borman on the Apollo 1 fire “Failure of imagination.” They simply couldn’t conceive of a world where in someone could be that ignorant of the right way to do things. They didn’t document it, because it wasn’t a requirement of their software, but of the process being completed, and they didn’t error-trap it because no one in their right mind would do that.

The world keeps making bigger and bigger fools, doesn’t it? When you test code, you’re always going to have a bias. You’re looking at it from the perspective of someone familiar with both the program and its purpose. When we test WordPress (and I include me on this as I beta test WordPress and file trac reports when I find problems) we test it from the perspective of experienced WordPress users. We’re the people who read the documentation on what changed. We’re familiar. And that’s the problem.

So we look at Hecht’s second suggestion. “Random testing over a data set that is rich in opportunities for multiple rare conditions.” Basically it’s making a list of everything that could go wrong, the really wild and rare errors you’ve seen (in the case of WordPress, you could probably cull some great ones from the forums, at work I review my trouble tickets and make a list of the most common), and testing that. Testing stuff you KNOW will break. Again, this has the problem of bias, but it allows you to make sure when your code fails, it fails elegantly. This kind of testing has the other problem of finding the right data set. This really is the hardest part, and takes some seriously dedicated people to come up with one that limits bias.

Hecht’s final suggestion is “Path testing, particularly where semantic analysis is used to eliminate infeasible paths” but he quickly points out the problems:

[This] technique can be automated and is the only one for which an objective completeness of test criterion can be identified. However, it is costly and will practically be restricted to the most essential portions of a program.

So what can we do, other than be smarter? We can test better, certainly, but that’s more difficult than it should be. And why does all this happen in the first place? We’re smart people! We should code better! In a related article (also sent by my father) by David Lorge Parnas discusses Software Aging. In that paper, he posits that too often we concentrate on getting the first release done, rather than looking at the long term goals of the software.

This article I find has particular relevance to the open source community, which is filled with people who become software ‘engineers’ via non-traditional paths. How non-traditional? In Parnas’ article (dated 1994) he mentions that software designers come either from the classic computer science background or they’re home-grown from their business specialty. This means that a person who writes code for a bank either is a programmer who knows their coding languages and the basics of how to think ‘in computer’, or it’s a bank employee who picks up code and learns it on the job. There are, obvious, drawbacks to both backgrounds. A banker understands the desired end functionality of the program, and the CompSci guy understands who to write it, but not how it’s used. They are, both, too specialized.

I inherited some code that was undocumented and had the problem of inelegant failures on rarely run processes. Over the last five years, I’ve steadfastly cleaned this up to error-trap correctly and output meaningful errors. I don’t have a CompSci degree. Actually, I don’t have a degree at all, in anything, and while I never declared my major, it was Anthropology (by virtue of the courses I’d taken). As luck would have it, I’m also somewhat ignorant of the purpose of most of the code I write. This means every time I’m tasked with a new project, I look at it with fresh eyes, and I can see the flaws people in their little boxes are unable to see. I’m perpetually on the outside, which means my perspective is almost always ignorant, but rarely is it unintelligent. On multiple occasions, my simple question of ‘Why are we doing it this way? Is it to make it easier on the end user or the programmer?’ elicits astounding reactions. I can help pull the programmers out of their heads and look at the long-view.

Parnas calls that ‘designing for change’ (a familiar ad slogan in the 1970s). Part of the problem, he thinks, is that people don’t have the appropriate education to their job and, thus, are untutored in the basics of programing and the related thought process. I disagree, and not just because I am inappropriately educated to my day job. It’s true my ‘training’ didn’t prepare me to write code, but my education did. I was taught, from a very early age, to think and reason and question. Those basic principals are why I’m able to do my job and follow along with Parnas’ work. Perhaps he would be better put to say people have not made the effort to learn the basic groundings of good software design, independent of your education and ‘purpose.’ The banker can learn software design irregardless of the intent of his code, and the CompSci programmer can master enough of banking to understand the purpose of what he writes. The middle ground that has the view of goal and design is what allows us to design for the future and write code that can grow and age.

That all depends on if the code is well documented. Parnas’ rightly twigs to the most common problem I’ve seen in programing. No one likes to write documentation, and when we do, it’s clear as mud to someone who is unfamiliar with our goal and design. If you don’t the goal of the programmers when they came up with Drupal (to pick on someone else for a change), then none of the documentation will help you. Software, being based on mathematics, should have documentation that reflects its parentage, says Parnas. This should not be confused with user documentation, which should have none of that. Developer documentation should resemble mathematical documentation if it has any hope of being useful and lasting.

While Parnas’ paper was written in 1994, I wonder if the problem of crossover between developer and user was as prevalent as it is today. Today there’s no clear line between the developer who writes the code and the end user who wants to use it. This is the case most noticeably in open source projects, like WordPress, Drupal, Joomla, and so on. These projects are championed by the developer/user, a creature that may not have existed as such as widespread phenomenon 17 years ago. While Parnas does mention the possibility in use-cases, he only does it to highlight the problem of isolation among developer groups, and not as a potential root cause to why problems are missed. They are missed in the isolated groups because we cannot see outside our ‘box’, for lack of a better room, and envision that particular ‘what if.’

This is why we have a great need for reviews. When a doctor tells you that you have cancer and you’re going to die, you seek out a second opinion. The same is done with software. The code works for you, so you let it out into the world for other people to test. We need the outside sources to come and bang on the rocks and tell us what’s wrong. That’s why you see people like @JaneForShort asking more people to join the WordPress beta tests. She knows the core developers can’t possibly test everything, and the more use-cases we can come up with, the more we can make the end result better.

Would having more professionals solve this problem? Parnas seems to think so. He thinks that having more people trained in engineering and science combined will produce better programmers. After all, what is coding but math applied to engineering? I feel it’s the techniques that are more important. A grounding in basic algebra (and some calculus, certainly) should be enough to be able to program in most languages. And a well-formed understanding of the disciplines of engineering should allow a person to look past ‘This is my problem, how do I solve it?’ You need innovation, understanding, and a respect for how things work in order to write effective programs.

Why did we miss a critical error? Because we didn’t see it. It’s always going to remain that simple. With better education, will we be able to see it in the future? Perhaps. But all the traditional book education in the world cannot teach a person how to think. Even if we can perfect the creation of the well-thinking human, we will always be loosing a battle against the universe creating bigger fools. But those thinking people can find the problem, once it’s reported, solve it, and learn to make it better the next time.

And that should be our goal.

Make the code better, test it smarter, document it clearly, and plan it thoughtfully.