A criticality accident is an accidental, uncontrolled nuclear chain reaction. It lets loose a surge of radiation that can kill anyone nearby. This is the kind of failure we associate with Chernobyl, Three Mile Island, Fukushima, and many other places. To date, twenty-two criticality accidents have occurred outside nuclear reactors (some resulting in deaths), but thus far, none have resulted in explosions.
When we look at the death of Louis Slotin, we think ‘God, how did we not know that was dangerous?’ When we regard the Trinity Test, we think ‘How did we not know we were unleashing hell on earth?’ The fact is that we cannot see the future, and we cannot predict how far we will go. So we certainly cannot see that we have gone too far until, indeed, we have gone too far. We cannot divine and magically know the unknown, though that doesn’t mean we’re in complete ignorance of the possibility.
There is always a possibility that things can go terribly wrong, in the worst way possible. In a nuclear power plant (NPP), a meltdown is obviously up there at the top of the ‘worst outcome’ list. Did the scientists know there was a possibility this could happen? Of course they did. Did they know that it might leak into the ocean, pollute the land, and expose the workers who stayed behind at the plant to doses of radiation that could kill them? Again, of course they did.
Before any NPP is built, the designers go over the risks and mitigate them as best they can: they review every known risk and solve as many as possible. But there comes a point where someone will correctly state, “We’ve thought up ways to solve every problem we can come up with. Now we need a plan for when the unexpected arises.” Oh yes, they have a plan for that sort of thing too, but it’s probably really basic.
I don’t work in NPPs; I work for a bank. We sit around and discuss things like ‘If the city of Chicago is destroyed tomorrow, how would we make sure that everyone can still get to their money?’ Given that your money, like mine, is pretty much virtual and stored on computers, we do that through data integrity: we make sure our data is safe, secure, and backed up in multiple places. We have multiple data centers across the state protecting your money. What about the software? It’s written to talk to those data centers, so how do we compensate if one of them vanishes? The problem with those meetings is that people want to know specifics, and I always point out, ‘Give me a specific situation and I will give you specific steps. But since every situation is different…’ Because the answer to ‘What do we do if our Chicago servers vanish?’ is ‘Route everything to this other location.’ See how basic that is?
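To show just how basic that plan is, here is a minimal sketch of the ‘route everything to this other location’ logic in Python. Everything in it is hypothetical: the data center names, the check_health stand-in, the route_traffic helper. None of it is what a real bank actually runs; the point is only that the fallback itself is a dozen lines, while the hard part is the situation nobody imagined.

```python
# A minimal sketch of the "route everything to this other location" plan.
# Every name here (the data center list, check_health, route_traffic) is a
# hypothetical stand-in, not anything a real bank actually runs.

DATA_CENTERS = ["chicago", "springfield", "peoria"]  # priority order, made up

def check_health(dc: str) -> bool:
    """Pretend health check; a real one would ping the site's load balancer."""
    return dc != "chicago"  # simulate the Chicago servers vanishing

def route_traffic() -> str:
    """Send traffic to the first data center that still answers."""
    for dc in DATA_CENTERS:
        if check_health(dc):
            return dc
    raise RuntimeError("No healthy data center left -- the truly unplanned case")

print("Routing everything to:", route_traffic())
```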
The problem with all this is that we can only plan for what we can imagine, and we can’t imagine past our own abilities. Should we have seen the possibility of someone flying a plane into the World Trade Center? Of course! We should have always thought ‘Hey, this nice big skyscraper sure is an easy target for someone really pissed off!’ But the probability of that happening was so low that we never came up with plans for how to handle it. A criticality incident happens at the point when we realize what we feel we should have known all along, but couldn’t possibly have known, because we are not omniscient. We are not perfect and we cannot know everything.
In the case of a nuclear power plant, when all hell breaks loose, people die. Even today, we know that the radiation leaking out is bad for us, for the environment, the water, and the animals, but we don’t know how bad. We cannot possibly know. We can guess and infer and hypothesize, but we do not know. And the only way to know is to experiment. If it doesn’t scare the pants off you to realize that all innovation comes from experiments that could kill us all, well, then you’re probably not aware of the Large Hadron Collider and how we all joked that it would open up a black hole and kill us all.
Innovation takes risk. It takes huge risks. The people who take the risks, like Louis Slotin, know that things can happen. They know that irradiating themselves to kingdom come ends with a slow and painful death, and not becoming Dr. Manhattan. We won’t become Spider-Man or any sort of godlike superhero. We. Will. Die. And we know it. And we do not stop. We cannot stop. The only way to get better, to make it safer, is to keep trying and keep dying.
Not to be too heavy-handed, but with our code it’s the same thing. We cannot see where ‘too far’ lies in our code, where the danger is, until it hits us in the face. We will destroy our programs over and over, we will crash our servers and infuriate our customers, but we will pick up the pieces, learn, and make it better the next time. This is human nature; this is human spirit and endeavor. We cannot fear failure, even if it brings death. For most of us, the worst it can bring is being fired, and even that is not all that common. I’ve found that if you step up and accept responsibility for your actions, you get chastised, warned, and you keep your job.
When everything goes bad, it’s easy to point a finger and blame people. That’s what people do. They complain that the programmers suck and didn’t test enough, that the testers didn’t do their job, that everyone is terrible and did this just to piss them off. They rarely stop and ask ‘What did I do?’ They rarely say thank you, and they rarely learn from the experience of failure. Thankfully, our failures will not end in death. Lost money, certainly, and a great inconvenience to everything in your life, but you learn from failure far better than you can learn from anything else.
Learning from extreme failure is not easy. It’s hard to get past that initial moment of absolute terror. It’s harder still to teach the end users (clients, readers, whatever) that this is okay, that this is normal, that it happens to everyone, everywhere, every time. But if we cannot learn from failure, we’ll never have the courage to create again.
Get messy. Make mistakes.