One of the biggest things I learned working for a bank was that there is never a good time for an outage.
Take, for example, my ongoing argument about how we had to reboot servers with as little downtime as possible during our deployment process. This is pretty simple, right? If you have to reboot something, it has to be turned off for a little while. Now with a clustered, distributed setup like ours, with ten servers in each of three locations (thirty servers total), we would deploy to every server in location A, reboot, then B, and so on. Naturally this caused an outage in each location, and we always had different ideas about how to handle it.
One option was to put up a ‘Sorry’ page saying the service was offline, and then push to all three locations at once. That minimized downtime to about thirty minutes on average. Step one was pushing the zipped-up code to the servers, step two was the sorry page, step three was gently disconnecting inflight traffic without losing any transactions, and step four was unzipping the new files. Step five, if needed, was reboots, and step six was removing the sorry page. Pretty fast, right? The downside was that there was an outage, and if we hit a problem it would take longer to fix.
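The six steps above can be sketched in code. This is a hypothetical illustration, not our actual tooling; the location names, the `log` list, and every step function are invented, and real steps like draining traffic are reduced to log entries so the ordering is visible.

```python
# Hypothetical sketch of the all-at-once "sorry page" deployment.
# Locations and step names are invented for illustration only.

LOCATIONS = ["A", "B", "C"]
log = []  # records each step so the order is visible

def deploy_all_at_once(needs_reboot=False):
    """Run the six steps against every location at once."""
    for loc in LOCATIONS:              # step 1: stage the zipped code
        log.append(f"push-zip:{loc}")
    log.append("sorry-page:up")        # step 2: show the outage page
    for loc in LOCATIONS:              # step 3: gently drain inflight traffic
        log.append(f"drain:{loc}")
    for loc in LOCATIONS:              # step 4: unzip the new build
        log.append(f"unzip:{loc}")
    if needs_reboot:                   # step 5: reboot only if required
        for loc in LOCATIONS:
            log.append(f"reboot:{loc}")
    log.append("sorry-page:down")      # step 6: restore service

deploy_all_at_once(needs_reboot=True)
```

The key property is that the sorry page goes up before any traffic is drained and comes down only after every location is upgraded, which is exactly why the outage window is short but unavoidable.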
The other main option was to turn off location A, shunt all active users to B and C, upgrade, and repeat. This took longer, usually around ninety minutes, but no service outage, right? Right…? Nope! The problem with shunting users was that we had to wait until their transactions were done before we could redirect them, which meant the servers at B and C each handled a sixth more traffic. Then, when A came back online and we sent traffic to it, it had more to absorb than we had moved off it, so that took longer too. Oh, and now A has new code, which B and C do not, so we have two versions of the service running at once, and we can’t dynamically flip people around. We have to check service versions before our round-robin will work.
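The version-check problem at the end of that paragraph can be sketched as a version-aware round-robin. This is a minimal, hypothetical sketch, not our actual load balancer: the location names, version strings, and `make_router` helper are all invented. It assumes a session pinned to one code version must only be routed to locations running that same version.

```python
# Hypothetical sketch: round-robin restricted to one service version,
# for the window where A runs new code and B/C still run the old code.
from itertools import cycle

SERVERS = {"A": "2.0", "B": "1.9", "C": "1.9"}  # location -> deployed version

def make_router(required_version):
    """Return a picker that round-robins only over matching locations."""
    eligible = [loc for loc, ver in SERVERS.items() if ver == required_version]
    if not eligible:
        raise RuntimeError(f"no servers running {required_version}")
    pool = cycle(eligible)
    return lambda: next(pool)

route_old = make_router("1.9")
# Old-version sessions bounce between B and C, never touching A.
```

The design point is simply that plain round-robin over all thirty servers is wrong mid-deployment; the rotation has to be filtered by version first, which is the extra check that made this option messier than it looked.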
Now, in both cases, the longest part of the process was usually the ‘gently disconnect all inflight traffic’ part. But you can see how it gets super messy really fast. The reason this was always a point of contention was that we really didn’t want our customers to have downtime, because there was absolutely never a good time for everyone. We picked Thursday nights at ten pm Central for our updates, since we were a Chicago-based company, but as time went on, we had to move some to Friday, which you’d think would work for a bank. After all, no one does business on weekends?
I hope you laughed a little at that.
As the Internet makes it more and more possible to work at any hour, we find that services need to be available at any hour. The whole idea that ‘business hours’ make things standard never really worked for everyone. I mean, if your company was 9-to-5, you had to do the brunt of your personal business on lunch and breaks. At least today, if I need to run a personal errand at 2pm, I can just leave and come back to finish my work. Can’t do that at a bank, though! People need to come there to get their work done, so you have to be there.
It’s really annoying, and it comes down to a simple fact: There’s never a good time to turn something off.
This came up recently when a friend asked what I was doing at work on December 24th and I replied “The usual. Answering tickets, making the Internet better, closing vulnerable plugins.” He was surprised and asked if I felt mean closing people’s plugins on Christmas Eve. This led to me teasing the hell out of him for nagging me one weekend with two emails, a slew of tweets, and a text, asking for help with a problem (the last email was, I kid you not, ‘never mind, I read your ebook!’). While I was annoyed then, he apologized and we’re still friends (if the teasing didn’t indicate that already).
But it did bring up something important to him. The time he had to work on his side project was weekends and nights. The time I had to provide random help was weekdays and maybe Sunday. For him, to have his friend and Multisite Resource not available was killing his ability to finish the project. Now, he freely admitted that banking everything on asking someone for free help was a terrible business model, and since then he’s stepped up, read the books, practiced on his own, and he’s now a pretty darn good admin for a network.
Still, on December 24th, he asked if there was a ‘worse’ time to close someone’s plugin. “Sure,” I replied glibly. “The day their mother died. The day their car broke down. The day their tech quit. The day of their product push. The day we upgrade WordPress….” He realized I had a lot of examples and conceded the point. There’s never a good day because we don’t know what’s going on in your life. We can’t know. I’m not actually psychic, after all.
Sometimes people like to complain that we don’t ‘run WordPress’ like a business. If we did, we’d never close a plugin on a Friday night, or on the eve of a major holiday, or without warning, or … You get the idea. And they’re partly right. If WordPress.org was a business, a lot of things would be different. We’d have an easier way to warn people and put shutdowns on a timer, or delete accounts, or help everyone who posts in the forum. But the mistake here is not ‘ours’ for making WordPress what it is, but yours for expecting a free, volunteer community to act like a company.
Cause we ain’t.
So while it’s never a good time to close a plugin, or reboot a server, or install new software for everyone, we’re going to have to do it at some time. An individual can’t be available 24/7 because we have to sleep, and we have other things to do. So accept that it’s just never a good time, fix it as soon as you can, and carry on.