People who follow me on Twitter know my frustration with my day job. It’s not that I hate my job, it’s that I hate when the rules get in the way of things. Over the last three years, we’ve grown from a simple ‘do this please’ directive to a behemoth of monitoring and oversight.
Here’s an example. We run standard installs at 3pm once a week. Tickets must have a start time of 3pm and an end time of 5pm. Thank you auditing. If they don’t, they must have secondary approval to give us the okay to go at a ‘non standard time.’
Now, there is a sane reason for this. We do the install at 3pm, but from 1 to 2pm, we do server maintenance, and from 5 to 9 we do the databases. So really, 3 to 5pm on that one day makes sense, right? We don’t want to run over or the database guys get mad, and we don’t go early cause the server guys get mad. We’ve been doing this on the same day, except Thanksgiving or the random ‘on Thursday’ holiday, year in and out for over 30 years. Yes, 30. Some changes go at 10pm to 2am that night, but the 3pm run for this particular type of change is as normal as anything.
One day I get a ticket with the time ‘3:15pm to 5pm.’ You’d think I could just say ‘Sure, not a problem.’ It’s within the 3-5 time slot, and fifteen minutes is nothing. But no. No, I have to say “I’m sorry, but your ticket requires a start time of 3pm. We are not permitted to make exceptions on this.”
It burns at my very soul to have to tell people something this idiotic. I mean, it’s fifteen minutes and it would still run within the allotted time! Heck, the process this guy wanted takes 5 minutes total! But no, our tool locks things down to the point that I can neither start the process early nor can I accept a non-standard time without triggering alerts that, at the end of the month, slap me into the “Oversight Review Board” meeting, where I have to explain why I did it.
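For the curious, the check I keep wishing for is tiny. This is only a sketch under my own assumptions: the function name `fits_window` and the hard-coded 3pm to 5pm window are made up for illustration, and it is not how our actual ticket tool works. It accepts any ticket whose start and end both fall inside the standard slot, instead of demanding an exact 3pm start.

```python
from datetime import time

# Hypothetical standard install window (3pm to 5pm), taken from the post.
WINDOW_START = time(15, 0)
WINDOW_END = time(17, 0)


def fits_window(start, end, window_start=WINDOW_START, window_end=WINDOW_END):
    """Return True if the requested slot lies entirely inside the window."""
    # The ticket must start no earlier than the window opens, finish no
    # later than it closes, and have a start that precedes its end.
    return window_start <= start < end <= window_end
```

Under that rule, a ‘3:15pm to 5pm’ ticket passes, while one that starts at 2:45pm or ends at 5:30pm still needs the secondary approval.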
The problem is that the oversight machine gets in the way of our ability to be productive. This mechanism grew from the ‘old days’ when we would submit a request to make a change, and if I didn’t know that the server was being worked on from 1pm to 2pm, I’d just run the ticket whenever. The timeslots were general guidelines, not set in stone. Then we grew, and people realized they needed to coordinate a server change with a code push (my job) with a db upgrade and then with some other totally separate install. And since no one could possibly be expected to memorize every single moving part in the company, we have a new ticket system to manage it for us.
Old Way: I put in a ticket to make a change with the time/date I’d like it to happen and my boss approves it. The people making the change pick up the ticket and do the work.
New Way: I put in a ticket to make a change with the time/date I’d like it to happen and my boss approves it. If this change has any red-flags (like it’ll take more than 24 hours, or it affects XYZ), it goes to the Change Review Board, who looks at it and either approves it or asks me to come in and explain what I’m doing. Furthermore, if I go on certain dates, it goes to another level of review. If I want to do it in less than a week prep time, it gets extra review and my boss’s boss has to approve it.
Conceptually, this is meant to have enough eyes looking at a change that someone says “Wait! Bob, we can’t upgrade the DB servers that day! Joanne’s major install is that day!” However, nowhere in here is the system actually checking for us and saying ‘You’re going to be touching the following servers.’ Nowhere does a computer do the mind-numbingly boring work it’s great at and verify that all the interlocking pieces related to my change are also not changing, or if they are, it’s a related change.
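To show how little code that boring work takes, here is a minimal sketch. Everything in it is an assumption for illustration: the `Change` record, the `find_conflicts` function, the ticket IDs, and the server names are all hypothetical, and a real system would pull this data from the ticket tool. It flags any two changes that touch the same server at overlapping times, unless one is explicitly marked as related to the other.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import combinations


@dataclass(frozen=True)
class Change:
    ticket: str                            # hypothetical ticket ID
    owner: str
    servers: frozenset                     # servers this change touches
    start: datetime
    end: datetime
    related_to: frozenset = frozenset()    # tickets declared as related


def find_conflicts(changes):
    """Return (ticket_a, ticket_b, shared_servers) for each unrelated clash."""
    conflicts = []
    for a, b in combinations(changes, 2):
        shared = a.servers & b.servers
        # Two time ranges overlap when each one starts before the other ends.
        overlaps = a.start < b.end and b.start < a.end
        related = b.ticket in a.related_to or a.ticket in b.related_to
        if shared and overlaps and not related:
            conflicts.append((a.ticket, b.ticket, sorted(shared)))
    return conflicts


# Made-up schedule: Bob's DB upgrade collides with Joanne's install on db01.
day = (2024, 1, 4)
changes = [
    Change("CHG-1", "Bob", frozenset({"db01", "db02"}),
           datetime(*day, 17, 0), datetime(*day, 21, 0)),
    Change("CHG-2", "Joanne", frozenset({"app01", "db01"}),
           datetime(*day, 15, 0), datetime(*day, 18, 0)),
    Change("CHG-3", "Server team", frozenset({"app01"}),
           datetime(*day, 13, 0), datetime(*day, 14, 0)),
]
print(find_conflicts(changes))  # [('CHG-1', 'CHG-2', ['db01'])]
```

The pairwise comparison is quadratic in the number of changes, which is fine for a week’s worth of tickets. The `related_to` field covers the ‘or if they are, it’s a related change’ case, so coordinated work doesn’t trip the alarm.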
We didn’t make the system work any better, we just became better at covering our asses. Now we know how to write a request with the right buzzwords. Every request I make requires me to include what I’m changing, why I’m changing it, how I’ll test it, what documentation there is, who will be the ‘point’ person, what follow-up we’ll do, and what unexpected problems there might be and how we plan to fix them.
That last one makes me wince. I often write “We don’t expect any problems, but we’ll follow standard troubleshooting guidelines to fix them.” My boss tends to have to rewrite that for me, because my capacity for handling stupid questions is usually filled by the time I’ve completed the 10 questions on the form.
I know that the purpose of all this is to make sure that every change we make is one we needed to make, and that it’s done with the right amount of forethought and understanding. What it’s done is make everyone annoyed, and annoyed people don’t do work efficiently. Also, it’s asking technical people to write explanations for non-techs, something a number of them aren’t good at, and that’s okay! We can’t be expected to be Renaissance Geeks, good at all things.
Should the technical people be able to say ‘This change will make our ATMs faster’? Of course. And they do. But when they’re asked to detail out every single step, multiple times, in multiple ways, they get annoyed. Instead of asking the question once, they ask the ‘what are you doing?’ question 10 times, in 10 ways, to try and get you to answer what they want to know. And at the end of the day, they still don’t know.
Of course, the real reason for all this is so that when it goes wrong, the Bobs can point and go “Well, Joanne there screwed up,” and Joanne can point back and say “I said I was rebooting the ATMs at 4am, and you approved it,” and round and round it goes. I made a lot of friends once when I stepped into an M&M (M&M stands for “morbidity and mortality” and is a periodic conference in many medical centers, usually held to review cases with poor or avoidable outcomes) and announced “I can’t see why the system didn’t run as intended, so the logical reason for the outage was that I made a human error and clicked the wrong button.” Of course then they wanted me to code out human error and I decided they were idiots.
We went from pretty much no oversight past a rubber stamp, and relying on the little guy doing the work to raise any red flags, to massive amounts of oversight where we still rely heavily on the little guy doing the work to raise that red flag. The system locks us in, brooking no room for typos without having to restart the whole chain of events over again, so if you accidentally type in 3:01pm, and the little guy doesn’t notice, you both end up being asked why you did something ‘wrong’ on the metrics report at the end of the month.
Sometimes in my other posts I say that my perspective on the machinations of things like WordPress and Drupal oversight is different. This is why. I’ve seen the extremes on both ends, and I respect the need for both oversight and attentive management. I think that Open Source tends to handle it better because they can’t afford the big massive teams who have but one job, and that is to know everything. They know they can’t, so they know how to work together. They’re not afraid to email/IM/Skype each other for help, and if everything breaks, they can fix it and laugh about it over beer.
It’s not that they don’t ‘get it’, it’s that they do get it. Corporate America doesn’t.
Comments
2 responses to “Too Much Oversight”
Oh… the stories I could tell about this…. one question… what is your Lead time for Installs / changes? (Meaning, how many days in advance can you make a change / install something into production without needing to jump thru hoops or developing stigmata to show that it’s necessary)
Officially, three days. Of course, that’s if the change is a ‘normal’ moderate risk change, being done outside of month-end-processing (last day of the month through the first 5 business days are high risk). Oh and you still have to fill out all the questions, but no one presses you on them unless your install breaks things and you have to roll it back.