Half-Elf on Tech

Thoughts From a Professional Lesbian

Tag: essay

  • Google vs Splogs – Part 2

    Now that you know all about the myth of the duplicate content penalty, we can look into spam.

    This year, Google got slammed because the quality of their search was being degraded by spammers. Mostly splogs, I will admit, but Google rightly points out that their ability to filter out spam and splogs in all languages is actually much better than it was five years ago. (Google search and search engine spam – 1/21/2011 09:00:00 AM) No, Google isn’t getting worse, there are just more spammers out there. They also take the time to differentiate between “pure webspam” and “content farms.”

    “Pure webspam” is what you see in a search result when a website uses meta data or hidden content to bully its way into being highly ranked in unrelated searches, or to just plain game the system. A decade ago, this was horrific. Now it’s nearly negligible. This type of spam grew pretty organically out of people trying to understand the algorithm behind search engines and manipulate it legally. As we gained greater understanding of meta keywords and in-context content, we came up with more and more tricks to legitimately make our sites more popular. There was a point in time when having hidden text stuffed with as many keywords related to your site as possible was not only commonplace, but lauded. It didn’t last long; shortly after the good guys sorted that out, the bad guys did too.

    “Content farms” are the wave of the future, and Google calls them sites with “shallow or low-quality content.” The definition is vague, and basically means a content farm is a website that trolls the internet, takes good data from other sites, and reproduces it as its own. Most content farms insert that data automatically. There is no man behind the scenes manually scanning the internet for related topics and copy/pasting them into the site. Instead, this is all done via software known as content scrapers. The reasons why they do this I’ll get to in a minute, but I think that Google’s statement that they’re going to spend 2011 burning down the content farms is what’s got people worried about duplicate content again.

    A content farm is (partly) defined as a website that exists by duplicating content. Your site’s activity feed/archives/post tags pages are duplicating content for the users. Does that mean your site will be adversely affected because of this?

    No. It will not.

    Google’s algorithm is targeting sites of low content quality. While your stolen post is a beautifully written piece of art on its own, it’s the site as a whole that is used to generate a search ranking. As I’ve been touting for a decade, the trick to getting your site promoted in Google searches is to make a good site. Presuming you made a good site, with good content and good traffic, and it’s updated regularly, there is very little risk that Google will peg your site as being of “low content quality.” Keep that phrase in mind and remember it well. Your site isn’t highly ranked because of low-quality content, remember! It’s the reverse: you’re being ranked for good behavior, good content, and good work, and you will continue to be rewarded for it. In a weird way, content farms are actually helping Google refine their search so that it can tell the difference between good sites and bad! (Why The Web Needs Content Farms – by Eric Ward on February 16, 2011)

    The next Google algorithm update will focus on cleaning content farms from positions of unfair advantage in our index. This will likely affect websites with considerable content copied from other online sources. Once this update is complete, preference will be given to the originators of content. We expect this to be in effect in no less than 60 days. (Google search and search engine spam – 1/21/2011 09:00:00 AM)

    What Google is doing is not only laudable, but necessary. They are adapting to the change in how spam is delivered, and doing so in a way that should not impact your site. The only way I can see this affecting ‘innocent’ sites is those blogs that use RSS feed scrapers to populate their sites. This is why anytime someone asks me how to do that, I either tell them don’t or I don’t answer at all. While I certainly use other news articles to populate my site, I do so by quoting them and crafting my own, individual, posts. In that manner I both express my own creativity and promote the high quality of my own site. I make my site better. And that is the only way to get your site well-ranked. Yes, it is work, and yes, it is time consuming. Anything worth doing is going to take you time, and the sooner you accept that, the happier you will be.

    For most small to medium sites, there’s not a thing you need to do in order to maintain your ranking. There are no magic bullets, no secret SEO tricks to manipulate your site into a better ranking. In point of fact, trying to do so can be seen as gaming the system and can downgrade your results! Once again: make a good site and you will be rewarded. Certainly, as I said yesterday, optimizing your robots.txt file and getting a good sitemap will help, and I really do suggest a Google Webmaster Tools account to help you with that. In 2011, Google is still king, so once you get your site well listed within Google’s machine, you’re pretty much going to be tops everywhere.
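
    Since I keep mentioning robots.txt, here’s a rough sketch of the sort of thing I mean for a typical WordPress install. The paths here are just examples (every site is different), so treat it as a starting point, not gospel:

      # Example robots.txt for a typical WordPress install -- adjust to taste.
      User-agent: *
      Disallow: /wp-admin/
      Disallow: /wp-includes/
      Disallow: /trackback/

      # Tell the search engines where your sitemap lives.
      Sitemap: http://example.com/sitemap.xml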

    Why do splogs and content farms game the system in order to get highly ranked? Profit. Some do it to get their domain highly ranked and then sell it for a lot of money, others do it to infect your computer with a virus, and then there’s the rare hero who thinks this will get them money because of the ads on their site. Sadly, this still works often enough to generate just enough profit to keep the splogs going. This is also true of spam emails. Yes, that means your grandmother and Carla Tallucci are still falling for the Nigerian Princess scam emails. The only way to stop all of that is to stop those methods from being productive money makers for the spammers, and that is something that will take us all a very long time and a great deal of educating the masses.

    Your takeaways are pretty simple. Make a good site with good content. Update it regularly. Use a sitemap to teach search engines what’s important. You’ll be fine. Don’t sweat internal duplication.

  • Google vs Splogs – Part 1

    I am not an SEO expert. In fact, there are only a handful of people whom I feel can claim that title without making me roll my eyes so hard I get a migraine. Anyone who tells you they have all the answers to get your site listed well in Google is a liar, because there’s only one good answer: Make a good site. That’s really it. How then do all those spam sites get listed in Google, Bing and Yahoo to begin with, and will the techniques the search engines are using to eradicate those sites hurt you?

    Everyone’s done that search for something and been nailed by a bunch of splogs. Like when you look for ‘Laurence Fishburne’ news and you get 100 websites that claim to have news, but really it’s just total, useless crap? Those are splogs, a neologism for spam blogs, and they suck. Splogs are blogs where the articles are fake, created only for search engine spamming. They sometimes steal your hard work, by scraping RSS feeds or who knows what else, and use it to generate fake content. Why? Some people do it to infect your PC with a virus, and others do it to trick you into clicking on their ads.

    The problem is spam blogs are so prevalent that they’re adversely affecting search engines, making it harder and harder for you to find real, factual content. This year, rumors started flying that Google was going to go on the warpath against search engine spam, and in doing so, would downgrade perfectly valid sites with ‘duplicate content.’ Having read and re-read the articles posted by Google on the matter, I’m quite certain that, yet again, people are playing Chicken Little. Nowhere, in any of the articles I’ve read, has there been any discussion of the intent to penalize legitimate, valid websites for containing internally duplicated content.

    In order to understand the duplicate content penalty problem, and yes, it is a problem, you need to understand how most content management systems (CMS – this includes software like Drupal, Joomla and WordPress) display their data to the users.

    You write a blog post and the content is stored in the database, along with any tags, categories, or meta data you put in. When someone goes directly to the blog post, they see that content. However, they can also see the post if they go to a list of posts in that category, with that tag, on that date, in that year, etc etc and so on and so forth. So the question a lot of new webfolks ask is “Is that duplicate content?” No. It’s not. Nor is having two different URLs (say, www.ipstenu.org and ipstenu.org) point to the same page. In fact, that’s good for your site. The more valid ways you have of providing your user with information, the easier it is for them to find what they want, and the happier they are. Happy users means repeat users, which means profit (in that oh so nebulous “web = profit” theory).
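
    To make that concrete, a single post on a typical WordPress site ends up reachable at a whole pile of URLs. These permalinks are made up, but you get the idea:

      http://example.com/2011/02/my-post/     <- the post itself
      http://example.com/category/essays/     <- its category archive
      http://example.com/tag/seo/             <- its tag archive
      http://example.com/2011/02/             <- the monthly archive
      http://example.com/author/ipstenu/      <- the author archive
      http://example.com/feed/                <- the RSS feed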

    So what is this mysterious duplicate content penalty?

    Let’s take this from the horse’s mouth (or at least Google):

    Let’s put this to bed once and for all, folks: There’s no such thing as a “duplicate content penalty.” At least, not in the way most people mean when they say that.(Demystifying the “duplicate content penalty” – Friday, September 12, 2008 at 8:30 AM)

    Google goes on to outright state that so long as the intention is well meaning (like making it easier for people to find your valid and useful content), then you will receive no adverse effects in searches for your blog. That means 99.999% of you out there can relax and walk away. What about people who use things like WordPress MU Sitewide Tags Pages (which takes the excerpts of all posts on a WordPress MultiSite installation and duplicates them onto another site), or BuddyPress’s activity stream (which records everything in multiple places)? Again, the answer is the same. You’re doing this to make the site more available and accessible, ergo no harm ergo no foul.

    Google also makes the claim that since CMSs generally don’t handle duplicate content ‘well’ (their word, not mine), non-malicious duplication is common and fairly harmless, though it will affect search results. Here’s where things get sticky. Personally, I disagree with Google’s claim that CMSs handle duplicate content poorly. A well written CMS, knowing that no two people think the same way, takes that into consideration when crafting a site. You want an index, but if you know someone looks for things by subject matter or year, you need to have a way to provide that information for the reader. Google’s problem is that in doing so, you have also provided it for the GoogleBots who patrol your site and pull in the data for searches, which makes the dreaded duplicate content.

    Perhaps Google has forgotten (or not made the connection) that they do the exact same thing. They want to show you what you want to see, and while I may search for “Laurence Fishburne actor” and you might look for “Morpheus Actor”, in the end, we both want to see sites about this actor guy named Laurence Fishburne. How do you make sure we get the right information? You have the content sortable in myriad manners. Does that make it duplicate content? Of course not (unless you’re Bing, which is a whole different subject). Google points out:

    Most search engines strive for a certain level of variety; they want to show you ten different results on a search results page, not ten different URLs that all have the same content. To this end, Google tries to filter out duplicate documents so that users experience less redundancy. (Demystifying the “duplicate content penalty” – Friday, September 12, 2008 at 8:30 AM)

    Thankfully, you can eliminate redundancy by providing Google with a sitemap of your website. (About Sitemaps – Google Webmaster Central) With a good sitemap, you can tell search engines how to weigh your site’s content: which pages are more important, which can be ignored, etc etc. With WordPress and a good plugin, this can be done automatically by making a few choices in an admin interface. You also want to spend a little time understanding your robots.txt file. Perishable Press has a great article on optimizing it for WordPress.
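
    If you’ve never peeked inside one, a sitemap is just a small XML file. Here’s a trimmed-down sketch (the URLs and values are invented); the priority and changefreq values are how you hint at which pages matter most:

      <?xml version="1.0" encoding="UTF-8"?>
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
        <!-- The front page: changes daily, weighted highest. -->
        <url>
          <loc>http://example.com/</loc>
          <changefreq>daily</changefreq>
          <priority>1.0</priority>
        </url>
        <!-- An individual post: rarely changes once published. -->
        <url>
          <loc>http://example.com/2011/02/my-post/</loc>
          <lastmod>2011-02-16</lastmod>
          <changefreq>monthly</changefreq>
          <priority>0.6</priority>
        </url>
      </urlset>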

    Now that you know about the myth behind the duplicate content penalty, tomorrow we can get into content farms!

  • Has your site been exploited or victimized?

    Nothing frosts my lizard more than someone saying ‘WordPress has been hacked!’ and I’ve finally decided it’s just a case of ignorance.

    I’ve been using WordPress since around the famous 2004 MovableType bait’n’switch, when they decided to go paywall. That was not what made me switch to WP. In fact, I had already installed a b2 site to allow anyone in my family to post a story about my grandmother (and I should really put that back online one day…). It was a lot of little things that made me switch, and I don’t really regret my choice. MT is very useful, very cool and very impressive, but it wasn’t what I wanted or needed.

    Yesterday, Byrne Reese posted about how WordPress Won the Blog War. He’s a former Movable Type project manager, so I presume he knows what he’s talking about. As a former member of the MT community (under a non-Ipstenu handle) and current heavy user of WordPress, it’s very entertaining to hear a behind-the-scenes view of the ‘war.’ I never saw it as a war, and as anyone who knows me can attest, I’ve never been a rabid fanboi for one OS or another, one product or another, simply because of who makes it. I like my iPad, but if it doesn’t work for you, I’m more than happy to help you find an alternative. I believe in finding the product that works for you.

    What really caught my attention in the post were the comments. The very first had this gem by Matt Haughey:

    Now that they’ve won the battle, I think the biggest problem for WP now is two-fold: One is the constant threat of exploits with your own WP install. It’s crazy and like running Windows 95 without patches. Everyone I know with a self-hosted WP has been exploited in the last year or two and worries about it regularly.

    Two facts:
    1) My WordPress install has never been hacked in the 7 years I’ve had it.
    2) I do worry about it constantly.

    About a year ago, my server was hacked. Ironically it came three days after I’d posted about WordPress security. How was I hacked? Because I followed insecure practices. I’ve touted, for a while now, that security is a tripod:

    • The Web Host is responsible for making sure the server itself is up to date with the latest patches etc, and that the server is configured in a safe way.
    • Web-apps are responsible for not unleashing needless insecurities to the system.
    • The end-user… well, we pray to the flying spaghetti monster that they’ve not done something to violate security out of ignorance.

    I was hacked because I violated security, which made my server open to attack, which thankfully resulted in my Web Host bailing me out (have I mentioned I love them?). I went to a website on a non-virus-protected PC (yes, Windows), I got what looked like a suspicious pop-up in IE from a site I knew and trusted, and while the pop-up was there, I opened an FTP (not secure FTP!) connection to my server. I seriously could not have been stupider. Thankfully it was easy to fix, and I’ve since turned off FTP (it’s SFTP or nothing). Actually I also wiped Windows XP off my computer, but previously it was required for my work.

    On Byrne’s post, Mark Jaquith (a WP developer) remarked this:

    I haven’t seen an up-to-date WordPress install get directly exploited in around five years. Seriously.

    I thought about this for a moment, and had to nod. This is true for me as well. Every WordPress install I’ve seen with problems has been due to the web-host or the end-user being insecure. Even when that end-user is me, I’ve yet to have WordPress itself hacked. This does not mean I think WordPress can’t be hacked, just that it’s actually a pretty secure little tool by itself.

    Then Mark went on to say this:

    All of the large scale instances of WordPress being compromised lately were because of web hosts who don’t prevent users on one account from accessing files on another account. In these cases, WordPress wasn’t exploited so much as it was victimized due to a lower level security issue on the server.

    He was far more succinct than I’ve been able to be on the matter; I’ve touted for a long time that the problem is WordPress, but it’s not WordPress’s fault. Ask anyone in IT why Windows has more viruses than a Mac, and most of us will tell you it’s because Windows is more popular. More people use it, so more hackers/spammers/crackers target it. I wouldn’t say, in 2011, that Windows 7 is more vulnerable than OS X, but I would feel comfortable saying that it is targeted more.

    The answer is the same when I’m asked why WordPress gets so much spam. Because it’s used a lot! The more prevalent your product is (i.e. the more successful it is), the higher the likelihood is that some jerk with a kiddie script will try to attack it. This is just a fact of life, and I’m not going to get into how to solve it.

    What I feel we need to be aware of is the education of the user base for any product. My father once gave a memorable lecture I caught when I was about six or seven, about our expectations with computers and why AI was never going to be like we saw on Star Trek. “Ignore the man behind the curtain!” he said to the crowd. Back then, I had no idea what he meant. Today I realize that it was two-fold. On the one hand, we think ‘Automate everything! Make it all just work!’ That’s the magic box theory of computers. It all just works and we don’t have to do anything. The reality is that there is always a man behind the curtain, making the magic happen.

    The ‘two-fold’ meaning is that (1) we want everything to work perfectly without manual intervention, and that’s just not possible and (2) we don’t want to have to learn WHY it all works, just make it magically work.

    My savvy readers are, at this point, thinking “But if I don’t know why it works, how can I fix it?” To them I shrug and agree that you cannot be expected to fix anything you do not understand. Furthermore, the less you understand something, the more likely you are to inaccurately blame someone/something. Which brings us back to why I hate when people say ‘WordPress has been hacked!’ Actually, I hate it when they say anything has been hacked (Drupal, Joomla, WordPress, MovableType, etc etc etc).

    We have a few choices at this point. We can stop ignoring the man behind the curtain and learn how the levers work ourselves, or we can accept that we’re not clever enough and hire someone. Either way, we should always take the time to sort out what’s wrong. When my cat was, recently, in the kitty ER for bladder stones (she’s fine now), racking up a $1000+ bill for services, I wanted to know all about what caused them, why the new food would work, etc etc. I’m not a vet. I would never make it through medical school (I don’t like blood). But I know how to use my brain. As my professor, Dr. Lauer, told me in high school, “We’re teaching you how to think, how to talk to adults while you’re a child, so you know how to be a person.”

    Teach people how to think. You’d never take your Mercedes Benz to Jiffy Lube for an overhaul, so why are you trusting a $5/month webhost without a phone number to support your business? You wouldn’t take your child to a back-alley doctor, so why are you hiring some guy with blink-tags on his site to fix your website? Use your brain. If your webhost tells you ‘Sorry, we can’t help you,’ then take your money someplace else. Website support should always include them taking backups at least every day (you may only get yesterday’s backups, but they should still have ’em). A good host will help you when you ask specific questions.

    My host (there’s a link on the top right) will answer the phone 24/7, they helped me craft a backup strategy, undo the hack on my server, trace down what was using up so much CPU, beat mod_security into submission … the list goes on and on. My point here is not that you should use them (though if you do, tell them I sent you!), but that you should find a host who supports you to the level you need. The brunt of what you pay for hosting is an insurance policy. You’re paying them to bail you out when (yes, when) you need help, and if you’re only paying $5 a month, then you should only expect that level of return.

    Educate yourself, educate your host, but have realistic expectations.

  • Don’t Bother Disabling Right-Click

    Every now and then I see someone ask ‘How do I disable right-clicking on images on my site?’ My answer, invariably, is ‘You don’t.’ The real question I suppose is ‘How do I stop people from ripping off my work on the net?’ and the answer to that is still ‘You don’t.’

    Is it online? Yes? Then it can, and will, be stolen. Does that matter? Kind of. The only way to make your works un-steal-able is to never publish them anywhere.

    When the last Harry Potter book came out, some diligent nerd spent countless hours photographing every page of the book and uploading it online, and oh look, we all knew how it ended. That did not stop everyone from buying the book though, and in the end, it was pretty much an amusing footnote in the saga of Harry Potter. And anyone who thought Harry wouldn’t defeat Voldemort was clearly not paying attention.

    When I put my dad’s stuff up online, I told him it would be better to convert his PDFs to easily readable HTML. He didn’t want to because they could be stolen. I pointed out that the PDFs are actually easier to rip (download, done), and the HTML was just being generated by copy/pasting from the PDF anyway, so he was already there. The point was to make his work more accessible.

    Does this mean you shouldn’t protect your data? Of course not! But the point is if you publish it, online or offline, it can, and will, be stolen. The only thing online media does is make it ‘easier’ to steal and re-publish under someone else’s name. Without getting into the myriad laws of copyright, I want to point out that your published work is only part of your brand. If someone steals a post I make here, yes, they’re taking away from my audience, but realistically, they’re not hurting my bottom line. The reason you’re here, reading my posts, is because I’ve reached you. Either you read my social media outlets, my RSS feed, or you know me and follow me directly. But the people who are going to read this, you, are here because of work I’ve already done. The work I continue to do keeps you here, and you become my promoters. The only thing the thieves do is hurt my search engine rankings, and in my experience not even that.

    A brand is more than just your work. It’s you, your image, your representation. Spending all your time worrying about your SEO ranking means you’re missing the point. Of course a high result on a Google search is important, but that’s just one piece of the pie.

    Someone is bound to tell me that all of this is nice and dandy, but why, technically, is it a bad idea to try and protect your media/data?

    Disabling right-click is supposed to prevent people from downloading your media. But if I view the page source, I get the URL of your image, load that into a new browser window, and download your stuff. Or I can drag-and-drop the image to my desktop, if you’ve somehow disabled view-source. Those don’t work? Why not check the cache of what my browser automatically downloaded when I visited your page? Or how about a screen shot of what’s active on my screen right now? That’s all stuff I can do without even needing to be code-savvy.
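
    For the curious, the ‘protection’ usually amounts to a one-liner like this (a simplified sketch, not any particular plugin’s code), which should make it obvious how little it actually protects:

      <!-- The classic no-right-click trick: cancel the context menu event. -->
      <body oncontextmenu="return false;">
        <!-- The image URL still sits right here in the source, in the browser's
             cache, and in any screen shot. Nothing is actually protected. -->
        <img src="/images/my-precious-photo.jpg" alt="My photo">
      </body>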

    Google for “download image when right click is disabled” and you’ll get millions of hits. There’s no way to protect your media, once it’s online. The best you can do is to brand it in such a way that even if someone does get a copy, it is permanently marked as ‘yours’. Watermarks are the normal way to do this and probably the only ‘good’ way to do it, as they tend to be the hardest thing to remove.
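
    If you have ImageMagick handy, watermarking can even be automated. Something along these lines (assuming you’ve made a watermark.png of your logo; file names are just examples) stamps it into the corner of a photo:

      composite -dissolve 30 -gravity southeast watermark.png photo.jpg photo-marked.jpg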

    Don’t bother disabling right-click, or trying to stop people from downloading/stealing your stuff. Don’t put it online unless you’re willing to have it nicked. Make your brand identifiable with your site, and people will know to come to you.

  • Fighting Spam – We’re Doing it Wrong

    I’ve talked about this before (Spam/Splog Wars) but instead of telling you how to fight spam, I’m going to talk about how it works.

    There are two main kinds of spambots, email and forum. Forum bots are a misnomer, as they’re really ‘posting’ bots, but we won’t get into the semantics of names. These posting bots are basically written to scan the web, via links, looking for anything that accepts data (like a comment form on your blog) and submit spam content. There is a limited number of ways for data to be entered on a website, and generally it’s all done via a form. All a bot has to do is search the code of the page for form fields and input its crap content. Given that coders tend to label form fields things like ‘content’ and ‘email’ and ‘username’ for ease of support and modification, it’s not too hard to see how the spambots are written to exploit that functionality.

    The main ways to prevent spam are to block the spammers or make users prove they’re real people. Obviously manual spammers (and yes, there are people paid to post spam on blogs) can’t be defeated this way. Blocking spammers is done either by their behavior (if they access your site via anything other than a browser, or if they post data in any way other than from the ‘post comment’ link on your site, they’re probably a spammer) or by a known-bad-guy list. Both have problems, as spammers have multiple methods of attack. Making a person prove their humanity is generally done via CAPTCHA, which I’ve mentioned before has its own issues. (CAPTCHA Isn’t Accessible) Another way is to go at it in reverse. Knowing that the spam bots scan a webpage for fields they can fill in, you set up a hidden field which only a spammer would fill in. If they fill that field in, they’re spam and they go to the bin.
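
    If you’ve never seen a honeypot field, it’s about as simple as it sounds. A rough sketch (the field names are made up):

      <form action="/comment" method="post">
        <label>Name <input type="text" name="author"></label>
        <label>Comment <textarea name="comment"></textarea></label>

        <!-- Humans never see this field (it's hidden with CSS), but a bot
             scanning the markup will happily stuff a value into it. If
             "website_url_2" comes back filled in, the comment goes to the bin. -->
        <div style="display:none;">
          <label>Leave this blank <input type="text" name="website_url_2"></label>
        </div>

        <input type="submit" value="Post Comment">
      </form>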

    I think that our methods for blocking spammers are wrong.

    Everything we’re doing is defensive. We wait for a spammer to attack, sort out what they did, and write better code to stop them. Sure, we have some Bayesian spam tools out there, and some fairly impressive algorithmic learning engines that check the content of a post and punt it to the trash. But in the end, it’s all reactive medicine.

    Anyone who does support will tell you that even one email or comment sent to the spam bin accidentally is bad. How bad? How many times have you said ‘Did you check your spam emails?’ to someone who didn’t get an email? The more you impede your real users, the regular people who visit your site, the worse your spam-fighting is. If a user has to jump through hoops to comment, they’re not going to do it. If a user has to register to post, they probably won’t do it (forums are an exception to this).

    The goal of spam fighting should be to prevent the spam from being sent.

    We all know why spammers spam. If even one person clicks on their links and buys a product, they’ve profited. It’s just that simple. They make money for a low-cost output, it works, so they keep doing it. Therefore the only way to make them stop is to remove the profitability. Of course if it was that easy, we’d have done it already.

    This raises the question ‘Who runs these spam bots?’ The answer is twofold. There are people out there who run spamming programs, certainly, but there are also people who’ve had their computers infected by a virus or malware, and unknowingly spam people. That’s part of why I don’t like blocking IPs. However, the software had to come from somewhere. Cut off the head of the snake and the spam stops! We already have the Spamhaus Project, which patrols the internet and looks for sites pushing malware, as well as sending spam email. There’s also Project Honey Pot, which fakes links and forms to trick spam bots.

    Sadly, until we have more resources dedicated to stopping spam service providers, we’re going to have to keep trying to outthink them and play defense.

  • Failure of Imagination

    This is not an excuse.

    People make mistakes, and we all accept that. But what I find astounding is that a lot of users look at the software developers and say things like “I find it unacceptable that you let this critical error slip through.” They seem to think that anything less than a perfect piece of software coded by perfect people with no errors is cause to demean the programmers.

    The other day my father (a seasoned risk analyst – see http://woody.com) passed me an article by Herbert Hecht called Rare Conditions – An Important Cause of Failures, wherein Hecht explains that “rarely executed code has a much higher failure rate (expressed in execution time) than frequently executed code during the early operational period.” The point, for those of you who felt your eyes glaze over at the big words, is that the less often a piece of code is used, the more likely it is to break.

    I read this article and immediately shouted “Yes! Exactly what I’ve been saying!” And it’s not because we don’t try to write the best code we can, either. It’s because most of us use conventional test case preparation. We say ‘This is what we want the code to do, we shall test it and see if it works. Great! Now what if I did this…’ The problem there is you need a skilled person coming up with the ‘what if.’

    Back in my desktop-software support days, I was testing a piece of software I’d never used before, and I crashed the program. Hard. I had to reboot. And I found I could crash it repeatedly the same way. I called the vendor, who sent their best techs out to look at it. I showed him what I did and he started laughing. What I was doing was something no one familiar with the software, and its purpose, would do, because it was simply wrong. Like putting your phone number in the field for your first name. They agreed that they should error trap it, however, and that it certainly shouldn’t crash the system.

    Why didn’t that problem ever show up in their tests? To quote Frank Borman on the Apollo 1 fire: “Failure of imagination.” They simply couldn’t conceive of a world wherein someone could be that ignorant of the right way to do things. They didn’t document it, because it wasn’t a requirement of their software but of the process being completed, and they didn’t error-trap it because no one in their right mind would do that.

    The world keeps making bigger and bigger fools, doesn’t it? When you test code, you’re always going to have a bias. You’re looking at it from the perspective of someone familiar with both the program and its purpose. When we test WordPress (and I include me on this as I beta test WordPress and file trac reports when I find problems) we test it from the perspective of experienced WordPress users. We’re the people who read the documentation on what changed. We’re familiar. And that’s the problem.

    So we look at Hecht’s second suggestion: “Random testing over a data set that is rich in opportunities for multiple rare conditions.” Basically it’s making a list of everything that could go wrong, the really wild and rare errors you’ve seen (in the case of WordPress, you could probably cull some great ones from the forums; at work, I review my trouble tickets and make a list of the most common), and testing that. Testing stuff you KNOW will break. Again, this has the problem of bias, but it allows you to make sure that when your code fails, it fails elegantly. This kind of testing has the other problem of finding the right data set. That really is the hardest part, and it takes some seriously dedicated people to come up with one that limits bias.
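
    To illustrate the idea (this is not anyone’s actual test suite, just a toy sketch in Python with a made-up parse_name() routine), a ‘rare condition’ test feeds a routine the kinds of input nobody sane would enter and checks that it fails elegantly instead of crashing:

      import unittest

      def parse_name(raw):
          """Hypothetical routine under test: accept a plausible name, reject junk."""
          if not isinstance(raw, str) or not raw.strip():
              raise ValueError("not a name")
          if any(ch.isdigit() for ch in raw):
              raise ValueError("names don't contain digits")
          return raw.strip()

      class RareConditionTests(unittest.TestCase):
          # Inputs culled from the "nobody would ever do that" pile.
          WEIRD_INPUTS = ["555-867-5309", "", "   ", None, "Robert'); DROP TABLE users;--"]

          def test_weird_input_fails_elegantly(self):
              for raw in self.WEIRD_INPUTS:
                  try:
                      result = parse_name(raw)
                      # If it claims success, it must hand back a cleaned-up string.
                      self.assertIsInstance(result, str)
                  except ValueError:
                      pass  # An expected, elegant failure -- exactly what we want.
                  # Any other exception escapes and fails the test: that's the crash we're hunting.

      if __name__ == "__main__":
          unittest.main()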

    Hecht’s final suggestion is “Path testing, particularly where semantic analysis is used to eliminate infeasible paths” but he quickly points out the problems:

    [This] technique can be automated and is the only one for which an objective completeness of test criterion can be identified. However, it is costly and will practically be restricted to the most essential portions of a program.

    So what can we do, other than be smarter? We can test better, certainly, but that’s more difficult than it should be. And why does all this happen in the first place? We’re smart people! We should code better! In a related article (also sent by my father), David Lorge Parnas discusses Software Aging. In that paper, he posits that too often we concentrate on getting the first release done, rather than looking at the long term goals of the software.

    This article, I find, has particular relevance to the open source community, which is filled with people who become software ‘engineers’ via non-traditional paths. How non-traditional? In Parnas’ article (dated 1994) he mentions that software designers come either from the classic computer science background or they’re home-grown from their business specialty. This means that a person who writes code for a bank is either a programmer who knows their coding languages and the basics of how to think ‘in computer’, or a bank employee who picks up code and learns it on the job. There are obvious drawbacks to both backgrounds. The banker understands the desired end functionality of the program, and the CompSci guy understands how to write it, but not how it’s used. They are, both, too specialized.

    I inherited some code that was undocumented and had the problem of inelegant failures on rarely run processes. Over the last five years, I’ve steadfastly cleaned this up to error-trap correctly and output meaningful errors. I don’t have a CompSci degree. Actually, I don’t have a degree at all, in anything, and while I never declared my major, it was Anthropology (by virtue of the courses I’d taken). As luck would have it, I’m also somewhat ignorant of the purpose of most of the code I write. This means every time I’m tasked with a new project, I look at it with fresh eyes, and I can see the flaws that people in their little boxes are unable to see. I’m perpetually on the outside, which means my perspective is almost always ignorant, but rarely is it unintelligent. On multiple occasions, my simple question of ‘Why are we doing it this way? Is it to make it easier on the end user or the programmer?’ has elicited astounding reactions. I can help pull the programmers out of their heads so they can look at the long view.

    Parnas calls that ‘designing for change’ (a familiar ad slogan in the 1970s). Part of the problem, he thinks, is that people don’t have the appropriate education for their job and, thus, are untutored in the basics of programming and the related thought process. I disagree, and not just because I am inappropriately educated for my day job. It’s true my ‘training’ didn’t prepare me to write code, but my education did. I was taught, from a very early age, to think and reason and question. Those basic principles are why I’m able to do my job and follow along with Parnas’ work. Perhaps it would be better to say people have not made the effort to learn the basic groundings of good software design, independent of their education and ‘purpose.’ The banker can learn software design regardless of the intent of his code, and the CompSci programmer can master enough of banking to understand the purpose of what he writes. The middle ground that keeps both the goal and the design in view is what allows us to design for the future and write code that can grow and age.

    That all depends on whether the code is well documented. Parnas rightly twigs to the most common problem I’ve seen in programming. No one likes to write documentation, and when we do, it’s clear as mud to someone who is unfamiliar with our goal and design. If you don’t know the goal the programmers had when they came up with Drupal (to pick on someone else for a change), then none of the documentation will help you. Software, being based on mathematics, should have documentation that reflects its parentage, says Parnas. This should not be confused with user documentation, which should have none of that. Developer documentation should resemble mathematical documentation if it has any hope of being useful and lasting.

    While Parnas’ paper was written in 1994, I wonder if the problem of crossover between developer and user was as prevalent as it is today. Today there’s no clear line between the developer who writes the code and the end user who wants to use it. This is the case most noticeably in open source projects, like WordPress, Drupal, Joomla, and so on. These projects are championed by the developer/user, a creature that may not have existed as such a widespread phenomenon 17 years ago. While Parnas does mention the possibility in use-cases, he only does it to highlight the problem of isolation among developer groups, and not as a potential root cause of why problems are missed. They are missed in the isolated groups because we cannot see outside our ‘box’, for lack of a better room, and envision that particular ‘what if.’

    This is why we have a great need for reviews. When a doctor tells you that you have cancer and you’re going to die, you seek out a second opinion. The same is done with software. The code works for you, so you let it out into the world for other people to test. We need the outside sources to come and bang on the rocks and tell us what’s wrong. That’s why you see people like @JaneForShort asking more people to join the WordPress beta tests. She knows the core developers can’t possibly test everything, and the more use-cases we can come up with, the more we can make the end result better.

    Would having more professionals solve this problem? Parnas seems to think so. He thinks that having more people trained in engineering and science combined will produce better programmers. After all, what is coding but math applied to engineering? I feel it’s the techniques that are more important. A grounding in basic algebra (and some calculus, certainly) should be enough to be able to program in most languages. And a well-formed understanding of the disciplines of engineering should allow a person to look past ‘This is my problem, how do I solve it?’ You need innovation, understanding, and a respect for how things work in order to write effective programs.

    Why did we miss a critical error? Because we didn’t see it. It’s always going to remain that simple. With better education, will we be able to see it in the future? Perhaps. But all the traditional book education in the world cannot teach a person how to think. Even if we can perfect the creation of the well-thinking human, we will always be losing a battle against the universe creating bigger fools. But those thinking people can find the problem, once it’s reported, solve it, and learn to make it better the next time.

    And that should be our goal.

    Make the code better, test it smarter, document it clearly, and plan it thoughtfully.