Half-Elf on Tech

Thoughts From a Professional Lesbian

Author: Ipstenu (Mika Epstein)

  • Google vs Splogs – Part 1

    Google vs Splogs – Part 1

    I am not an SEO expert. In fact, there are only a handful of people who I feel can claim that title without making me roll my eyes so hard I get a migraine. Anyone who tells you they have all the answers to get your site listed well in Google is a liar, because there’s only one good answer: Make a good site. That’s really it. How then do all those spam sites get listed in Google, Bing and Yahoo to begin with, and will the techniques the search engines are using to eradicate those sites hurt you?

    Everyone’s done that search for something and been nailed by a bunch of splogs. You look for ‘Laurence Fishburne’ news and get 100 websites that claim to have news, but really it’s just total, useless crap. Those are splogs, a neologism for spam blogs, and they suck. Splogs are blogs where the articles are fake, created only for search engine spamming. They sometimes steal your hard work by scraping RSS feeds or who knows what else, and use it to generate fake content. Why? Some people do it to infect your PC with a virus, and others do it to trick you into clicking on their ads.

    The problem is spam blogs are so prevalent that they’re adversely affecting search engines, making it harder and harder for you to find real, factual content. This year, rumors started flying that Google was going to go on the warpath against search engine spam, and in doing so, would downgrade perfectly valid sites with ‘duplicate content.’ Having read and re-read the articles posted by Google on the matter, I’m quite certain that, yet again, people are playing Chicken Little. Nowhere, in any of the articles I’ve read, has there been any discussion of an intent to penalize legitimate, valid websites for containing internally duplicated content.

    In order to understand the duplicate content penalty problem, and yes, it is a problem, you need to understand how most content management systems (CMS – this includes Drupal, Joomla and WordPress) display their data to users.

    You write a blog post and the content is stored in the database, along with any tags, categories, or meta data you put in. When someone goes directly to the blog post, they see it. However, they can also see the post if they go to a list of posts in that category, with that tag, on that date, in that year, etc etc and so on and so forth. So the question a lot of new webfolks ask is “Is that duplicate content?” No. It’s not. Nor is having www.ipstenu.org and ipstenu.org point to the same page. In fact, that’s good for your site. The more valid ways you have of providing your user with information, the easier it is for them to find what they want, and the happier they are. Happy users means repeat users, which means profit (in that oh so nebulous “web = profit” theory).
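
    As an aside, if you want to be explicit about which of those many URLs is the ‘main’ one for a post, that’s what the rel=canonical tag is for. WordPress already adds one to single posts on its own, so this is purely a sketch of the idea (assuming a theme that calls wp_head(), with a made-up function name):

    // A sketch only: point search engines at the permalink as the canonical URL.
    // WordPress core already does this for single posts, so this is purely illustrative.
    function ipstenu_example_canonical() {
        if ( is_singular() ) {
            echo '<link rel="canonical" href="' . esc_url( get_permalink() ) . '" />' . "\n";
        }
    }
    add_action( 'wp_head', 'ipstenu_example_canonical' );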

    So what is this mysterious duplicate content penalty?

    Let’s take this from the horse’s mouth (or at least Google):

    Let’s put this to bed once and for all, folks: There’s no such thing as a “duplicate content penalty.” At least, not in the way most people mean when they say that. (Demystifying the “duplicate content penalty” – Friday, September 12, 2008 at 8:30 AM)

    Google goes on to outright state that so long as the intention is well meaning (like making it easier for people to find your valid and useful content), then you will receive no adverse effects in searches for your blog. That means 99.999% of you out there can relax and walk away. What about people who use things like WordPress MU Sitewide Tags Pages (which takes the excerpts of all posts on a WordPress MultiSite installation and duplicates them onto another site), or BuddyPress’s activity stream (which records everything in multiple places)? Again, the answer is the same. You’re doing this to make the site more available and accessible, ergo no harm ergo no foul.

    Google also makes the claim that since CMSs generally don’t handle duplicate content ‘well’ (their word, not mine), non-malicious duplication is common and fairly harmless, though it will affect search results. Here’s where things get sticky. Personally, I disagree with Google’s claim that CMSs handle duplicate content poorly. A well written CMS, knowing that no two people think the same way, takes that into consideration when crafting a site. You want an index, but if you know someone looks for things by subject matter or year, you need to have a way to provide that information for the reader. Google’s problem is that in doing so, you have also provided it for the GoogleBots who patrol your site and pull in the data for searches, which makes the dreaded duplicate content.

    Perhaps Google has forgotten (or not made the connection) that they do the exact same thing. They want to show you what you want to see, and while I may search for “Laurence Fishburne actor” and you might look for “Morpheus Actor”, in the end, we both want to see sites about this actor guy named Laurence Fishburne. How do you make sure we get the right information? You have the content sortable in myriad manners. Does that make it duplicate content? Of course not (unless you’re Bing, which is a whole different subject). Google points out:

    Most search engines strive for a certain level of variety; they want to show you ten different results on a search results page, not ten different URLs that all have the same content. To this end, Google tries to filter out duplicate documents so that users experience less redundancy. (Demystifying the “duplicate content penalty” – Friday, September 12, 2008 at 8:30 AM)

    Thankfully, you can eliminate redundancy by providing Google with a sitemap of your website. (About Sitemaps – Google Webmaster Central) With a good sitemap, you can tell search engines how to weigh your site’s content: which pages are more important, which can be ignored, etc etc. With WordPress and a good plugin, this can be done automatically by making a few choices in an admin interface. You also want to spend a little time understanding your robots.txt file. Perishable Press has a great article on optimizing it for WordPress.
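
    If you’d rather not hand-edit robots.txt at all, WordPress also has a robots_txt filter, which only kicks in when WordPress is generating the virtual robots.txt (i.e. you don’t have a physical file). A minimal sketch, with a placeholder sitemap URL and a made-up function name:

    // A minimal sketch: append a Sitemap line to WordPress's virtual robots.txt.
    // The sitemap URL is a placeholder; use whatever your sitemap plugin actually generates.
    function ipstenu_example_robots( $output, $public ) {
        if ( '1' == $public ) {
            $output .= 'Sitemap: ' . home_url( '/sitemap.xml' ) . "\n";
        }
        return $output;
    }
    add_filter( 'robots_txt', 'ipstenu_example_robots', 10, 2 );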

    Now that you know about the myth behind the duplicate content penalty, tomorrow we can get into content farms!

  • Has your site been exploited or victimized?

    Has your site been exploited or victimized?

    Nothing frosts my lizard more than someone saying ‘WordPress has been hacked!’ and I’ve finally decided it’s just a case of ignorance.

    I’ve been using WordPress since around the famous 2004 MovableType bait’n’switch, when they decided to go paywall. That was not what made me switch to WP. In fact, I had already installed a b2 site in order to let anyone in my family post a story about my grandmother (and I should really put that back online one day…). It was a lot of little things that made me switch, and I don’t really regret my choice. MT is very useful, very cool and very impressive, but it wasn’t what I wanted or needed.

    Yesterday, Byrne Reese posted about how WordPress Won the Blog War. He’s a former Movable Type project manager, so I presume he knows what he’s talking about. As a former member of the MT community (under a non-Ipstenu handle) and current heavy user of WordPress, it’s very entertaining to hear a behind-the-scenes view of the ‘war.’ I never saw it as a war, and as anyone who knows me can attest, I’ve never been a rabid fanboi for one OS or another, one product or another, simply because of who makes it. I like my iPad, but if it doesn’t work for you, I’m more than happy to help you find an alternative. I believe in finding the product that works for you.

    What really caught my attention in the post were the comments. The very first had this gem by Matt Haughey:

    Now that they’ve won the battle, I think the biggest problem for WP now is two-fold: One is the constant threat of exploits with your own WP install. It’s crazy and like running Windows 95 without patches. Everyone I know with a self-hosted WP has been exploited in the last year or two and worries about it regularly.

    Two facts:
    1) My WordPress install has never been hacked in the 7 years I’ve had it.
    2) I do worry about it constantly.

    About a year ago, my server was hacked. Ironically it came three days after I’d posted about WordPress security. How was I hacked? Because I followed insecure practices. I’ve touted, for a while now, that security is a tripod:

    • The Web Host is responsible for making sure the server itself is up to date with the latest patches etc, and that the server is configured in a safe way.
    • Web-apps are responsible for not unleashing needless insecurities on the system.
    • The end-user… we pray to the Flying Spaghetti Monster that they’ve not done something to violate security out of ignorance.

    I was hacked because I violated security, which made my server open to attack, which thankfully resulted in my Web Host bailing me out (have I mentioned I love them?). I went to a website on a non-virus-protected PC (yes, Windows), I got what looked like a suspicious pop-up in IE from a site I knew and trusted, and while the pop-up was there, I opened an FTP (not secure FTP!) connection to my server. I seriously could not have been stupider. Thankfully it was easy to fix, and I’ve since turned off FTP (it’s SFTP or nothing). Actually, I also wiped Windows XP off my computer, but previously it was required for my work.

    On Byrne’s post, Mark Jaquith (a WP developer) remarked this:

    I haven’t seen an up-to-date WordPress install get directly exploited in around five years. Seriously.

    I thought about this for a moment, and had to nod. This is true for me as well. Every WordPress install I’ve seen with problems has been due to the web-host or the end-user being insecure. Even when that end-user is me, I’ve yet to have WordPress itself hacked. This does not mean I think WordPress can’t be hacked, just that it’s actually a pretty secure little tool by itself.

    Then Mark went on to say this:

    All of the large scale instances of WordPress being compromised lately were because of web hosts who don’t prevent users on one account from accessing files on another account. In these cases, WordPress wasn’t exploited so much as it was victimized due to a lower level security issue on the server.

    He was far more succinct than I’ve been able to be on the matter; I’ve touted for a long time that the problem is WordPress, but it’s not WordPress’s fault. Ask anyone in IT why Windows has more viruses than a Mac, and most of us will tell you it’s because Windows is more popular. More people use it, so more hackers/spammers/crackers target it. I wouldn’t say, in 2011, that Windows 7 is more vulnerable than OS X, but I would feel comfortable saying that it is targeted more.

    The answer is the same when I’m asked why WordPress gets so much spam. Because it’s used a lot! The more prevalent your product is (i.e. the more successful it is), the higher the likelihood is that some jerk with a kiddie script will try to attack it. This is just a fact of life, and I’m not going to get into how to solve it.

    What I feel we need to be aware of is the education of the user base for any product. My father once gave a memorable lecture I caught when I was about six or seven, about our expectations with computers and why AI was never going to be like we saw on Star Trek. “Ignore the man behind the curtain!” he said to the crowd. Back then, I had no idea what he meant. Today I realize that it was two-fold. On the one hand, we think ‘Automate everything! Make it all just work!’ That’s the magic box theory of computers. It all just works and we don’t have to do anything. The reality is that there is always a man behind the curtain, making the magic happen.

    The ‘two-fold’ meaning is that (1) we want everything to work perfectly without manual intervention, and that’s just not possible and (2) we don’t want to have to learn WHY it all works, just make it magically work.

    My savvy readers are, at this point, thinking “But if I don’t know why it works, how can I fix it?” To them I shrug and agree that you cannot be expected to fix anything you do not understand. Furthermore, the less you understand something, the more likely you are to inaccurately blame someone/something. Which brings us back to why I hate when people say ‘WordPress has been hacked!’ Actually, I hate it when they say anything has been hacked (Drupal, Joomla, WordPress, MovableType, etc etc etc).

    We have a few choices at this point. We can stop ignoring the man behind the curtain and learn how the levers work ourselves, or we can accept that we’re not clever enough and hire someone. Either way, we should always take the time to sort out what’s wrong. When my cat was recently in the kitty ER for bladder stones (she’s fine now), racking up a $1000+ bill for services, I wanted to know all about what caused them, why the food would work, etc etc. I’m not a vet. I would never make it through medical school (I don’t like blood). But I know how to use my brain. As my professor, Dr. Lauer, told me in high school, “We’re teaching you how to think, how to talk to adults while you’re a child, so you know how to be a person.”

    Teach people how to think. You’d never take your Mercedes Benz to Jiffy Lube for an overhaul, so why are you trusting a $5/month webhost without a phone number to support your business? You wouldn’t take your child to a back-alley doctor, so why are you hiring some guy with blink-tags on his site to fix your website? Use your brain. If your webhost tells you ‘Sorry, we can’t help you,’ then take your money someplace else. Website support should always include them taking backups at least every day (you may only get yesterday’s backups, but they should still have ’em). A good host will help you when you ask specific questions.

    My host (there’s a link on the top right) will answer the phone 24/7, they helped me craft a backup strategy, undo the hack on my server, trace down what was using up so much CPU, beat mod_security into submission … the list goes on and on. My point here is not that you should use them (though if you do, tell them I sent you!), but that you should find a host who supports you to the level you need. The brunt of what you pay for hosting is an insurance policy. You’re paying them to bail you out when (yes, when) you need help, and if you’re only paying $5 a month, then you should only expect that level of return.

    Educate yourself, educate your host, but have realistic expectations.

  • Don’t Bother Disabling Right-Click

    Don’t Bother Disabling Right-Click

    Every now and then I see someone ask ‘How do I disable right-clicking on images on my site?’ My answer, invariably, is ‘You don’t.’ The real question I suppose is ‘How do I stop people from ripping off my work on the net?’ and the answer to that is still ‘You don’t.’

    Is it online? Yes? Then it can, and will, be stolen. Does that matter? Kind of. The only way to make your works un-steal-able is to never publish them anywhere.

    When the last Harry Potter book came out, some diligent nerd spent countless hours photographing every page of the book and uploading it online, and oh look, we all knew how it ended. That did not stop everyone from buying the book though, and in the end, it was pretty much an amusing footnote in the saga of Harry Potter. And anyone who thought Harry wouldn’t defeat Voldemort was clearly not paying attention.

    When I put my dad’s stuff up online, I told him it would be better to convert his PDFs to easily readable HTML. He didn’t want to because they could be stolen. I pointed out that the PDFs are actually easier to rip (download, done), and the HTML was just being generated by copy/pasting from the PDF anyway, so he was already there. The point was to make his work more accessible.

    Does this mean you shouldn’t protect your data? Of course not! But the point is if you publish it, online or offline, it can, and will, be stolen. The only thing online media does is make it ‘easier’ to steal and re-publish under someone else’s name. Without getting into the myriad laws of copyright, I want to point out that your published work is only part of your brand. If someone steals a post I make here, yes, they’re taking away from my audience, but realistically, they’re not hurting my bottom line. The reason you’re here, reading my posts, is because I’ve reached you. Either you read my social media outlets, my RSS feed, or you know me and follow me directly. But the people who are going to read this, you, are here because of work I’ve already done. The work I continue to do keeps you here, and you become my promoters. The only thing the thieves do is hurt my search engine rankings, and not even that in my experience.

    A brand is more than just your work. It’s you, your image, your representation. Spending all your time worrying about your SEO ranking means you’re missing the point. Of course a high result on a Google Search is important, but that’s just one piece of the pie.

    Someone is bound to tell me that all of this is nice and dandy, but why, technically, is it a bad idea to try and protect your media/data?

    Disabling right-click is supposed to keep people from downloading your media. But if I view the page source, I get the URL of your image, load that into a new browser window, and download your stuff. Or I can drag-and-drop the image to my desktop, if you’ve disabled view-source. Those don’t work? Why not check the cache of what my browser automatically downloaded when I visited your page? Or how about a screen shot of what’s active on my screen right now? That’s all stuff I can do without even needing to be code-savvy.

    Google for “download image when right click is disabled” and you’ll get millions of hits. There’s no way to protect your media, once it’s online. The best you can do is to brand it in such a way that even if someone does get a copy, it is permanently marked as ‘yours’. Watermarks are the normal way to do this and probably the only ‘good’ way to do it, as they tend to be the hardest thing to remove.
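
    If you want to automate the branding instead of firing up an image editor every time, PHP’s GD extension can stamp a watermark onto a copy of an image. A rough sketch (the file names are placeholders, and it assumes GD is installed):

    // A rough sketch: stamp a PNG watermark onto the bottom-right corner of a JPEG.
    // File names are placeholders; assumes the GD extension is available.
    $photo     = imagecreatefromjpeg( 'original.jpg' );
    $watermark = imagecreatefrompng( 'watermark.png' );

    $margin = 10;
    $dest_x = imagesx( $photo ) - imagesx( $watermark ) - $margin;
    $dest_y = imagesy( $photo ) - imagesy( $watermark ) - $margin;

    // Copy the watermark onto the photo, then save the branded copy.
    imagecopy( $photo, $watermark, $dest_x, $dest_y, 0, 0, imagesx( $watermark ), imagesy( $watermark ) );
    imagejpeg( $photo, 'branded.jpg', 90 );

    imagedestroy( $photo );
    imagedestroy( $watermark );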

    Don’t bother disabling right-click, or trying to stop people from downloading/stealing your stuff. Don’t put it online unless you’re willing to have it nicked. Make your brand identifiable with your site, and people will know to come to you.

  • WordPress 3.1 and Network Menu

    WordPress 3.1 and Network Menu

    This one’s really fast, but the word should get out there. It’s extremely important if you’ve built a BuddyPress plugin, because the BuddyPress menu has been moved. If you don’t update your plugin, your BuddyPress menus will vanish, and you will have very angry users.

    WordPress 3.1 has moved the admin menu for MultiSite. Instead of having a Super Admin menu available on the sidebar of your Admin section all the time, there’s a new link in the header bar for Network Admin. That’s right, it’s its own page! The problem is that a lot of plugins aren’t ready for that, and because of the move, their admin menu links vanished.

    Thankfully it’s an easy fix!

    The WP Codex Doc on Admin Menus tells us to use “admin_menu” when adding menus. Well, if you want your plugin to show up on the Network Admin page, you have to use “network_admin_menu” instead. (Trac #14435)

    See? I said it was easy.

    add_action( 'network_admin_menu', 'MENU_FUNCTION', ## );
    

    There’s one catch to this. What if your plugin is for MultiSite and non-MultiSite installs? Honestly, I don’t know if this matters, but just to be safe, I would do this:

    if ( is_multisite() ) { 
         add_action( 'network_admin_menu', 'MENU_FUNCTION', ## );
    } else {
         add_action( 'admin_menu', 'MENU_FUNCTION', ## );
    }
    

    Or this:

         add_action( 'network_admin_menu', 'MENU_FUNCTION', ## );
         add_action( 'admin_menu', 'MENU_FUNCTION', ## );
    

    That’s right! If it’s there and not needed, it does no harm! I’m not 100% certain right now if you need to do this for the non-menu calls people make (like calling a function on admin_menu to check for installation), but I’ve been adding it in with no ill effects. I figure, the BuddyPress core code does it, so it can’t hurt!
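
    For the curious, here’s what a more fleshed-out (and entirely hypothetical) version might look like, with a made-up menu function standing in for the placeholders above:

    // Hypothetical example: the function names, slug, and labels are made up.
    function my_plugin_menu() {
        add_menu_page(
            'My Plugin',               // page title
            'My Plugin',               // menu title
            'manage_options',          // capability required to see it
            'my-plugin',               // menu slug
            'my_plugin_settings_page'  // callback that renders the page
        );
    }
    function my_plugin_settings_page() {
        echo '<div class="wrap"><h2>My Plugin Settings</h2></div>';
    }
    // Hook it on both; whichever admin screen is loading will pick up the right one.
    add_action( 'network_admin_menu', 'my_plugin_menu' );
    add_action( 'admin_menu', 'my_plugin_menu' );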

  • Fighting Spam – We’re Doing it Wrong

    Fighting Spam – We’re Doing it Wrong

    I’ve talked about this before (Spam/Splog Wars) but instead of telling you how to fight spam, I’m going to talk about how it works.

    There are two main kinds of spambots: email and forum. ‘Forum bots’ is a misnomer, as they’re really ‘posting’ bots, but we won’t get into the semantics of names. These posting bots are basically written to scan the web, via links, looking for anything that accepts data (like a comment form on your blog) and submit spam content. There is a limited number of ways for data to be entered on a website, and generally it’s all done via a form. All a bot has to do is search the code of the page for form data and input its crap content. Given that coders tend to label form fields things like ‘content’ and ‘email’ and ‘username’ for ease of support and modification, it’s not too hard to see how spambots are written to exploit that functionality.

    The main ways to prevent spam are to block the spammers or to make users prove they’re real people. Obviously manual spammers (and yes, there are people paid to post spam on blogs) can’t be defeated this way. Blocking spammers is done either by their behavior (if they access your site via anything other than a browser, or if they post data in any way other than from the ‘post comment’ link on your site, they’re probably a spammer) or by a known-bad-guy list. Both have problems, as spammers have multiple methods of attack. Making a person prove their humanity is generally done via CAPTCHA, which I’ve mentioned before has its own issues. (CAPTCHA Isn’t Accessible) Another way is to go at it in reverse: knowing that spam bots scan a webpage for fields they can fill in, you set up a hidden field which only a spammer would fill in. If they fill that field in, they’re spam and they go to the bin.
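
    As a rough sketch of that last trick on a WordPress comment form (the field name and function names here are made up, and the CSS that actually hides the field is assumed to live in your stylesheet):

    function ipstenu_example_honeypot_field() {
        // A field humans never see (hide .hp-field with display:none in your CSS).
        echo '<p class="hp-field"><label>Leave this empty: <input type="text" name="website_url_2" value="" /></label></p>';
    }
    add_action( 'comment_form', 'ipstenu_example_honeypot_field' );

    function ipstenu_example_honeypot_check( $commentdata ) {
        // A form-filling bot usually stuffs every field it finds; a human leaves this one blank.
        if ( ! empty( $_POST['website_url_2'] ) ) {
            wp_die( 'Comment flagged as spam.' );
        }
        return $commentdata;
    }
    add_filter( 'preprocess_comment', 'ipstenu_example_honeypot_check' );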

    I think that our methods for blocking spammers are wrong.

    Everything we’re doing is defensive. We wait for a spammer to attack, sort out what they did, and write better code to stop them. Sure, we have some Bayesian spam tools out there, and some fairly impressive algorithmic learning engines that check the content of a post and punt it to the trash. But in the end, it’s all reactive medicine.

    Anyone who does support will tell you that even one email or comment sent to the spam bin accidentally is bad. How bad? How many times have you said ‘Did you check your spam emails?’ to someone who didn’t get an email? The more you impede your real users, the regular people who visit your site, the worse your spam-fighting is. If a user has to jump through hoops to comment, they’re not going to do it. If a user has to register to post, they probably won’t do it (forums are an exception to this).

    The goal of spam fighting should be to prevent the spam from being sent.

    We all know why spammers spam. If even one person clicks on their links and buys a product, they’ve profited. It’s just that simple. They make money for a low-cost output, it works, so they keep doing it. Therefore the only way to make them stop is to remove the profitability. Of course if it was that easy, we’d have done it already.

    This raises the question: who runs these spam bots? The answer is two-fold. There are people out there who run spamming programs, certainly, but there are also people who’ve had their computers infected by a virus or malware and unknowingly spam people. That’s part of why I don’t like blocking IPs. However, the software had to come from somewhere. Cut off the head of the snake and the spam stops! We already have the Spamhaus Project, which patrols the internet and looks for sites pushing malware, as well as sending spam email. There’s also Project Honey Pot, which fakes links and forms to trick spam-bots.

    Sadly, until we have more resources dedicated to stopping spam service providers, we’re going to have to keep trying to outthink them and play defense.

  • WordPress Google Libraries

    WordPress Google Libraries

    A lot of people would rather use Google Hosted JavaScript Libraries. Why? Here are three good reasons. Okay, great. How do you do it in WordPress? DigWP has you covered.

    But if you want to do it for WordPress MultiSite, for all sites on your network, you can toss this into your mu-plugins folder. I named my file googlelib.php and dropped it in. Bam.

    Oh and there’s also the Use Google Libraries plugin, by Jason Penney, which works great too. Just drop the use-google-libraries.php file into your mu-plugins folder and call it a day.

    <?php
    /*
    Plugin Name: Google Lib
    Plugin URI: http://digwp.com/2009/06/use-google-hosted-javascript-libraries-still-the-right-way/
    Description: Use Google Hosted JavaScript Libraries (... Still the right way)
    Version: 1.0
    Author: Mika Epstein
    Author URI: http://ipstenu.org/
    */
    
    if ( ! is_admin() ) {
       // On the front end only, swap WordPress's bundled jQuery for the copy on Google's CDN.
       wp_deregister_script('jquery');
       wp_register_script('jquery', ("http://ajax.googleapis.com/ajax/libs/jquery/1.4.4/jquery.min.js"), false, '1.4.4');
       wp_enqueue_script('jquery');
    }
    
    ?>
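
    If deregistering scripts that early makes you nervous, a slightly more cautious variant (a sketch, not Jason’s plugin) waits for the wp_enqueue_scripts hook, which only fires on the front end, before swapping jQuery out:

    // A sketch of the same idea, deferred until WordPress has registered its default scripts.
    function ipstenu_example_google_jquery() {
        wp_deregister_script( 'jquery' );
        wp_register_script( 'jquery', 'http://ajax.googleapis.com/ajax/libs/jquery/1.4.4/jquery.min.js', false, '1.4.4' );
        wp_enqueue_script( 'jquery' );
    }
    add_action( 'wp_enqueue_scripts', 'ipstenu_example_google_jquery' );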