Half-Elf on Tech

Thoughts From a Professional Lesbian

Tag: spam

  • Not Mailbag: Where Contact Forms Fail

    My friend Andy, reading last Friday’s post, remarked no one should have to put up with crap like that. He’s right, and I mentioned that most contact forms don’t allow you to filter via your WordPress blacklists or comment moderation settings.

    Surprised?

    You should be.

    Back in March 2014, I raised this with Jetpack, saying that the Feedback ignores Blacklists.

    You have a moderation list and a blacklist.

    You have a user you want to block from commenting forever. You add them to the blacklist. Surprise! They can still use the feedback form!

    This should behave just like the blacklist on comments: It blackholes them. Done and gone. After all, you didn’t want them around.

    Logically, I can see why it doesn’t use the comment checks. If you have a check that only lets users with a previously approved comment leave more comments freely, this would be a problem: there’s no ‘pending’ value for feedback.

    And the first reply … Well it made me mad back then. I say this as someone who is good friends with the fellow who commented, but back in 2014, I wanted to smack the back of his head.

    This would be super easy to get around, just changed the alleged from email address. Besides, blacklist tends to be things that shouldn’t be displayed publicly automatically, allowing contacts would let them appeal the blacklist.

    I could see grounds for adding a filter to have grunion follow the commenting blacklist though. Less sold on an admin option.

    Now go back and read last week’s post. I have not blacklisted the rather vile word used in that comment because I have a friend who is dyslexic and often says ‘cuntry’ instead of ‘country.’ It’s an honest mistake on her part. We added in an autocorrect to her phone and tablet. But blocking short words is hard. Still. The IP address? You bet that hit my blacklist.

    If I still had a contact form, that moron could still harass me.

    As I replied to George:

    Sure, and it’s just as easy to get around the current blacklists in WP. The point is, though, if you’ve put someone’s email on your comment blacklist, the assumption can be made that you have a good reason. You DON’T want this person commenting on your site, so why are you making it easy for them to harass you? And yeah, I used ‘harass’ intentionally.

    Certainly I can and do block their emails on the server, but I still have to go in and clean out the messages in feedback once in a while, and I for one get a lot of pretty vile garbage from people. So having one less place to have to read their BS would be beneficial.

    It’s always been relatively easy to work around if you’re a dedicated troll, but if the blacklist just blackholed their contact messages, it would do a lot for your mental health.

    Because he’s right that a dedicated asshole will work around the blacklists. They do it today. Still, I feel there’s no reason to make it easier for them. And while I can block at the server level, not everyone has my skills. For those people, should we not introduce Akismet-level scans on feedback forms?

    You see, the reason I was mad at George back then is his argument felt like he was saying “since it can be worked around, this is a bad idea.”

    That is absolutely not what he meant.

    Even if I didn’t know George well, I have simple proof he didn’t think this was a stupid idea; he thought it was an idea that begat caution. What proof? He didn’t close the issue. In fact, he gave it a milestone to review.

    Now, sadly, it’s been two years with no traction. Every so often someone bumps the milestone, which means it’s among the 600+ tickets that need attention. But it lingers. It’s not a priority.

    Jetpack and Akismet are both owned by the same company. If you have the Akismet plugin installed and activated, and have an active subscription, every form submission will be checked for spam.
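
    For a form plugin that doesn’t get that for free, the check itself isn’t much work. Here’s a minimal sketch of running a submission through Akismet’s comment-check endpoint. The endpoint and its parameters are Akismet’s real API; the function name and how you wire it into any particular form plugin are placeholders, not anyone’s shipping code.

    // Hypothetical helper: returns true if Akismet thinks this feedback is spam.
    function halfelf_feedback_is_spam( $name, $email, $message ) {
        $api_key = 'YOUR_AKISMET_KEY'; // Placeholder.

        $response = wp_remote_post(
            "https://{$api_key}.rest.akismet.com/1.1/comment-check",
            array(
                'body' => array(
                    'blog'                 => home_url(),
                    'user_ip'              => $_SERVER['REMOTE_ADDR'],
                    'user_agent'           => $_SERVER['HTTP_USER_AGENT'],
                    'comment_type'         => 'contact-form',
                    'comment_author'       => $name,
                    'comment_author_email' => $email,
                    'comment_content'      => $message,
                ),
            )
        );

        // Akismet replies with a literal 'true' when it calls the submission spam.
        return ! is_wp_error( $response ) && 'true' === trim( wp_remote_retrieve_body( $response ) );
    }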

    They need to take it to the next level. So do all forms plugins. From what I can tell, Ninja Forms has a simple spam-prevention field but no blacklist support. Gravity Forms has an old, unmaintained third-party plugin, Gravity Forms Email Blacklist.

    In fact … the only contact form plugin I could find that actually uses WordPress’ built in blacklist would be Takayuki-san’s Contact Form 7.
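
    For the developers reading along, respecting that blacklist takes very little code. Here’s a minimal sketch, assuming a hypothetical submission array and an illustrative hook name (don’t quote me on the hook); wp_blacklist_check() itself is the real WordPress function the comment blacklist already uses.

    // Run a contact form submission through the Settings → Discussion blacklist.
    function halfelf_feedback_respects_blacklist( $is_spam, $submission ) {
        $blacklisted = wp_blacklist_check(
            $submission['author'],
            $submission['email'],
            '', // No URL field on this form.
            $submission['message'],
            $_SERVER['REMOTE_ADDR'],
            $_SERVER['HTTP_USER_AGENT']
        );

        return $blacklisted ? true : $is_spam;
    }
    // 'contact_form_is_spam' is an assumed hook name, used here only for illustration.
    add_filter( 'contact_form_is_spam', 'halfelf_feedback_respects_blacklist', 10, 2 );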

    Let us protect ourselves from abuse.

  • Referrer Spam In Adsense

    Referrer Spam In Adsense

    You may have heard of Semalt.com. I’ve heard them argue that they’re not spammers, they’re not evil, they’re not bad people.

    You know what? They are. They are spamming, they are doing evil, and they’re bad people.

    The other day I was checking my top sites in Google Adsense, trying to think of how to increase my passive income, when I saw this random domain showing up on my list of sites. A site that wasn’t mine. A site that looked like a spammer:

    Adsense top sites shows one of Semalt's URLs

    Why is this happening?

    According to Google, this happens when a site loads cached content of your domain (Google does this). It can also happen when someone copies your whole webpage into an HTML email, or if someone uses a bad iframe.

    There’s also the obvious, but rare, case where someone uses your code without your knowledge.

    Do I need to block them?

    No. Except for the part where they screw up your analytics metrics and cause load on your server. Keep reading, I’ll explain.

    Will I Be Penalized by Google?

    My first thought was “Oh shit, Google’s going to yell at me!” I quickly checked that I had site authorization on, which means only domains I’ve approved and added can show my ads. Whew.

    This is a big deal by the way. While it would be nice to earn more views, if a site that isn’t mine uses my ads without my knowledge, I can get in trouble. More than once I’ve told off plugin developers about using Adsense in their plugins. This is for a couple of reasons: first, you can use it to track who uses your plugin (bad), but also Google doesn’t want you to. They outright say that you cannot put ads “on any non-content-based page.” An admin dashboard is not a content page. Done and done. No ads in your plugins, thank you.

    But that’s exactly why I was worried!

    Where is Semalt showing my ads?

    What is this URL for anyway?

    The URL was http://keywords-monitoring-your-success.com/try.php?u=http%3A%2F%2Fexample.com (not my real URL). The only reason I could find it was I dug into my Google stats and found it as a referrer. If you happen to pop that into a browser, you will be redirected to http://semalt.com/ — Real nice.

    That is, by the way, how I knew it was Semalt.

    What is Semalt?

    Semalt is a professional SEO and marketing service. They literally make their money ‘crawling’ websites. When their site started, it was really the scammiest-looking thing I’d seen in a long time. A year and a half later, they’ve cleaned up their act a bit, but back in 2014 we all looked at them with a massive Spock eye.

    As it turned out, they were using infected computers to scan the web. My personal guess was that they were leveraging hacked computers and using them to scan for vulnerable websites. Once they find a site, they hack it and use it to push malware.

    That’s a guess. I have no proof. But based on their search patterns and behavior, it’s looking pretty likely to me.

    Can I block them?

    Yes! But there’s a catch.

    You see, everyone says you can do this:

    # Block visits from semalt.com 
    RewriteEngine on 
    RewriteCond %{HTTP_REFERER} ^http://([^.]+\.)*semalt\.com [NC]
    RewriteRule .* - [F]
    

    And while that works, it’s obvious that Semalt is on to us because now they use keywords-monitoring-your-success.com and other URLs as passthroughs.
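
    If you’d rather keep the blocklist in one place and grow it as new passthrough domains show up in your logs, here’s a hedged PHP sketch of the same idea. The function name is mine, and the list only holds the two domains named in this post; add to it as you find more.

    // Refuse requests whose referrer matches a known spam domain (or a subdomain of one).
    function halfelf_block_spam_referrers() {
        $spam_referrers = array(
            'semalt.com',
            'keywords-monitoring-your-success.com',
        );

        $referrer = isset( $_SERVER['HTTP_REFERER'] ) ? parse_url( $_SERVER['HTTP_REFERER'], PHP_URL_HOST ) : '';

        if ( ! $referrer ) {
            return;
        }

        foreach ( $spam_referrers as $domain ) {
            if ( $referrer === $domain || substr( $referrer, -strlen( '.' . $domain ) ) === '.' . $domain ) {
                wp_die( 'Forbidden', '', array( 'response' => 403 ) );
            }
        }
    }
    add_action( 'init', 'halfelf_block_spam_referrers' );

    The .htaccess version is still lighter on your server, since Apache refuses the request before WordPress ever loads; the PHP version just makes the list easier to maintain from a single mu-plugin.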

    How do I get them out of my analytics?

    Do you use WordPress.com? Or Jetpack? Great! Report the referrer as spam! WordPress.com blocked Semalt back in 2014, but obviously they’re on the rise again.

    If you’re using Google Analytics, Referrer Spam Blocker is probably your best bet.

  • Contact Form 7 and Anti-Spam

    DreamHost has a fairly simple anti-spam policy, which can be summed up as this: You cannot send email from an address that isn’t on your domain.

    If that was Greek to you, don’t worry. What that means is that my WordPress blog here can only send emails as elftest.net. That poses a small problem if you’re not using your domain name to send email (a rare occurrence in WordPress), and a large one if you happen to be using the popular Contact Form 7 plugin.

    Contact Form 7 lets you create robust contact forms for your site; however, it has one minor ‘flaw’ (and I hesitate to use that word). When it sends email, it sends it from the user who submits the form. DreamHost, naturally, doesn’t like this. joe@gmail.com isn’t an elftest user!

    Thankfully there’s a work-around for you, and it’s really easy. For most people, the plugin SMTP Configure, once installed and activated, will automatically fix this for you! It’s written by a reliable and trusted programmer, and I highly recommend it. Remember! Once you install the plugin, just activate it. For the vast majority of people, that’s it. Everything magically works.

    Then there were some people who came and said “No, this does not work.” I’ve yet to reproduce it, but one person told me that after putting in his SMTP credentials, just like you would when setting up an email client, it worked perfectly.
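
    If you’d rather not add a plugin, or you just want to see what the fix boils down to, here’s a hedged sketch using WordPress’ own wp_mail_from and wp_mail_from_name filters. I’m not claiming this is what SMTP Configure does internally; it’s simply a manual way to force the From address onto your own domain, and it applies to every email your site sends. The address below is a placeholder.

    // Force all outgoing mail to come from an address on this domain.
    add_filter( 'wp_mail_from', function () {
        return 'forms@elftest.net'; // Placeholder: use a real address on your domain.
    } );

    add_filter( 'wp_mail_from_name', function () {
        return 'Elf Test Contact Form';
    } );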

    Additional: If you’re using Jetpack’s contact form, and you’ve changed the ‘to’ email address, you will also need this plugin. You’ll know you’re using that option because you’ll see this in your contact form shortcode:

    to="me@myotherdomain.com"
  • Google vs Splogs – Part 2

    Now that you know all about the myth of the duplicate content penalty, we can look into spam.

    This year, Google got slammed because the quality of their search was being degraded by spammers. Mostly splogs, I will admit, but Google rightly points out that their ability to filter out spam and splogs in all languages is actually much better than it was five years ago. (Google search and search engine spam – 1/21/2011 09:00:00 AM) No, Google isn’t getting worse, there are just more spammers out there. They also take the time to differentiate between “pure webspam” and “content farms.”

    “Pure webspam” is what you see in a search result when a website uses meta data or hidden content in order to bully their way into being highly ranked in unrelated searches, or just basically game the system. A decade ago, this was horrific. Now it’s nearly negligible. This type of spam grew pretty organically out of people trying to understand the algorithm behind search engines and manipulate it legally. As we gained greater understanding of meta keywords and in-context content, we came up with more and more tricks to legitimately make our sites more popular. There was a point in time when having hidden text stuffed with as many keywords related to your site as possible was not only commonplace, but lauded. It didn’t last long, as shortly after the good guys sorted that out, the bad guys did too.

    “Content farms” are the wave of the future, and Google calls them sites with “shallow or low-quality content.” The definition is vague, and basically means a content farm is a website that trolls the internet, takes good data from other sites, and reproduces it on their own. Most content farms provided automatically inserted data. There is no man behind the scenes manually scanning the internet for related topics and copy/pasting them into their site. Instead, this is all done via software known as content scrapers. The reasons why they do this I’ll get to in a minute, but I think that Google’s statement that they’re going to spend 2011 burning down the content farms is what’s got people worried about duplicate content again.

    A content farm is (partly) defined as a website that exists by duplicating content. Your site’s activity feed/archives/post tags pages are duplicating content for the users. Does that mean your site will be adversely affected because of this?

    No. It will not.

    Google’s algorithm is targeting sites of low content quality. While your stolen post is a beautifully written piece of art on its own, it’s the site as a whole that is used to generate a search ranking. As I’ve been touting for a decade, the trick to getting your site promoted in Google searches is to make a good site. Presuming you made a good site, with good content, and good traffic, and it’s updated regularly, there is very little risk that Google will peg your site as being of “low content quality.” Keep that phrase in mind and remember it well. Your site isn’t highly ranked because of low content, remember! It’s the reverse. If you’re being ranked for good behavior, good content, and good work, you will continue to be rewarded. In a weird way, content farms are actually helping Google refine their search so that it can tell the difference between good sites and bad! (Why The Web Needs Content Farms – by Eric Ward on February 16, 2011)

    The next Google algorithm update will focus on cleaning content farms from positions of unfair advantage in our index. This will likely affect websites with considerable content copied from other online sources. Once this update is complete, preference will be given to the originators of content. We expect this to be in effect in no less than 60 days. (Google search and search engine spam – 1/21/2011 09:00:00 AM)

    What Google is doing is not only laudable, but necessary. They are adapting to the change in how spam is delivered, and doing so in a way that should not impact your site. The only ‘innocent’ sites I can see this affecting are blogs that use RSS feed scrapers to populate their content. This is why anytime someone asks me how to do that, I either tell them don’t or I don’t answer at all. While I certainly use other news articles to populate my site, I do so by quoting them and crafting my own, individual, posts. In that manner I both express my own creativity and promote the high quality of my own site. I make my site better. And that is the only way to get your site well-ranked. Yes, it is work, and yes, it is time consuming. Anything worth doing is going to take you time, and the sooner you accept that, the happier you will be.

    For most small to medium sites, there’s not a thing you need to do in order to maintain your ranking. There are no magic bullets or secret SEO tricks to manipulate your site into a better ranking. In point of fact, doing so can be seen as gaming the system and can downgrade your results! Once again: make a good site and you will be rewarded. Certainly, as I said yesterday, optimizing your robots.txt file and getting a good sitemap will help, and I really do suggest a Google Webmaster Tools account to help you with that. In 2011, Google is still king, so once you get your site well listed within Google’s machine, you’re pretty much going to be tops everywhere.

    Why do splogs and content farms game the system in order to get highly ranked? Profit. Some do it to get their domain highly ranked and then sell it for a lot of money, others do it to infect your computer with a virus, and then there’s the rare hero who thinks this will get them money because of the ads on their site. Sadly, this still works often enough to generate just enough profit to keep the splogs going. This is also true of spam emails. Yes, that means your grandmother and Carla Tallucci are still falling for the Nigerian Princess scam emails. The only way to stop all of that is to stop those methods from being productive money makers for the spammers, and that is something that will take us all a very long time and a great deal of education of the masses.

    Your takeaways are pretty simple. Make a good site with good content. Update it regularly. Use a sitemap to teach search engines what’s important. You’ll be fine. Don’t sweat internal duplication.

  • Google vs Splogs – Part 1

    I am not an SEO expert. In fact, there are only a handful of people whom I feel can claim that title without making me roll my eyes so hard I get a migraine. Anyone who tells you they have all the answers to get your site listed well in Google is a liar, because there’s only one good answer: Make a good site. That’s really it. How then do all those spam sites get listed in Google, Bing and Yahoo to begin with, and will the techniques the search engines are using to eradicate those sites hurt you?

    Everyone’s done that search for something and been nailed by a bunch of splogs. Like you look for ‘Laurence Fishburne’ news and you get 100 websites that claim to have news, but really it’s just total, useless crap? Those are splogs, a neologism for spam blogs, and they suck. Splogs are blogs where the articles are fake, and are only created for search engine spamming. They sometimes steal your hard work, by scraping RSS feeds or who knows what else, and use it to generate fake content. Why? Some people do it to infect your PC with a virus, and others do it to trick you into clicking on their ads.

    The problem is spam blogs are so prevalent that they’re adversely affecting search engines, making it harder and harder for you to find real, factual content. This year, rumors started flying that Google was going to go on the warpath against Search Engine Spam, and in doing so, would downgrade perfectly valid sites with ‘duplicate content.’ Having read and re-read the articles posted by Google on the matter, I’m quite certain that, yet again, people are playing Chicken Little. Nowhere, in any of the articles I’ve read, has there been any discussion of the intent to penalize legitimate, valid websites for containing internally duplicated content.

    In order to understand the duplicate content penalty problem, and yes, it is a problem, you need to understand how most content management systems (CMS – this includes sites like Drupal, Joomla and WordPress) display their data to the users.

    You write a blog post and the content is stored in the database, along with any tags, categories, or meta data you put in. When someone goes directly to the blog post, they see it. However, they can also see the post if they go to a list of posts in that category, with that tag, on that date, in that year, etc etc and so on and so forth. So the question a lot of new webfolks ask is “Is that duplicate content?” No. It’s not. Nor is having more than one URL point to the same page. In fact, that’s good for your site. The more valid ways you have of providing your user with information, the easier it is for them to find what they want, and the happier they are. Happy users means repeat users, which means profit (in that oh so nebulous “web = profit” theory).

    So what is this mysterious duplicate content penalty?

    Let’s take this from the horse’s mouth (or at least Google):

    Let’s put this to bed once and for all, folks: There’s no such thing as a “duplicate content penalty.” At least, not in the way most people mean when they say that. (Demystifying the “duplicate content penalty” – Friday, September 12, 2008 at 8:30 AM)

    Google goes on to outright state that so long as the intention is well meaning (like making it easier for people to find your valid and useful content), then you will receive no adverse effects in searches for your blog. That means 99.999% of you out there can relax and walk away. What about people who use things like WordPress MU Sitewide Tags Pages (which takes the excerpts of all posts on a WordPress MultiSite installation and duplicates them onto another site), or BuddyPress’s activity stream (which records everything in multiple places)? Again, the answer is the same. You’re doing this to make the site more available and accessible, ergo no harm ergo no foul.

    Google also makes the claim that since CMSs generally don’t handle duplicate content ‘well’ (their word, not mine), non-malicious duplication is common and fairly harmless, though it will affect search results. Here’s where things get sticky. Personally, I disagree with Google’s claim that CMSs handle duplicate content poorly. A well written CMS, knowing that no two people think the same way, takes that into consideration when crafting a site. You want an index, but if you know someone looks for things by subject matter or year, you need to have a way to provide that information for the reader. Google’s problem is that in doing so, you have also provided it for the GoogleBots who patrol your site and pull in the data for searches, which makes the dreaded duplicate content.

    Perhaps Google has forgotten (or not made the connection) that they do the exact same thing. They want to show you what you want to see, and while I may search for “Laurence Fishburne actor” and you might look for “Morpheus Actor”, in the end, we both want to see sites about this actor guy named Laurence Fishburne. How do you make sure we get the right information? You have the content sortable in myriad manners. Does that make it duplicate content? Of course not (unless you’re Bing, which is a whole different subject). Google points out:

    Most search engines strive for a certain level of variety; they want to show you ten different results on a search results page, not ten different URLs that all have the same content. To this end, Google tries to filter out duplicate documents so that users experience less redundancy. (Demystifying the “duplicate content penalty” – Friday, September 12, 2008 at 8:30 AM)

    Thankfully, you can eliminate redundancy by providing Google with a sitemap of your website. (About Sitemaps – Google Webmaster Central) With a good sitemap, you can tell search engines how to weigh your site’s content. Which pages are more important, which can be ignored, etc etc. With WordPress and a good plugin, this can be done automatically by making a few choices in an admin interface. You also want to spend a little time understanding your robots.txt file. Perishable Press has a great article on optimizing it for WordPress.

    Now that you know about the myth behind the duplicate content penalty, tomorrow we can get into content farms!

  • Fighting Spam – We’re Doing it Wrong

    I’ve talked about this before (Spam/Splog Wars) but instead of telling you how to fight spam, I’m going to talk about how it works.

    There are two main kinds of spambots: email and forum. Forum bots are a misnomer, as they’re really ‘posting’ bots, but we won’t get into the semantics of names. These posting bots are basically written to scan the web, via links, looking for anything that accepts data (like a comment form on your blog) and submit spam content. There is a limited number of ways for data to be entered on a website, and generally it’s all done via a form. All a bot has to do is search the code of the page for form data and input its crap content. Given that coders tend to label form fields things like ‘content’ and ‘email’ and ‘username’ for ease of support and modification, it’s not too hard to see how the spambots are written to exploit that functionality.

    The main ways to prevent spam are to block the spammers or make users prove they’re real people. Obviously manual spammers (and yes, there are people paid to post spam on blogs) can’t be defeated this way. Blocking spammers is either done by their behavior (if they access your site via anything other than a browser, or if they post data in any way other than from the ‘post comment’ link on your site, they’re probably a spammer) or by a known-bad-guy list. Those both have problems, as spammers have multiple methods of attack. Making a person prove their humanity is generally done via CAPTCHA, which I’ve mentioned before has its own issues. (CAPTCHA Isn’t Accessible) Another way is to go at it in reverse. Knowing that the spam bots scan a webpage for fields they can fill in, you set it up so there’s a hidden field which only a spammer would fill in. If they fill that field in, they’re spam and they go to the bin.
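
    Here’s a minimal sketch of that hidden-field (honeypot) trick, assuming a hand-rolled PHP form; the field name is arbitrary, it just has to look tempting to a bot.

    // Somewhere in the form output: a field humans never see and never fill in.
    echo '<p style="display:none;"><label>Leave this empty'
        . '<input type="text" name="website_url_2" value="" tabindex="-1" autocomplete="off">'
        . '</label></p>';

    // In the form handler: anything in the honeypot means a bot filled out the form.
    if ( ! empty( $_POST['website_url_2'] ) ) {
        http_response_code( 403 ); // Straight to the bin; no helpful error for the bot.
        exit;
    }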

    I think that our methods for blocking spammers are wrong.

    Everything we’re doing is defensive. We wait for a spammer to attack, sort out what they did, and write better code to stop them. Sure, we have some Bayesian spam tools out there, and some fairly impressive algorithmic learning engines that check the content of a post and punt it to the trash. But in the end, it’s all reactive medicine.

    Anyone who does support will tell you that even one email or comment sent to the spam bin accidentally is bad. How bad? How many times have you said ‘Did you check your spam emails?’ to someone who didn’t get an email? The more you impede your real users, the regular people who visit your site, the worse your spam-fighting is. If a user has to jump through hoops to comment, they’re not going to do it. If a user has to register to post, they probably won’t do it (forums are an exception to this).

    The goal of spam fighting should be to prevent the spam from being sent.

    We all know why spammers spam. If even one person clicks on their links and buys a product, they’ve profited. It’s just that simple. They make money for a low-cost output, it works, so they keep doing it. Therefore the only way to make them stop is to remove the profitability. Of course if it was that easy, we’d have done it already.

    This begs the question ‘Who runs these spam bots?’ It’s twofold. There are people out there who run spamming programs, certainly, but there are also people who’ve had their computers infected by a virus or malware, and unknowingly spam people. That’s part of why I don’t like blocking IPs. However, the software had to come from somewhere. Cut off the head of the snake and the spam stops! We already have the Spamhaus project, which patrols the internet and looks for sites pushing malware, as well as sending spam email. There’s also Project Honey Pot, which fakes links and forms to trick a spam-bot.

    Sadly, until we have more resources dedicated to stopping spam service providers, we’re going to have to keep trying to out think them and play defense.