The other day I was surprised to learn that Google still looks for tech.ipstenu.org.
Kind of.
If you go search for it, Google still believes that URL is a real thing: https://www.google.nl/search?q=site:tech.ipstenu.org
Some of those URLs were made long after I mapped the domain, by the way. And yes, of course I have a 301 redirect for the subdomain.
<If "%{HTTP_HOST} == 'code.ipstenu.org' || %{HTTP_HOST} == 'tech.ipstenu.org' "> RedirectMatch 301 (.*) https://halfelf.org$1 </If>
What’s going on here? Strictly speaking, Google’s right and stupid. The URLs are correct, but Google should be honoring the 301 redirect. Since it’s not, you have to tell it not to trawl your mapped subdomains, using a robots.txt file served just for those subdomains.
First we’ll need to make a special robots.txt file, like robots-mapped.txt, and put the following in it:
User-agent: *
Disallow: /

User-agent: Googlebot
Noindex: /
This tells Google to sod off. Then you need to specify when to use this special file, and that brings us to the land of options. Since .htaccess is a top-down file (that is, it's read from the top of the file down), you can get away with this:
# Serve the mapped robots file for these subdomains
RewriteCond %{HTTP_HOST} ^(code|tech)\.ipstenu\.org$ [NC]
RewriteRule ^robots\.txt$ /robots-mapped.txt [L]
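To see the ordering in context, here's a sketch of the whole thing in mod_rewrite terms (the halfelf.org redirect is just my rule from above translated, so adjust the target for your own setup; the extra RewriteCond keeps the mapped robots file itself from being bounced away):

RewriteEngine On

# Serve the mapped robots file for these subdomains first...
RewriteCond %{HTTP_HOST} ^(code|tech)\.ipstenu\.org$ [NC]
RewriteRule ^robots\.txt$ /robots-mapped.txt [L]

# ...then redirect everything else on them,
# skipping robots-mapped.txt so it actually gets served
RewriteCond %{HTTP_HOST} ^(code|tech)\.ipstenu\.org$ [NC]
RewriteCond %{REQUEST_URI} !^/robots-mapped\.txt$
RewriteRule (.*) https://halfelf.org/$1 [R=301,L]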
Just have that above any redirect rules for other things. But what if, like me, you’ve got Apache 2.4?
<If "%{HTTP_HOST} == 'code.ipstenu.org' || %{HTTP_HOST} == 'tech.ipstenu.org' "> RedirectMatch 301 ^/robots\.txt /robots-mapped.txt RedirectMatch 301 (.*) https://halfelf.org$1 </If>
Of course, that sends tech.ipstenu.org/robots.txt to https://halfelf.org/robots-mapped.txt, which is scary but still works, so don't panic.
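If you want to watch the hops yourself, curl will show you each Location header along the way (output trimmed, and your exact chain may look a little different):

$ curl -sIL http://tech.ipstenu.org/robots.txt | grep -i '^location'
Location: http://tech.ipstenu.org/robots-mapped.txt
Location: https://halfelf.org/robots-mapped.txt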
Another way to do it would be to have a massive rewrite for all my subdomains:
# All Mapped
<If "%{HTTP_HOST} == 'code.ipstenu.org' || %{HTTP_HOST} == 'tech.ipstenu.org' || %{HTTP_HOST} == 'photos.ipstenu.org'">
    RedirectMatch 301 ^/robots\.txt /robots-mapped.txt
</If>
I will note, it should be possible to have (code|tech).example.com work in there instead of all those OR statements. Apache 2.4's expression syntax does support regex matching with =~, so something like this should do the trick, though I haven't tested it myself (corrections welcome in the comments!):
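# One regex instead of a pile of ORs (untested)
<If "%{HTTP_HOST} =~ /^(code|tech|photos)\.ipstenu\.org$/">
    RedirectMatch 301 ^/robots\.txt /robots-mapped.txt
</If>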
The last step is to fight with Google Webmaster Tools. Add your subdomains, and the robots.txt checker should show your new rules being picked up.
If you don’t, don’t panic. Go to the Fetch as Google page and tell it to fetch robots.txt; that will force it to recache. Once you have it right, ask Google to remove the URL from their index, and in a few days it'll sort itself out.
It’s very annoying and I don’t know why the 301 isn’t honored there, but oh well. At least I can make it work.