I was writing a script for a rather complex series of checks on a website, and I wanted to do something I thought would be simple: grab the site’s headers and parse them to check whether or not it was running WordPress.
That was a weird and wild trip.
In theory, this is all the code you need:
filter_var($url, FILTER_VALIDATE_URL)
But as it turns out, that’s actually not the best thing! I started using it but found I could break it pretty easily. Since I was writing a tool I knew would be used by end-users (who are exceptionally creative when it comes to breaking things), I googled around and found this blog post on Why URL validation with filter_var might not be a good idea.
Yikes! All I was mad about was that FILTER_VALIDATE_URL thinks http://foo is okay, even when you tell it you want the damn host.
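Here’s a quick example of exactly that (a minimal illustration, not code from my script):

<?php
// A single-label, dotless host sails right through validation.
var_dump( filter_var( 'http://foo', FILTER_VALIDATE_URL ) );
// string(10) "http://foo"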
In the end I used this Strict URL Validator code, but even then I had to wrap all of this around it:
// Sanitize the URL and strip any trailing slash
$gettheurl = (string) rtrim( filter_var( $_POST['url'], FILTER_SANITIZE_URL ), '/' );
// If there's no scheme at all, assume plain old http://
if ( ! preg_match( "~^(?:f|ht)tps?://~i", $gettheurl ) ) {
    $gettheurl = "http://" . $gettheurl;
}
// Is it a real URL? Call StrictUrlValidator
require_once 'StrictUrlValidator.php';
if ( StrictUrlValidator::validate( $gettheurl, true, true ) === false ) {
    // Do the needful
}
In many ways, this makes some sense. What is and isn’t a URL can be tricky to check. http://foo is real, and I can use it locally; that’s why http://localhost can exist. And we can’t just say “if it’s not .com/.org/.net, it’s not a URL” anymore (if we ever really could). But boy is my code clunky.
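For what it’s worth, once a URL finally validates, the WordPress check itself is the easier half. Something along these lines is roughly what I mean by parsing the headers (a sketch only; the specific fingerprints here, the REST API Link header and X-Pingback, are illustrative guesses rather than my script’s actual logic):

<?php
// Rough sketch: fetch the headers and look for common WordPress signals.
$headers = @get_headers( $gettheurl, 1 );
$is_wordpress = false;
if ( false !== $headers ) {
    // Normalize header names, since servers vary the case.
    $headers = array_change_key_case( $headers, CASE_LOWER );
    // WordPress advertises its REST API in a Link header.
    if ( isset( $headers['link'] ) ) {
        foreach ( (array) $headers['link'] as $link ) {
            if ( false !== stripos( $link, 'rel="https://api.w.org/"' ) ) {
                $is_wordpress = true;
            }
        }
    }
    // Most WordPress sites also send an X-Pingback header pointing at xmlrpc.php.
    if ( isset( $headers['x-pingback'] ) ) {
        $is_wordpress = true;
    }
}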

