I was writing a script for a rather complex series of checks on a website. I wanted to do what I thought would be simple. I wanted to grab the website headers and parse them to check if a site was WordPress or not.

That was a weird and wild trip.

In theory, this is all the code you need:

filter_var($url, FILTER_VALIDATE_URL)
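Used as an actual check, it reads something like this ($url being whatever string you collected from the user):

	// filter_var() returns the URL itself on success and false on failure
	if ( filter_var( $url, FILTER_VALIDATE_URL ) !== false ) {
		// Looks like a URL, carry on
	} else {
		// Nope, not a URL (at least according to PHP)
	}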

But as it turns out, that’s actually not the best thing! I started using it but found that I could break it pretty easily and, since I was writing a tool I knew would be used by end-users (who are exceptionally creative when it comes to breaking things), I googled around and found this blog post on Why URL validation with filter_var might not be a good idea.

Yikes! All I was mad about was that FILTER_VALIDATE_URL thinks http://foo is okay, even when you tell it you want the damn host.
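If you want to see it for yourself, this is all it takes:

	// FILTER_VALIDATE_URL is perfectly happy with a single-label host,
	// host flag or no host flag; "foo" counts as a host
	var_dump( filter_var( 'http://foo', FILTER_VALIDATE_URL ) );
	// string(10) "http://foo"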

In the end I used this Strict URL Validator code, but even then I had to wrap it in all of this:

	// Sanitize the URL and strip any trailing slash
	$gettheurl = (string) rtrim( filter_var( $_POST['url'], FILTER_SANITIZE_URL ), '/' );

	// No scheme at all (http, https, ftp, ftps)? Assume plain http://
	if ( ! preg_match( "~^(?:f|ht)tps?://~i", $gettheurl ) ) {
		$gettheurl = "http://" . $gettheurl;
	}

	// Is it a real URL? Call StrictUrlValidator
	require_once 'StrictUrlValidator.php';

	if ( StrictUrlValidator::validate( $gettheurl, true, true ) === false ) {

		// Do the needful
	}

In many ways, this makes some sense. What is and isn’t a URL can be tetchy to check. http://foo is real. I can use it locally. That’s why http://localhost can exist. And we can’t just say “if not .com/org/net” anymore (if we ever really could). But boy is my code clunky.
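If I ever get around to cleaning it up, the whole dance could probably live in one small helper. Something like this sketch (same StrictUrlValidator, same guesses about what people will type, not battle-tested):

	// Rough sketch: normalize the input, then hand it to the strict validator.
	// Returns the cleaned-up URL on success, false on failure.
	function normalize_and_validate_url( $raw ) {
		require_once 'StrictUrlValidator.php';

		// Sanitize and strip any trailing slash
		$url = rtrim( filter_var( (string) $raw, FILTER_SANITIZE_URL ), '/' );

		// No scheme at all? Assume plain old http://
		if ( ! preg_match( "~^(?:f|ht)tps?://~i", $url ) ) {
			$url = "http://" . $url;
		}

		return ( StrictUrlValidator::validate( $url, true, true ) !== false ) ? $url : false;
	}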
