Looping Links With The WP HTML Processor

Here’s your backstory.

You need to search all the links in a post and, if a link points to a specific site (wikipedia.com), add it to an array that you output at the bottom of your post as citations. To do this, you will:

Search for tags for every single link (<a href=.... )
If the link contains our term (Wikipedia), put it in the array
If the link also has a title, we’ll use that

If we do all that, our output looks something like this:

Source: https://en.wikipedia.com/wiki/foobar

Source: Foobar2

While you can do this with regex, you can also use the (new) HTML Tag Processor class (WP_HTML_Tag_Processor) to do it for you.

RegEx

As I mentioned, you can do this with regex (I’ll spare you the drama of coming up with this in the first place):

$citations = array();

// Grab every link whose href is wrapped in double quotes:
preg_match_all( '#<\s*a[^>]*href="([^"]+)"[^>]*>.*?<\s*/\s*a>#', get_the_content(), $matches );

// Loop through the matches:
foreach ( $matches[0] as $i => $match ) {

    // If the URL contains wikipedia.com, we'll process:
    if ( str_contains( $matches[1][$i], 'wikipedia.com' ) ) {

        // Build initial data (the title falls back to the URL):
        $current_citation = [
            'url'   => $matches[1][$i],
            'title' => $matches[1][$i],
        ];

        // If there's a title, use it.
        if ( preg_match( '#<\s*a[^>]*\stitle="([^"]+)"#', $match, $title_matches ) ) {
            $current_citation['title'] = $title_matches[1];
        }

        // Only add links that matched, so this lives inside the if:
        $citations[] = $current_citation;
    }
}

?>
<ol>
    <?php foreach ( $citations as $citation ) : ?>
        <li itemprop="citation" itemscope itemtype="https://schema.org/CreativeWork">
            Source: <a rel="noopener noreferrer external" itemprop="url" class="wikipedia-article-citations__url" target="_blank" href="<?php echo esc_url( $citation['url'] ); ?>"><?php echo esc_html( $citation['title'] ); ?></a>
        </li>
    <?php endforeach; ?>
</ol>

This is a very oversimplified version, but the basis is sound. This will loop through the whole post, find every link with a URL, check if the URL includes wikipedia.com and output a link to it. If the editor added a link title, it will use that, and if not, it falls back to the URL itself.

But … a lot of people will tell you regex is super powerful and a pain in the ass (it is). And WordPress now has a better way to do this, one that’s both more readable and more extendable.
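To make the pain a little more concrete, here’s a small sketch that runs the same link pattern from above against three tiny snippets. The sample HTML and the helper name count_link_matches() are made up for illustration; only the pattern itself comes from the code above.

```php
<?php
// Hypothetical helper: counts how many links the pattern from the post finds.
function count_link_matches( string $html ): int {
    $pattern = '#<\s*a[^>]*href="([^"]+)"[^>]*>.*?<\s*/\s*a>#';
    return (int) preg_match_all( $pattern, $html, $matches );
}

// Made-up sample markup, all of it perfectly valid HTML:
$double_quoted = '<a href="https://en.wikipedia.com/wiki/foobar">Foobar</a>';
$single_quoted = "<a href='https://en.wikipedia.com/wiki/foobar'>Foobar</a>";
$multi_line    = "<a href=\"https://en.wikipedia.com/wiki/foobar\">Foo\nbar</a>";

$found = array(
    'double_quoted' => count_link_matches( $double_quoted ),
    'single_quoted' => count_link_matches( $single_quoted ),
    'multi_line'    => count_link_matches( $multi_line ),
);
```

Only the first snippet is found. The single-quoted link and the link whose text wraps onto a second line are both silently skipped, because the pattern insists on double quotes and the dot doesn’t match newlines without the s modifier.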

HTML Tag Processor

Let’s try this again.

What even is this processor? Well, it basically builds something similar to DOM nodes out of all the HTML in a WordPress post and lets us edit them. They’re not really DOM nodes, though; they’re a weird little subset, but it may help to think of each HTML tag as a ‘node’.

To start using it, we’re going to ditch regex entirely, but we still want to process our tags from the whole content, so we’ll ask WordPress to use the new class to build out our tags:

$content_tags = new WP_HTML_Tag_Processor( get_the_content() );

This creates the object, which gives us access to all of its methods. In this case, we know we want links, so we can use next_tag() to find them:

$content_tags->next_tag( 'a' );

This finds the next tag matching our query of a, which is the tag for links. If we were only getting the first item, that would be enough. But we know we have multiple links in posts, so we’re going to need to loop. The good news here is that next_tag() returns true every time it finds a match and false once it runs out, so it can drive a loop all by itself!

while( $content_tags->next_tag( 'a' ) ) {
    // Do checks here
}

That code will actually run through every single link in the post content. Inside the loop, we can check if the URL matches using get_attribute():

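$href = $content_tags->get_attribute( 'href' );

if ( is_string( $href ) && str_contains( $href, 'wikipedia.com' ) ) {
    // Do stuff here
}

get_attribute() returns null if the attribute doesn’t exist, so we make sure we actually have a string before handing it to str_contains() (PHP 8.1 and later complain when you pass it null). That same null behaviour means we can reuse get_attribute() to get the title: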

if ( ! is_null( $content_tags->get_attribute( 'title' ) ) ) {
    // Change title here
}
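One wrinkle worth knowing: as far as I understand the class, get_attribute() can hand back three different things. You get a string when the attribute has a value, true when it’s a bare attribute with no value (like a lone title), and null when it’s missing. A tiny wrapper keeps that tidy. The function name below is hypothetical (it’s not part of WordPress), and it assumes you’re on WordPress 6.2 or later, where WP_HTML_Tag_Processor exists.

```php
<?php
// Hypothetical helper (not part of WordPress): always returns a string.
// $tags is expected to be a WP_HTML_Tag_Processor that is currently
// sitting on a tag, i.e. next_tag() has just returned true.
function hypothetical_attribute_as_string( $tags, string $name ): string {
    $value = $tags->get_attribute( $name );

    // null (missing) and true (attribute with no value) both become ''.
    return is_string( $value ) ? $value : '';
}
```

Inside the loop you could then grab the href with hypothetical_attribute_as_string( $content_tags, 'href' ) and pass the result to str_contains() without worrying about what came back.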

And if we apply all this to our original code, it now looks very different:

Example:

// Array of citations:
$citations = array();

// Process the content:
$content_tags = new WP_HTML_Tag_Processor( get_the_content() );

// Search all tags for links (a):
while ( $content_tags->next_tag( 'a' ) ) {
    // Links without an href come back as null, so check for a string first:
    $href = $content_tags->get_attribute( 'href' );

    // If the href contains wikipedia.com, build our array:
    if ( is_string( $href ) && str_contains( $href, 'wikipedia.com' ) ) {
        $current_citation = [
            'url'   => $href,
            'title' => $href,
        ];

        // If title is defined, replace that in our array:
        if ( ! is_null( $content_tags->get_attribute( 'title' ) ) ) {
            $current_citation['title'] = $content_tags->get_attribute( 'title' );
        }

        // Add this citation to the main array:
        $citations[] = $current_citation;
    }
}

// If there are citations, output:
if ( ! empty( $citations ) ) :
    // Output goes here.
endif;
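For the output itself, one option is to reuse the list markup from the regex version. Here’s a rough sketch that wraps it in a function. The function name is hypothetical, the markup is trimmed down from the earlier example, and it assumes it’s running inside WordPress so that esc_url() and esc_html() are available.

```php
<?php
// Hypothetical helper: turns the $citations array into an ordered list.
// Assumes each citation is an array with 'url' and 'title' keys.
function hypothetical_render_citations( array $citations ): string {
    if ( empty( $citations ) ) {
        return '';
    }

    $html = '<ol>';
    foreach ( $citations as $citation ) {
        $html .= sprintf(
            '<li>Source: <a rel="noopener noreferrer external" target="_blank" href="%s">%s</a></li>',
            esc_url( $citation['url'] ),
            esc_html( $citation['title'] )
        );
    }

    return $html . '</ol>';
}
```

Echoing hypothetical_render_citations( $citations ) would then stand in for the “Output goes here” comment.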

Caveats

Since we’re only searching for links, this is pretty easy. There’s a decent example on looking for multiple items (say, by class and span) but if you read it, you realize pretty quickly that it only helps when you’re doing the exact same thing to every match.

If you wanted to do multiple loops though, looking for all the links but also all the spans with the class ‘wikipedia’, you’d probably start like this:

while ( $content_tags->next_tag( 'a' ) ) {
    // Process here
}

while ( $content_tags->next_tag( 'span' ) ) {
    // Process here
}

The problem is that the processor only moves forward. By the time the first loop gives up, it has already scanned to the end of the content hunting for one more link, so the second loop never sees the spans it skipped past! You could do a more complex search with an if check, but that’s risky as you might miss something. To work around this, you can use set_bookmark() to set a bookmark and seek() to jump back to it:

$content_tags = new WP_HTML_Tag_Processor( get_the_content() );

// Move to the first tag so the bookmark has something to point at:
$content_tags->next_tag();

// Set a bookmark:
$content_tags->set_bookmark( 'start' );

// Note: both loops begin searching after the bookmarked tag.
while ( $content_tags->next_tag( 'a' ) ) {
    // Process links here.
}

// Go back to the beginning:
$content_tags->seek( 'start' );

while ( $content_tags->next_tag( 'span' ) ) {
    // Process span here.
}

I admit, I’m not a super fan of that solution, but by gum, it sure works!
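One last thing: the span loops above grab every span, not only the ones with the ‘wikipedia’ class. As far as I can tell from the class documentation, next_tag() also accepts an array query with tag_name and class_name keys, which would narrow things down. A minimal sketch, with a hypothetical function name, assuming WordPress 6.2 or later:

```php
<?php
// Hypothetical helper: counts <span> tags carrying the 'wikipedia' class.
// Assumes WP_HTML_Tag_Processor is loaded (WordPress 6.2+).
function hypothetical_count_wikipedia_spans( string $html ): int {
    $tags  = new WP_HTML_Tag_Processor( $html );
    $query = array(
        'tag_name'   => 'span',
        'class_name' => 'wikipedia',
    );

    $count = 0;
    while ( $tags->next_tag( $query ) ) {
        // Process the span here; this sketch only counts it.
        $count++;
    }

    return $count;
}
```

The same array could be dropped into the second loop of the bookmark example in place of the plain 'span' string.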