Half-Elf on Tech

Thoughts From a Professional Lesbian

Tag: cache

  • Cron Caching

    Cron Caching

    WordPress' relationship with cron is touchy. It has its own version, wp-cron, which isn't so much cron as a check, run when people visit your site, for things your site needs to do. The problem is that if no one visits your site… nothing runs. That's why you sometimes have posts that miss their schedules.

    One possible solution is to use what we call 'alternate cron' to trigger your jobs. That works pretty well as it means I can tell a server "Every 10 minutes, ping my front page and trigger events."
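
    The usual way I've seen that set up (a sketch on my part, not something from this post) is to turn off the visit-triggered check in wp-config.php and let a real cron job do the pinging:

    // One common recipe (my assumption, not the post's own setup): stop
    // WordPress from firing cron on page visits. This goes in wp-config.php.
    define( 'DISABLE_WP_CRON', true );
    
    // Then the server's crontab hits wp-cron.php itself, e.g. every 10 minutes:
    // */10 * * * * curl -s https://example.com/wp-cron.php?doing_wp_cron > /dev/null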

    But in this case, I didn't want that. I receive enough traffic on this site that I felt comfortable trusting WP cron; what I wanted was for a specific set of pages to be visited every hour. That would prompt the server to generate the cached content if needed (and if not, it just loads the page).

    WordPress Plugin

    I'm a huge proponent of doing things the WordPress way for WordPress. This method comes with a caveat of "Not all caching plugins will work with this."

    I'm using Varnish, and for me this will work, so I went with the bare simple code:

    class LWTV_Cron {
    
    	public $urls;
    	
    	/**
    	 * Constructor
    	 */
    	public function __construct() {
    
    		// URLs we need to prime the pump on a little more often than normal
    		$this->urls = array(
    			'/statistics/',
    			'/statistics/characters/',
    			'/statistics/shows/',
    			'/statistics/death/',
    			'/statistics/trends/',
    			'/characters/',
    			'/shows/',
    			'/show/the-l-word/',
    			'/',
    		);
    
    		add_action( 'lwtv_cache_event', array( $this, 'varnish_cache' ) );
    
    		if ( ! wp_next_scheduled( 'lwtv_cache_event' ) ) {
    			wp_schedule_event( time(), 'hourly', 'lwtv_cache_event' );
    		}
    	}
    
    	public function varnish_cache() {
    		foreach ( $this->urls as $url ) {
    			wp_remote_get( home_url( $url ) );
    		}
    	}
    
    }
    
    new LWTV_Cron();

    Yes, it's that site. This very simple example shows that I have a list of URLs (slugs, really) that I know need to be pinged every hour to make sure the cache is cached. They're the slowest pages on the site (the death page can take 30 seconds to load), so making sure the cache is caught is important.
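
    For what it's worth, if hourly ever stops being the right cadence, WordPress will let you register a custom interval via the cron_schedules filter. A little addition like this (not part of the plugin above, just a sketch) would do it:

    // Sketch only, not part of the original class: register a custom interval,
    // then pass 'every_six_hours' instead of 'hourly' to wp_schedule_event().
    add_filter( 'cron_schedules', function ( $schedules ) {
    	$schedules['every_six_hours'] = array(
    		'interval' => 6 * HOUR_IN_SECONDS,
    		'display'  => 'Every Six Hours',
    	);
    	return $schedules;
    } );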

  • Torrenting Cache

    Torrenting Cache

    There’s a new cache in town, CacheP2P.

    The basic concept is that you can use BitTorrent to seed your cache across the world, making it even faster for everyone. Setting it up is fairly simple. Configuring it is not. At least not in an automated fashion.

    Traditional web browsing is a direct connection between user and server. Traditional caching is done by having the server (or a proxy of the server) create a static copy of the page and display that. In the case of WordPress and any other dynamic CMS, that works by taking the load off of PHP and MySQL having to generate a new page on every visit.

    By using BitTorrent, this changes so that you get a cached copy not from the server but from someone else’s computer. If you and I were on the same network, I might get the page from you instead of the server. That sounds really weird, doesn’t it? Two javascript files combine to signal the torrent’s API, and a third file uses the unique page hash to determine freshness. Keep your eye on that last part; it’s what makes the idea of a plugin for WordPress such a pain.

    To get the content for that last file, you have to look at your page in dev tools to grab the security hash:

    [CacheP2P] this page's security hash: (2)
    "c72d19b8ed03be98ceebd06f7c93dc06410b4de4"
    "(http://www.cachep2p.com/api.html)"
    

    On Safari it looks like this:

    Example of what the hash looks like

    Now if it works, and you can see an example on the cachep2p.com domain, it would show results similar to this:

    Example of the cache working

    This did not actually work for me on Safari. At all. It was fine on Chrome, but Safari never served up the cache which is odd.

    My first concern was about cache contamination. That is, if someone downloads the page and messes with it, could they have my site show content I didn’t want it to show? By using hashes, this is minimized. I have a file that defines the valid hashes, and if the copy doesn’t match, it downloads my content, not the bad one.

    However the greater concern is that of accidentally releasing content I shouldn’t. Take this example. I accidentally publish something I shouldn’t, like the plan to tear down the Berlin Wall. Without caching, I can quickly redact it and if Google didn’t scrape my page, it’s like it never happened. With caching (and Google…) the bad content (my destruction plans) remain out there unless I visit the cache provider and flush things. If you’ve ever used a 3rd party proxy like Cloudflare to cache your content, this is the situation when you update your CSS files and have to go force them to refresh.

    With the BitTorrent situation this becomes worse, because the cache is in the hands of the masses. If you were a politician and I your rival, I would have someone constantly visiting your site and saving the cache. Then I could go through it and look for accidental leaks.

    Now of course this could happen today. I could set up a simple content scraper and have it ping your site every so often to save the data. You could, in turn, block my IP, and I would retaliate by setting up a Tor connection to do it from obfuscated IPs. The difference here is that you’re actually encouraging me to cache your data with this plugin.

    An additional concern is the dynamic aspect of WordPress. The only way to grab the hash right now is to view the page. That hash will change when I save a page. In fact, it might change on every page load, in some situations. I didn’t get too far into testing at this point, since I realized that in order for this to work I would have to load a page, grab a hash, edit a file, save that file up on the server, and then it would cache…

    That would be terrible on WordPress. For this to work on any large site, the generation of that hash file would have to be automated. Dynamic site or not, making people do that manually is preposterous. A vaguely WordPress solution I dreamed up was to somehow catch the cache hash as the page is saved, store it in a post-meta value, and then use WordPress to generate a ‘fake’ page with the URL and the hash for the cache tool to use.
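
    If I ever did build that, the skeleton might look something like the following. It assumes (and I haven’t verified) that CacheP2P’s hash could be reproduced server-side as a simple SHA-1 of the rendered page, so treat it as a thought experiment rather than working code:

    // Thought experiment only: grab a hash of the rendered page on save and
    // stash it in post meta. Whether this matches the hash CacheP2P expects
    // is an unverified assumption on my part.
    add_action( 'save_post', function ( $post_id ) {
    	if ( wp_is_post_revision( $post_id ) ) {
    		return;
    	}
    	$response = wp_remote_get( get_permalink( $post_id ) );
    	if ( ! is_wp_error( $response ) ) {
    		$hash = sha1( wp_remote_retrieve_body( $response ) );
    		update_post_meta( $post_id, '_cachep2p_hash', $hash );
    	}
    } );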

    It might be easier to do that via something like WP Super Cache or W3TC, and have it save the file as it saves the cached page (and point to the static page instead of the dynamic one), but even then, the rapid changing of WordPress content would make it difficult for a cache to seed far enough out.

    Right now, I think this is something that might only be useful for a small, mostly static, site.

  • Using JSON with Hugo

    Using JSON with Hugo

    I know I’ve been talking about Hugo a lot, but the whole reason I sat down and wrapped my head around it was that I wanted to make a website using JSON.

    Making something out of the JSON API in WordPress isn’t easy, and I decided to start with something hard. I thought I would make a Multisite network to be the back end of my site, and then call the JSON API to generate the content.

    Why?

    It doesn’t make my site (much) faster, though it makes it more cacheable. Mostly what it does is let me separate church and state. I can use WordPress to write content, and then call it however I want. That would be Hugo. If I build the site via Hugo locally, then every time I generate the files it can pull in the data on the fly.

    Enable the API

    This is easy. Install WordPress 4.4 and install v2 of the REST API plugin. I also installed Jetpack for reasons of using Calypso.

    You absolutely have to use the REST API plugin. WordPress 4.4 adds in some of the bits, but the API plugin gives you the actual endpoints. You’ll need these later.
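
    As a quick sanity check that the endpoints are answering, here’s my own little snippet (plain PHP, with example.com as a placeholder domain) that pulls the latest posts from the v2 endpoint the plugin exposes:

    <?php
    // example.com is a placeholder; swap in the site running the REST API plugin.
    $json  = file_get_contents( 'https://example.com/wp-json/wp/v2/posts?per_page=5' );
    $posts = json_decode( $json );
    foreach ( $posts as $post ) {
    	echo $post->title->rendered . "\n";
    }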

    Call the API

    This was also easy. Kind of. Making a jQuery file that called the content was fairly straightforward, but… making it responsive and reactive was hard.

    What I really wanted to do was have some web app, JS powered without Node, that just called the data and kept the URL structure. And for that, what I needed was a JSON client.

    It was incredibly hard to find out how to make a static HTML website powered by JSON. I’d seen Rachel Baker’s presentation on making a WordPress theme with the JSON API, but that wasn’t what I wanted to do. I mean, in a way it was. The theme itself was what I wanted, but I didn’t want to run it with WordPress.

    I did a lot of research and finally stumbled onto restful.js by marmelab. I read their blog post on the product and it looked like what I wanted. I needed something that would run without Node.js since I wanted to do this on Apache.

    Consuming a RESTful Web Service Dynamically

    And this is when I started banging my head against a table. None of the solutions I’d found were working for me. Nothing. Rachel was right. I was in a bit over my head. Oh, she was nice enough not to say it that way, but the way she looked at me was very Jewish Mom of her.

    In order to stop feeling useless, I decided to learn Hugo. It had come up a few times on my search for a dynamic static site. Hugo has LiveReload which, if you’re familiar with most static generators, is pretty cool. The idea is the pages rebuild as you edit the content (or theme) files.

    If you run hugo server locally on your laptop, you can see this today. Hugo upped the ante by tossing in the ability to run it on the server and the ability to watch the data files. It also has the ability to call JSON remotely.

    I feel it’s important to note that you cannot trigger that LiveReload when you’re using external URLs. But anytime you edit a file and LiveReload is triggered, Hugo will read the URL content from your cache. If you’ve disabled cache, it will download it fresh. Don’t disable cache unless you have an unlimited API. And no, I don’t know how long the cache is good for.

    Sadly, OAuth or other authentication methods are not working yet, but for now that’s okay because I’d been in over my head for so long, I stepped back.

    Understanding JSON First

    Okay fine. Once I realized I’d tried to do too much at first, I went back to that static filmography I had.

      {
      	"role": "actor",
        "title": "White Christmas",
        "slug": "white-christmas",
        "type": "Movie",
        "character": "Phil Davis",
        "notes": null,
        "dates": 1954
      }
    

    This is basic stuff: a simple list, one chunk at a time. And to call the file in a shortcode, I did this:

    {{ range $.Page.Site.Data.filmography }}
    	{{ if eq $role .role }}
    		<li>{{ .title }}: <strong>{{ .character }}</strong> </li>
    	{{ end }}
    {{ end }}
    

    Actually I did a lot more but you get the idea. And if you’re wondering about the if statement at the top, it’s so I can pass a parameter to my shortcode:

    {{< filmography role="actor" >}}

    It took me longer than I’d like to figure that out. At first I had multiple filmography files (actor, writer, producer) and I was trying to loop through them all. In the end, I realized this simple JSON and a simple call was better.

    A Little More JSON

    There was another case, though. I had lists of episodes compiled into separate JSON files, one per show. This is logical and maintainable after all, but it meant I couldn’t use the same logic loop in the same way.

    What I really needed was to loop through $.Page.Site.Data.episodes.SHOWNAME in order to get the data from that show. And I couldn’t add on a variable to my range call.

    But as it happens, Hugo is clever and if I say {{ range $.Page.Site.Data.episodes }} it will give me the file episodes.json if that exists. But if there happens to be a folder called episodes then it will give me all the JSON files in there.

    I read through the data documentation on Hugo a few times to understand the hierarchy but it breaks down really logically:

    ├── data
    |   ├── episodes
    |   |   ├── continum.json
    |   |   └── fringe.json
    |   |   └── sense8.json
    |   ├── movies.json
    |   ├── musicals.html
    

    Data {dot} episodes {dot} continum

    And if I ran this:

    {{ range $.Page.Site.Data.episodes }}
    	<li>{{ .show }}</li>
    {{ end }}
    

    Then I got this:

    • continum
    • fringe
    • sense8

    I already knew I could do a range within a range, and finally I came up with this:

    {{ $show := $.Page.Params.show }}
    
    {{ range $.Page.Site.Data.episodes }}
    	{{ if eq .show $show }}
    		{{ range .episodes }}
    			<li>{{ .title }}</li>
    		{{ end }}
    	{{ end }}
    {{ end }}
    

    The shortcode checks the post it’s on and, if the Front Matter has ‘show’ defined, it will give me a list of all titles (episode titles) for that show.

    I added in an extra check for the season parameter. This one is passed through the shortcode:

    {{< episodelist season="1" >}}

    And that, in turn is called in the shortcode above the output for title:

    {{ range where .episodes "season" "==" $season }}

    So nothing will display unless the show in the JSON file matches the show on the page with the shortcode, and the season matches the one defined in the shortcode.

    Now I actually have a more robust check. If no season is defined, it shows all episodes (useful for one-season shows, right?). But you get the idea.

    How Did This Help?

    Remember my JSON example before? Here’s what it looks like for Fringe:

    {
    "show" : "fringe",
    "episodes": [
      {
        "epnum": "1",
        "season": "1",
        "title": "Pilot",
        "slug": "pilot",
        "airdate": "2008-09-08T22:00:00.000Z",
        "rating": "5",
        "summary": "Walter gets a cow named Gene to experiment on ethically."
      },
      ...
    ]}
    

    Instead of just starting with those naked chunks, I have a definition of the show name (fringe – lowercase because I want to compare it to other lowercase things and I’m lazy) and the definition of a section for episodes. That’s the whole reason the call for seasons works the way I want it to.

    This helped me to understand and visualize JSON. By taking a simple JSON list (the filmography) I learned the basics of connecting code and content. Then by taking the more complex scenario, where I needed nested data, I understood that relationship.

    The next step will be making a simple WordPress site and calling the JSON remotely. Obviously I have to figure out how to trick it into making fake permalinks, but baby steps. Baby steps.

  • Varnish Cache and Cache-Control

    Varnish Cache and Cache-Control

    In our quest for speed, making websites faster relies on telling browsers when content is new and when it’s not, allowing them to only download the new stuff. At their heart, Cache Headers are what tell the browser how long to cache content. There’s a special header called Cache-Control which allows each resource to decide its own policy, such as who can cache the response, when, where, and for how long. By default, the time we set for the cache to expire is how old a visitor’s copy can be before it needs a refresh.

    A lot of the time, I see people setting Cache-Control to none and wondering why their site is slow.

    Since I spend a lot of time working on DreamPress, which uses Varnish, I do a lot of diagnostics on people with slow sites. One of my internal scripts checks for Cache-Control so I can explain to people that setting it to none will tell Varnish (and browsers) literally not to cache the content.

    The way it works is that they actually set things to ‘no-cache’ or ‘no-store.’ The first one says that the content can actually be cached, but the browser is going to check and make sure the resources haven’t changed before reusing them. It’s not really ‘no-cache’ but ‘check-cache.’ If nothing’s changed, there’s no new download of content, which is good, but it’s still not really caching.

    On the other hand, ‘no-store’ is really what we think about when we say not to cache. That tells the browser and all intermediate caches that every time someone wants this resource, download it from the server. Each. Time.
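
    To make the difference concrete, here’s a rough illustration of my own (not from the post or the Varnish docs) of what each directive looks like when PHP sends it; the one-hour max-age is an arbitrary example:

    // Each line is an alternative, not a sequence; a response gets one of these.
    header( 'Cache-Control: no-store' );              // never store it; full download every time
    header( 'Cache-Control: no-cache' );              // store it, but revalidate with the origin before reuse
    header( 'Cache-Control: public, max-age=3600' );  // cache it freely for an hour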

    What does this have to do with Varnish? Well here’s the Varnish doc on Cache-Control:

    no-store: The response body must not be stored by any cache mechanism;

    no-cache: Authorizes a cache mechanism to store the response in its cache but it must not reuse it without validating it with the origin server first. In order to avoid any confusion with this argument think of it as a “store-but-do-not-serve-from-cache-without-revalidation” instruction.

    Since Cache-Control always overrides Expires, setting things not to cache or store means you’re slowing down your site. Related to this, if you set your Max-Age to 0, then you’re telling visitors that the page’s cache is only valid for 0 seconds…

    And some of you just said “Oh.”

    Out of the box, WordPress actually doesn’t set these things poorly. That generally means if your site kicks out those horrible messages, it’s a plugin or a theme or, worst of all, a rogue Javascript that’s doing it. The last one is nigh-impossible to sort out. I’ve only been able to do it when I disable plugins and narrow down what does it. The problem is that just searching for ‘Cache-control’ can come up short when things are stashed in Javascript.

    But there’s some kind of cool news. You can tell WordPress to override and not send those headers. I’ve not had great success with using this when it’s a script being an idiot, but it works well for most plugins and themes that seem to think not caching is the way to go.

    From StackExchange:

    function varnish_safe_http_headers() {
    	header( 'X-UA-Compatible: IE=edge,chrome=1' );
    	session_cache_limiter( '' );
    	header( 'Cache-Control: public, s-maxage=120' );
    	if ( ! session_id() ) {
    		session_start();
    	}
    }
    add_action( 'send_headers', 'varnish_safe_http_headers' );
    

    And yes, it works on DreamPress.

  • March into CloudFlare

    March into CloudFlare

    “Eating your own dogfood” is a colloquialism that describes a company using its own products or services for its internal operations. Microsoft supposedly invented it in the 1980s.

    DreamHost is partners with CloudFlare. I’ve tried time and again but I get hung up when we start talking about the delay in caching and the proxying of caches. Still, I know me. I can do things better and provide better support if I use a thing. I’m so good with PageSpeed and wp-cli now because I use them regularly.

    So it was time to knuckle down and use it for more than just a week. I was committing to a month on CloudFlare!

    Pick One Domain

    I decided to only do this for one domain: my busiest, and not a Multisite, because I wanted to create a ‘real user experience.’ I didn’t use my company account. I made a new CloudFlare account, added this domain, and started there. I didn’t want the Multisite because that’s where my store is, and if I screw up the busy site, I don’t lose money or risk people’s information. While I’m sure it can be used safely, I knew I was going to experiment a little, so I wanted to protect myself and them.

    Turn on CloudFlare, Turn off PageSpeed

    You heard me. I turned PageSpeed off for the domain I’m testing on. I love PageSpeed, but after talking to some people, I’ve been wondering about how well it handles things. Also with SPDY and HTTP/2, compressing HTML is less and less of a concern. I wasn’t sure if there was a benefit to having everything be filtered and compressed before it loaded. Was I making the experience worse? After making sure I still had mod_cloudflare active and up to date, I used .htaccess to turn off PageSpeed.

    Break Your Code Flow

    I need to point out that CloudFlare warns you about this one. They tell you that if you need to SSH, you should use a different record or the IP. That wasn’t a big deal. I’d kept ftp.example.com separate for FTP anyway. All I had to do was change my SSH aliases to point there as well.

    I forgot that I use Git on my own server. I wanted to update a script and Coda hung. It took my brain a moment to realize I was using Git over SSH, so I had to remember how to change the remotes on Git:

    $ git remote -v
    # origin  me@example.com:USERNAME/REPOSITORY.git (fetch)
    # origin  me@example.com:USERNAME/REPOSITORY.git (push)
    

    This was a problem. My example.com main domain hit CloudFlare. So I had to change that to use ftp.example.com as well. That wasn’t too hard, just running this 6 times.

     $ git remote set-url origin me@ftp.example.com:USERNAME/REPOSITORY.git
     $ git remote -v
     # origin  me@ftp.example.com:USERNAME/REPOSITORY.git (fetch)
     # origin  me@ftp.example.com:USERNAME/REPOSITORY.git (push)
    

    Still, I felt pretty silly!

    Break Your Email

    Same song, second verse. I had to add in a CNAME for smtp.example.com on CloudFlare because I use that to send emails, and I use mail.example.com to receive. By default they know mail should be ignored. The scan didn’t pick up smtp. I forgot about it until I tried to reply to an email.

    Break Your Tools

    Not the ones on my computer. My other CMSs broke. CloudFlare is used to WordPress, and has Five Easy First Steps to using WordPress and CloudFlare.

    In that document, they specify this:

    Create a Page Rule to exclude the wp-admin or wp-login sections from CloudFlare’s caching and performance features. You can access PageRules in your CloudFlare ‘Settings’ options.

    e.g.

    *example.com/wp-admin/*
    *example.com/wp-login/*
    

    Why do this?

    While there is not always an issue, we have seen instances where optional performance features like Rocket Loader may inadvertently break certain functions (editors, etc.) in your WordPress back end.

    Except there are major problems (besides the fact that the second example should be *example.com/wp-login* without the trailing slash)! The free version only gives you three rules. I’m using four apps (WordPress, ZenPhoto20, Yourls, and MediaWiki). That means if I need to white list all of them, I’m out of luck.

    Then there’s the problem that *.example.com matches blog.example.com and www.example.com but does not match example.com and guess what? I’m using example.com without the WWW. I hate WWW. And yes, you can use a naked domain with CloudFlare.

    Break The Site (For One Person)

    Someone pinged me to let me know the site was down in Scotland:

    Website is Offline Message from CloudFlare

    It wasn’t in Manchester or Dublin. It wasn’t in the US. It wasn’t in Canada. I opened a support ticket after making sure that it wasn’t really me. The server was up (all other sites, including this one, were up) and I could get the site via anonymous proxies. Only that user had an issue, and I was 100% positive I had whitelisted everything in CSF (it’s the same thing I do for Jetpack). But the website claims a 522 is my server’s fault.

    This was never resolved, and it demonstrates a major issue in the process. The user was not web savvy. She shouldn’t have to be, though. She just wanted to visit a site and read things. It was very annoying.

    Drop Server Load

    Okay. So this part worked.

    Graph showing Server load on day 3 leveled out

    It only shows up starting day three because I didn’t flip DNS over right away. I had to turn off PageSpeed, upgrade PHP, make sure nginxcp was going to work with it… There was prep work. Day three, the little spikes vanished. You still get big ones because that’s when the server runs backups and upgrades. The little spikes are, normally, when I have a new post on the site.

    There’s been no perceptible change to bandwidth. All other sites on the server are, however, notably faster.

    End Result?

    It looks like CloudFlare worked. The stats say I’m using 25% less CPU with CloudFlare, which is interesting, but now that it’s baked, I want to try something else, just for grins and giggles.

  • Why You Can’t (Always) Catch Cache

    Why You Can’t (Always) Catch Cache

    Or rather, why you don’t want to cache.

    When speeding up a site, the first thing we always talk about is streamlining the site. Ditch the redundant plugins, clean up the code, get a lightweight theme, dump the stuff you don’t use, and upgrade. Once you’ve done that, however, we talk about caching, and that is problematic. Caching is tricky to get right, because you can’t cache everything.

    Actually you can, but you don’t want to cache everything, and you really shouldn’t.

    The basic concept of a cache is ‘take a picture of my website in this moment.’ We think of it as a static picture, versus the dynamic one we use for WordPress. That static picture is not so static, though. You still call CSS and JS, and you call images. But the idea is not to call PHP or the database, and thus your site will be faster, as those are the slowest things.

    Great, so why don’t I want you to cache everything?

    Missed Catch

    The obvious answer is latency. If you’re designing a site or making a lot of style changes, you need to disable caching because otherwise you get a lag between edits and displaying text. But there’s also server latency.

    When we talk about things like PageSpeed (I use mod_pagespeed on my server) to improve web page latency and bandwidth usage, we’re talking about actually changing the resources on that web page to follow best practices. This sounds great, but we have to remember that by asking the webserver to do the work before WordPress (such as having PageSpeed minify my CSS and JS), we’re still making the server do the work. Certainly it’s faster than having WordPress do it, but there will still be a delay in service.

    The next obvious answer is security. There’s some data you flat-out don’t want to cache, and it’s pretty much everything in wp-admin (on WordPress). Why? If you have multiple users, you don’t want them getting each other’s content. Some are admins, some aren’t, and I know I don’t need my guest editor seeing the post about where I’m firing her and why.

    Actually, we’ll extend this. Very rarely do I want my logged-in users to see a cache at all. They’re special; they’re communicating and making edits on the fly. Having them see cached content, and constantly refresh it, will be more draining on my server than just loading the content in the first place. Remember the extra load caused by PageSpeed (or even your plugins) generating the cache? That would be constantly in progress when a logged-in user made a change, so let’s skip it altogether.

    Tagging on to that, you also don’t want your admins and logged-in users to generate a cached page. This isn’t the same as seeing a cached page: I don’t want non-logged-in users to see the version of the site a logged-in one gets. A logged-in user on WordPress gets the toolbar, for example. I don’t want the non-logged-in ones to see it.
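
    If your caching layer doesn’t already handle this for you, a minimal sketch (my own illustration, assuming a stock WordPress setup) is to send no-cache headers whenever the visitor is logged in, so they neither see nor generate cached pages:

    // My illustration, not a requirement: keep logged-in users out of the cache.
    add_action( 'send_headers', function () {
    	if ( is_user_logged_in() ) {
    		nocache_headers(); // sends Cache-Control: no-cache, must-revalidate, max-age=0
    	}
    } );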

    Finally, we have to round back to security. If I have SSL on my box and I’m using HTTPS to serve up a page, no way, no how do I want to cache anything related to users. In fact, I may not even try to cache JS/CSS either. The basic presumption of HTTPS is that I need security. And if I need security, I need to keep users out of each other’s pockets. The best example of this is a store. I don’t want users to see each other’s shopping carts, do I? And your store is going to be HTTPS for security, so this is just one more feature thereof.

    Of course, there are still things to cache. I set up PageSpeed on my HTTPS site so it will compress images, make my URLs root-relative, and compress and minify HTML/CSS/JS. But I don’t have a traditional cache with it at all. This does mean that as we start to look towards an HTTPS-only world (thank you, Google) we’re going to run into a lot of caching issues. All the quick ways we have to speed up our sites may go away, to be replaced by the next new thing.

    I wonder what that will be.