Tag: stats

Processing Numbers with WordPress

The very idea of ‘I should make statistics’ or ‘what are the metrics of this’ starts from the same place. We have a desire to understand what a thing is. Statistics, like traffic, and metrics, like speed, can tell us obviously important information about our sites. Faster sites do better. More traffic gets you more… whatever.

But those are the obvious things. There are easy to understand numbers and there are difficult to process numbers. And it all matters where you save the data.

Getting At The Data

When I set about making statistics for LezWatchTV, the biggest problem I faced was determining what I wanted to show. Some things were simple. How many characters died and what percent of all characters was that? How many shows have dead characters?

Since I chose to use WordPress features, like custom taxonomies, for the majority of the aspects of the site, getting those numbers was simple. There were, of course, some that were very difficult to get at, and this is fully of my own design. Sometimes there will be data you want to use that is just harder to get at than others.

This means the question of understanding your numbers begins with understanding where they belong.

Save Data in Smart Places

I say this over and over. Use WordPress’ native features first.

I mean use the taxonomies and the custom post types and the post meta wisely. But. When you’ve got a lot of data that needs to be cross related, consider saving it someplace else. For example, the reason FacetWP is so damn fast is that it doesn’t query WordPress all the time, and instead uses it’s own tables.

Having it’s own table means there’s less overhead as they can make direct SQL calls to pull the data. When you have data spread across three post types, this becomes pretty much an imperative. You just have to script the code to save it properly.

External Data

While FacetWP does save data to it’s own tables, there is another option, and that is external locations. You’re most familiar with this with regards to Google Analytics. Some data makes sense to keep local, but keep in mind what you’re doing and what you’re generating with the data. When it’s just posts, local is perfectly logical. When you get into statistics… Well. Maybe you should export it.

That brings up the next question. What data to you export, and to where.

20 February, 2018
Hot Hands And Playoffs

Today I’m wandering off topic into a world of baseball and statistics.

My family have been Cleveland Indians fans since they came to the United States and settled in the city. My grandmother was an accountant, my father a mathematician, and I a web developer who works on software used by 26% of the Internet. Give or take. I’m also a third (and probably final) generation Clevelander. Yes, I root for my home team.

October of 2016 marked the first time since 2007 that Cleveland was in the American League Championship Series (ALCS). In the intervening years, my family had all migrated to iPhones and iMessage, allowing us to converse in real time across two continents, two countries, four time-zones, and five cities.

My father, the mathematician and risk analyst, kept a close watch on Nate Silver’s FiveThirtyEight project, especially the MLB Predictions, as Mr. Silver has been quite spot on for things for a while, understanding the implications of probability and chance.

On October 18th, FiveThirtyEight gave Cleveland a 53% chance of winning the ALCS game 4, a 94% chance of making the World Series, and a 38% chance of winning it all for the first time since 1948. The time Cleveland won before that? 1920. Not quite Cubs level of history, but it’s been a long enough time than my grandmother Taffy never got to see them win a third time (she was born July 7, 1920).

The game on the 17th was nothing short of incredible. The starting pitcher was yanked after 2 outs because his cut pinky was dripping blood. There are, you see, a bevy of incredible rules about what pitchers can and cannot wear. More than the normal player. And we’re talking about a sport than demands all players use a glove that has colors only within a PANTONE® color set lighter than the current 14-series. These guys are nuts. And one of the rules is no bandages on the pitchers’ hands.

a pitcher’s person cannot include any unessential or distracting thing (including jewelry, adhesive tape, or a batting glove), especially on his arm, wrist, hand, or fingers.

Bauer’s 11 stitches in his pinky split and was incredibly nasty, so he was replaced. Cleveland used seven pitchers, pretty much their entire relief bullpen, to get through the game. My family began to argue the intelligence of the move. Instead of using the rookie Merritt to start game 4, possible win-it-all game, Manager Terry Francona decided to start his ‘ace,’ Kluber.

To understand this, you have to start with the odd fact that Cleveland is down three of their best pitchers to injuries. This is including Drone Boy Bauer. Such a situation is rare for the playoffs, if not unheard of. That means they are more reliant than ever on their bullpen, so using every single pitcher possible on Monday meant they would all be a little tired on Tuesday. And Kluber would be starting 3 days rest when a pitcher normally gets 4 or 5.

Clearly Francona was banking on the team not needing to use the bullpen much on Game 4, but why would he make that decision knowing that the odds of winning on Tuesday were insanely low. As my dad said:

Winning 7 games straight is an outlier. They won 6 in a row twice, of course the 14 streak, 4 games three times. I’m betting they will lose the next two in Toronto.

Then he started emailing us all homework.

Before we get to the math, let’s look at the baseball logic. The reason you would play Bauer is that the odds are Cleveland will lose on the 18th, and a good manager would know that and bet on it, like my father. Teams winning 7 games in a row is crazy. It’s rare. It’s risky. By playing Kluber, an experienced pitcher, you solve two problems. First, Merritt is a rookie. Him losing will have a deep psychological impact on the young guy. Kluber can take a hit and keep going. Second, it means if Cleveland does win, Kluber will be well rested for the World Series. If Game 4 is lost, Merritt will pitch the safer Game 5.

The psychology of math is exactly why no one would discount the Cleveland Indians winning seven games in a row in the post season, however.

[…] what Terry is seeing is momentum, the intangible. You gotta measure the odds with numbers, but making good decisions goes beyond the odds … beyond just the odds. Like CoCo’s diving catch.

This is where the homework comes into play. Nine papers about hot streaks later, I came to the conclusion I had always felt had to be true. There is no such thing as a winning streak. They are nothing more than standard deviations from the mean. Models of the math have told us that there is only one event in baseball that has happened outside of the frequency of said models. Everything, the longest runs of losses and wins, are exactly as they should be and happen as often as they ought.

Except for one: Joe DiMaggio. Joltin’ Joe’s 56–game hitting streak in 1941 doesn’t make any sense. As we read in Streak of Streaks by Jay Gould, in order to make it mathematically probably to have a run of 50 games with a hit, we should have had four batters with a lifetime average of .400, and 52 with .350 or higher over 1000 games. Instead, three players have achieved a batting average over .350 and not one has managed .400 lifetime.

You’re thinking “But Ty Cobb!” right now, and guess what? His lifetime is .367, followed by Hornsby at .358, and Shoeless Joe Jackson brings up the rear at .356 for his short career.

DiMaggio’s streak does not make sense.

Most MLB records we consider to be unbreakable are only that way due to changes in the way the game is played. Pitchers no longer play complete games on the regular, nor do they start 60+ games a season. The weirdness of DiMaggio is that his numbers are off the charts for that year, and actually the entirety of MLB history.

The Hot Hand: A New Approach to an Old “Fallacy”. Notice the quotes? The theory behind the Sloan paper is that the Hot Hand (or streak) is a fallacy because we’ve always been working under bad assumptions. To whit:

However, prior research hinges on the assumption that player shot selection is random, independent of player-perceived hot or coldness. Said differently, it assumes that players will take the same types of shots, with the same level of defensive coverage, regardless of whether they have just made or missed three shots in a row. We find this assumption difficult to believe – if players have been shooting well, it seems logical that they would begin to attempt more difficult shots and opposing defenses would begin to cover them more tightly. This would potentially counteract the Hot Hand effect.

To make this more obvious to the conversation at hand, basketball is not baseball and men are not potatoes. Baseball is a rarity in sports. The defense has control of the ball and, barring injury, everyone who plays will have an at-bat (designated hitter rules aside). Basketball has no promise that everyone who plays will have a chance to shoot a basket, or even touch a ball. Baseball hitting streaks come down to one person versus a hundred. The batter versus every pitcher they face. Provided they’re not walked, the batter remains in control of their destiny.

All of this is quite fanciful. There are hundreds of articles, like Phil Birnbaum’s quest for evidence of the Hot Hand effect and Tangotiger’s Sabremetric blog on the impact of the Zone on streaks. The best we can say is ‘Streaks exist, but generally they do so within the expected norm of percentages.’

None of this considers the psychological impact of a streak. The longer a streak goes on, the more stress and nerves are put on a player. At the same time, the more ease is given a player, as the expectation of winning becomes a short-term norm.

Per FiveThirtyEight, the Cleveland Indians had a 53% chance of winning Game 4 of the ALCS on October 18, 2016. The Epstein family gave it much less of a chance. We were right.

28 October, 2016

Post Meta, Custom Post Meta, and Statistics

I do a lot with weird stats. I do a lot with CMB2 and custom meta. I do a lot with CMB2 and weird post meta.

Is it any surprise I wanted to add my new year dropdown data to my site for some interesting statistics?

Of course not. But I started small. I have a saved loop called $all_shows_query

// List of shows
$all_shows_query = new WP_Query ( array(
	'post_type'       => 'post_type_shows',
	'posts_per_page'  => -1,
	'no_found_rows'   => true,
	'post_status'     => array( 'publish', 'draft' ),
));

This really just gets the list of all the shows and holds on to it. The meat of the code runs with this:

$show_current = 0;

if ($all_shows_query->have_posts() ) {
	while ( $all_shows_query->have_posts() ) {
		$all_shows_query->the_post();
		$show_id = $post->ID;
		if ( get_post_meta( $show_id, "shows_airdates", true) ) {
			$airdates = get_post_meta( $show_id, "shows_airdates", true);
			if ( $airdates['finish'] == 'current' ) {
				$show_current++;
			}
		}
		// More IF statements are here.
	}
	wp_reset_query();
}

As mentioned, there are more if statements, like if ( get_post_meta( $show_id, "shows_stars", true) ) {...} which gets data for if a show has a star or not. And if ( get_post_meta( $show_id, "shows_worthit", true) ) {...} which checks if a show is worth it or not. I was already doing a lot of this, so slipping in one more check doesn’t hurt.

To show the data, without Chart.js, it looks like this:

	<h3>Shows Currently Airing</h3>

	<p>&bull; Currently Airing - <?php echo $show_current .' total &mdash; '. round( ( ( $show_current / $count_shows ) * 100) , 1); ?>%
	<br />&bull; Not Airing - <?php echo ( $count_shows - $show_current )  .' total &mdash; '. round( ( ( ( $count_shows - $show_current ) / $count_shows ) * 100) , 1); ?>%
	</p>

With Chart.js, it looks like this:

<canvas id="pieCurrent" width="200" height="200"></canvas>

<script>
// Piechart for Currently Airing Data
var pieCurrentdata = {
    labels: [
        "On Air (<?php echo $show_current; ?>)",
        "Off Air (<?php echo ($count_shows - $show_current ); ?>)",
    ],
    datasets: [
        {
            data: [ <?php echo '
	            "'.$show_current.'",
	            "'.($count_shows - $show_current ).'",
	            '; ?>],
            backgroundColor: [
	            "#4BC0C0",
	            "#FF6384",
	            "#E7E9ED"
            ]
        }]
};

var ctx = document.getElementById("pieCurrent").getContext("2d");
var pieTriggers = new Chart(ctx,{
    type:'doughnut',
    data: pieCurrentdata,
    options: {
		tooltips: {
		    callbacks: {
				label: function(tooltipItem, data) {
					return data.labels[tooltipItem.index];
				}
		    },
		},
	}
});
</script>

Right now the data’s a bit off, since I’ve only updated 200 of the 325 shows, but it’s enough to get the information.

29 August, 2016

Tag: stats

Processing Numbers with WordPress

Getting At The Data

Save Data in Smart Places

External Data

Hot Hands And Playoffs

Post Meta, Custom Post Meta, and Statistics