For the last ten years, I worked for DreamHost, which meant I had access to a lot of awesome commands that everyone ran to diagnose things.
Well now I’m gone and I’m still a webadmin for my domains. And I have, as you all know, a weird guy who keeps going after me. I also have been running fandom sites for longer than WordPress has existed. I’ve had to learn a lot of tricks to sort out ‘Is this person so-and-so again?!’
Now… I’m going to tell you a secret. You ready? Okay, most of those scripts hosts run? They’re just cleaned up shell commands you run on the server via command line (aka command line interface aka cli). And those commands? They’re actually pretty common, well known, and public.
So here are some of the ones I use and why!
Before You Begin…
I have to step back a moment.
Do you know where your log files are? DreamHost posted in their KB how you do that, but you will want to check your own host’s documentation.
There are three caveats, and I know one is weird.
1. Logs rotate
Server space matters, so logs are regularly deleted to prevent your data from killing things.
Right now I see this:
-rw-r--r-- 1 root root 3.5M Sep 16 09:08 access.log
lrwxrwxrwx 1 root root 21 Sep 16 00:49 access.log.0 -> access.log.2022-09-15
-rw-r--r-- 1 username server 1.6M Sep 12 00:51 access.log.2022-09-11.gz
-rw-r--r-- 1 username server 1.4M Sep 13 00:54 access.log.2022-09-12.gz
-rw-r--r-- 1 username server 1.6M Sep 14 00:11 access.log.2022-09-13.gz
-rw-r--r-- 1 root root 9.7M Sep 15 00:21 access.log.2022-09-14
-rw-r--r-- 1 root root 11M Sep 16 00:49 access.log.2022-09-15
Tomorrow I’ll lose the 9-11 log.
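Because of that rotation, a command pointed at access.log only sees today's traffic. Here is a minimal sketch for reading every log you still have in one go. It assumes GNU gzip (where `gzip -cdf` copies uncompressed files through untouched) and the file naming from the listing above; the directory and the log lines in it are made up so the example runs anywhere.

```shell
# Build a throwaway directory that mimics a rotated log folder (hypothetical data).
logdir=$(mktemp -d)
printf '192.0.2.1 - - today\n' > "$logdir/access.log"
printf '192.0.2.2 - - yesterday\n' > "$logdir/access.log.2022-09-15"
printf '192.0.2.3 - - older\n' | gzip -c > "$logdir/access.log.2022-09-13.gz"
# The .0 symlink points at yesterday's file, just like in the listing above.
ln -s access.log.2022-09-15 "$logdir/access.log.0"

# Read every real file; -type f skips the symlink so yesterday is not counted twice.
# gzip -cdf decompresses the .gz files and passes plain files through as-is.
all_lines=$(find "$logdir" -maxdepth 1 -type f -name 'access.log*' -exec gzip -cdf {} + | sort)

printf '%s\n' "$all_lines"
rm -rf "$logdir"
```

On a real server you would pipe that `find` into the awk commands below instead of sorting it.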
2. You need to know what your logs look like
Every host tweaks the format of Apache logs in a different way. You’ll see I use things like print $1 in my code, and for me I know that means “$1 is the IP address.” But that may not be what your host does.
Look at the logs:
192.0.114.84 - - [16/Sep/2022:00:49:05 -0700] "GET /wp-content/uploads/2019/10/Pure.jpg HTTP/1.1" 200 257552 "-" "Photon/1.0"
And then count the chunks separated by spaces. IP is #1, URL is #7, and so on.
It can be a pain so please feel free to experiment and mess with it to get exactly what you want.
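If counting by eye gets old, awk can do the counting for you. This is a small sketch that prints every field next to the number awk assigns it, using the sample line from above; on your own server you would feed it `head -n 1 access.log` instead of the hardcoded `line` variable (that variable name is just for the demo).

```shell
# One sample line in the format shown above (on a server: head -n 1 access.log).
line='192.0.114.84 - - [16/Sep/2022:00:49:05 -0700] "GET /wp-content/uploads/2019/10/Pure.jpg HTTP/1.1" 200 257552 "-" "Photon/1.0"'

# Print every whitespace-separated field next to its awk field number.
numbered=$(printf '%s\n' "$line" | awk '{ for (i = 1; i <= NF; i++) printf "$%d = %s\n", i, $i }')

printf '%s\n' "$numbered"
```

For this line, $1 comes out as the IP, $7 as the URL, and $11 as the referrer, which are the numbers the commands in this post rely on.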
3. You may need to use http logs for everything
This is specific to DreamPress (the managed WP hosting) and is the weird one: you always have to use the http folder, even if you use https.
Why? Well, that has to do with how the server processes traffic. DreamPress (as of the time of this post) uses Varnish to cache and Nginx as an SSL proxy. That means when you go to https://example.com the server has Nginx handle the HTTPS stuff and pass the request to Apache, which runs HTTP. Those logs are your Apache logs, not your Nginx ones.
Can you view the Nginx logs? Not at this time. Also they really are pass-throughs, so you’re not missing much. If you think you are, please open a ticket and tell them specifically what you’re looking for. Those help-desk folks are awesome, but the clearer you are about exactly what you’re looking for, the better help you get.
Okay! On with the show!
Top IPs
Sometimes your site is running super slow and you want to know “Who the heck is hitting my site so much!?”
awk '{ print $1}' access.log | sort | uniq -c | sort -nr | head -n 10
This command will list the top 10 IPs that hit your site. I find this one super helpful when used in conjunction with an IP lookup service like IPQualityScore, because it tells me sometimes “Hey, did you know Amazon’s bots are hitting the heck out of your site!?”
You can change that 10 to whatever number of top IPs you want to look for. That tends to be enough for me.
If you know you have a lot of ‘self’ lookups (like you wrote something that has your server do a thing) you’ll want to try something like this to exclude them:
awk '{print $1}' access.log | grep -vE '^(127\.0\.0\.1$|192\.168\.100\.)' | sort | uniq -c | sort -rn | head -10
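Once you have a top IP, the natural next question is "what is it actually asking for?" Here is a sketch that chains the two steps together. The log lines are invented sample data (documentation-range IPs) so it runs on its own, and `top_ip` and `top_ip_urls` are just names I picked; it assumes $1 is the IP and $7 is the URL, as in my logs.

```shell
# Hypothetical sample log so the example runs anywhere; use your real access.log instead.
log=$(mktemp)
cat > "$log" <<'EOF'
203.0.113.5 - - [16/Sep/2022:00:49:05 -0700] "POST /xmlrpc.php HTTP/1.1" 200 512 "-" "bot"
203.0.113.5 - - [16/Sep/2022:00:49:06 -0700] "POST /xmlrpc.php HTTP/1.1" 200 512 "-" "bot"
203.0.113.5 - - [16/Sep/2022:00:49:07 -0700] "GET /wp-login.php HTTP/1.1" 200 900 "-" "bot"
198.51.100.7 - - [16/Sep/2022:00:50:00 -0700] "GET / HTTP/1.1" 200 4096 "-" "browser"
EOF

# Step 1: grab the single busiest IP (same pipeline as above, trimmed to one result).
top_ip=$(awk '{ print $1 }' "$log" | sort | uniq -c | sort -rn | head -n 1 | awk '{ print $2 }')

# Step 2: count which URLs ($7) that one IP requested.
top_ip_urls=$(awk -v ip="$top_ip" '$1 == ip { print $7 }' "$log" | sort | uniq -c | sort -rn)

echo "Busiest IP: $top_ip"
printf '%s\n' "$top_ip_urls"
rm -f "$log"
```

Passing the IP in with `-v ip=...` and comparing with `==` avoids the regex-dot problem you get when grepping for an IP address.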
Popular pages (excluding images/css/js)
Sometimes you just want to know what pages are being hit, right?
Remember how I said you actually need to know what your log looks like? For me, $7 is the 7th ‘item’ in my access log:
192.0.114.84 - - [16/Sep/2022:00:49:05 -0700] "GET /wp-content/uploads/2019/10/Pure.jpg HTTP/1.1" 200 257552 "-" "Photon/1.0"
Counting is weird, I know, but the 7th is ‘/wp-content/uploads…’ so I know that the command has to use $7. BTW Photon there just means I use WordPress’s image stuff via Jetpack.
awk '{print $7}' access.log | grep -ivE '(mod_status|favico|crossdomain|alive\.txt)' | grep -ivE '\.(gif|jpg|png|js|css)(\?|$)' | \
sed 's/\/$//' | sort | \
uniq -c | sort -rn | head -25
That returns each unique URL with its hit count:
862 /xmlrpc.php
539 /wp-admin/admin-ajax.php
382 /wp-login.php
75 /wp-cron.php?doing_wp_cron
And it’s not a shock those are the high hits. Nice try folks. I use things to protect me. But before we get into that…
IPs Hitting a Specific Page
Now let’s say you’re trying to figure out what numb nut is hitting a specific page on your site! For example, I have a page called “electric-boogaloo” and I’m pretty sure someone specific is hammering that page. I’ll do this:
awk -F'[ "]+' '$7 == "/electric-boogaloo/" { ipcount[$1]++ }
END { for (i in ipcount) {
printf "%15s - %d\n", i, ipcount[i] } }' access.log
That spits out a short list:
12.34.56.789 - 3
1.234.567.890 - 4
It’s okay that the command spans multiple lines. Check those IPs and you might find your culprit.
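If you find yourself doing this for more than one page, you can wrap it in a little shell function. This is only a sketch: `ips_for_path` is a name I made up, the log lines are invented sample data, and it assumes the same log layout as above. It also sorts the result so the heaviest hitter lands on top.

```shell
# ips_for_path is a hypothetical helper name, not a standard tool.
# Usage: ips_for_path /some/path/ access.log
ips_for_path() {
  awk -F'[ "]+' -v path="$1" '
    $7 == path { ipcount[$1]++ }
    END { for (ip in ipcount) printf "%d %s\n", ipcount[ip], ip }
  ' "$2" | sort -rn
}

# Made-up sample data so the sketch runs on its own.
log=$(mktemp)
cat > "$log" <<'EOF'
203.0.113.5 - - [16/Sep/2022:00:49:05 -0700] "GET /electric-boogaloo/ HTTP/1.1" 200 512 "-" "bot"
203.0.113.5 - - [16/Sep/2022:00:49:06 -0700] "GET /electric-boogaloo/ HTTP/1.1" 200 512 "-" "bot"
198.51.100.7 - - [16/Sep/2022:00:50:00 -0700] "GET /electric-boogaloo/ HTTP/1.1" 200 512 "-" "browser"
198.51.100.7 - - [16/Sep/2022:00:50:01 -0700] "GET /about/ HTTP/1.1" 200 512 "-" "browser"
EOF

page_hits=$(ips_for_path /electric-boogaloo/ "$log")
printf '%s\n' "$page_hits"
rm -f "$log"
```

The count is printed first here (unlike the version above) so that `sort -rn` can order the lines by it.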
What ModSecurity Rule Hates Me
I have a love/hate relationship with ModSecurity. My first WP post (not question) in the forums was about it. It’s great and protects things, especially when you tie it into IPTables and have it auto-ban people… Until you accidentally block your co-editor-in-chief. Whoops!
For this one, you’ll need to ask the person impacted for their IPv4 address. Then you can run this:
zgrep --no-filename IPADDRESS error.log*|grep --color -o "\[id [^]]*\].*\[msg [^]]*\]"|sort -h|uniq -c|sort -h
That will loop through all the error logs (on DreamHost they’re in the same location as the access logs) and tell you what rules someone’s hitting. Then you can tweak the rules.
Of course, if you’re not the root admin, you’ll want to ping your support reps with “Hey, found this, can you help?” They usually will.
Don’t feel bad about this, and don’t blame the reps for this. ModSecurity is constantly changing, because jerks are constantly trying to screw with your site for funzies and profit (I guess). Every decent host out there is hammering the heck out of their rules constantly. They update and tweak and change. Sometimes when they do that, it reveals that a rule is too restrictive. Happens all the time.
Long Running Requests
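Another cool thing to check: “What’s making my site slow?” often comes down to “What requests are taking too long?”
awk '{print $10,$7}' access.log | grep -ivE '\.(gif|jpg|jpeg|png|css|js)(\?|$)' | awk '{secs=0.000001*$1;req=$2;printf("%.2f minutes req time for %s\n", secs / 60,req )}' | sort -rn | head -50
That gets me the 50 slowest requests (change the 50 to taste). It assumes field 10 in your log is the request time in microseconds (Apache’s %D). In the sample line I showed earlier, field 10 is the response size, so check your own format before you trust the numbers. For me it happened to list MP4s, so I added that to my little exclusion list where .gif etc. are listed.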
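A single slow hit can be a fluke, so it can help to average the time per URL. The sketch below does that under one big assumption: that the server logs with a hypothetical Apache LogFormat of `%h %l %u %t "%r" %>s %D "%{Referer}i" "%{User-Agent}i"`, which puts the request time in microseconds (%D) in field 10. The log lines are invented to match that format; yours may well differ.

```shell
# Made-up log lines for the hypothetical format described above:
# field 7 is the URL, field 10 is the request time in microseconds (%D).
log=$(mktemp)
cat > "$log" <<'EOF'
203.0.113.5 - - [16/Sep/2022:00:49:05 -0700] "GET /slow-page/ HTTP/1.1" 200 3000000 "-" "browser"
203.0.113.5 - - [16/Sep/2022:00:49:09 -0700] "GET /slow-page/ HTTP/1.1" 200 1000000 "-" "browser"
198.51.100.7 - - [16/Sep/2022:00:50:00 -0700] "GET /fast-page/ HTTP/1.1" 200 250000 "-" "browser"
EOF

# Sum and count the microseconds per URL, then print:
# average seconds, number of hits, URL (slowest average first).
slowest=$(awk '{ total[$7] += $10; hits[$7]++ }
  END { for (url in total)
          printf "%.2f %d %s\n", total[url] / hits[url] / 1000000, hits[url], url }' "$log" | sort -rn)

printf '%s\n' "$slowest"
rm -f "$log"
```

With the sample data, /slow-page/ averages 2.00 seconds over 2 hits and /fast-page/ comes in at 0.25 seconds.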
Who’s Referring?
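A referrer basically answers “What site sent people here?”
host="example.com"
awk '{print $11}' access.log | \
grep -vE "(^\"-\"\$|/www\.$host|/$host)" | \
sort | uniq -c | sort -rn | head -25
Set host to your own domain first; the grep then drops empty referrers and links from your own site. This one is a little weird to look at. Without that grep line, the output looks something like this: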
15999 "-"
31 "www.google.com"
8 "example.com"
4 "binance.com"
The ‘example.com’ means “People came here from here” which always confuses me. More impressive is that top one. It means “People came here directly.” Except I know I’m using Nginx as a proxy, so that’s likely to be a little wonky.
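Real referrers are usually full URLs, so every search query or page shows up as its own line. Here is a sketch that boils each referrer down to just its hostname before counting. It assumes the referrer is the second quoted string on each line (as in my logs); the sample lines and the `ref_hosts` name are made up for the demo.

```shell
# Invented sample lines; the referrer is the second quoted string on each line.
log=$(mktemp)
cat > "$log" <<'EOF'
203.0.113.5 - - [16/Sep/2022:00:49:05 -0700] "GET / HTTP/1.1" 200 512 "https://www.google.com/search?q=one" "browser"
203.0.113.6 - - [16/Sep/2022:00:49:06 -0700] "GET /about/ HTTP/1.1" 200 512 "https://www.google.com/search?q=two" "browser"
198.51.100.7 - - [16/Sep/2022:00:50:00 -0700] "GET / HTTP/1.1" 200 512 "-" "browser"
EOF

# Splitting on double quotes makes the referrer field 4.
# Splitting that on "/" leaves the hostname in piece 3 of scheme://host/path.
# Lines with an empty referrer ("-") are skipped.
ref_hosts=$(awk -F'"' '$4 != "-" { split($4, part, "/"); print part[3] }' "$log" | sort | uniq -c | sort -rn)

printf '%s\n' "$ref_hosts"
rm -f "$log"
```

Splitting on the quote character instead of spaces also sidesteps the field-counting problem when a request line contains extra spaces.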
What are your favourite cli tools?
Do you have one? Drop a line in the comments! (Be wary about posting code, it can get weird in comments).