A Year of Hits and Spiders

Right around this time last year I set up a simple hit logger on my server. Nothing complex, just a dump of the server superglobal with some extra metadata. Much of this is already partially captured by the server itself in its access logs: what file was requested, who requested it (user agent and IP), how the server responded (HTTP response code), the time of the request, and so on. Now, one year later, I finally opened it up to take a look at what I've collected.
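
For context, a logger like that only takes a few lines of PHP. Here's a minimal sketch of the idea, not my exact script; the log path is just a placeholder:

    <?php
    // Minimal hit logger sketch: append one JSON line per request.
    $hit = array(
        'time'   => date('c'),  // ISO 8601 timestamp of the request
        'server' => $_SERVER,   // the superglobal: request URI, user agent, IP, and so on
    );
    // One JSON object per line keeps the log trivial to parse later.
    file_put_contents('/path/to/hits.log', json_encode($hit) . "\n", FILE_APPEND | LOCK_EX);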

Disclaimers: I'm not a sysadmin (although I try to be at times), this is based on roughly half-a-million hits, and this is the first time I've really tried to analyze a server log.

So, the basics. Every time a spider (like Googlebot or an RSS feed reader) or a user requests anything from one of my web servers, a 'hit' occurs. This could be a web page, or it could be robots.txt, a photo, or a stylesheet. This makes server-side logging a valuable complement to traditional client-side analytics, like Google Analytics: server hits record spiders and asset downloads, while client-side analytics really only track web page visits. So, what kind of resources are being requested from my servers?

    // requested resource breakdown
    49% HTML
    30% image
     8% robots.txt
     4% RSS
     4% javascript
     3% CSS stylesheet
     1% favicon
     1% sitemap.xml
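
Getting those numbers mostly comes down to bucketing each logged request URI by its path or file extension. A sketch of that kind of pass over the log (the bucket rules and log format here are illustrative, not my exact analysis script):

    <?php
    // Bucket logged requests by resource type (illustrative sketch).
    function classify($uri) {
        $path = parse_url($uri, PHP_URL_PATH);
        if ($path === '/robots.txt')  return 'robots.txt';
        if ($path === '/sitemap.xml') return 'sitemap.xml';
        $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        if (in_array($ext, array('jpg', 'jpeg', 'png', 'gif'))) return 'image';
        if ($ext === 'ico') return 'favicon';
        if ($ext === 'css') return 'CSS stylesheet';
        if ($ext === 'js')  return 'javascript';
        if (in_array($ext, array('rss', 'atom', 'xml'))) return 'RSS';
        return 'HTML'; // pages, including extensionless pretty URLs
    }

    $counts = array();
    foreach (file('/path/to/hits.log') as $line) {
        $hit  = json_decode($line, true);
        $type = classify($hit['server']['REQUEST_URI']);
        $counts[$type] = isset($counts[$type]) ? $counts[$type] + 1 : 1;
    }
    arsort($counts);
    $total = array_sum($counts);
    foreach ($counts as $type => $n) {
        printf("%2d%% %s\n", round(100 * $n / $total), $type);
    }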

This information is really valuable. First, the sheer volume of requests for robots.txt is a good indicator that there are a bloody ton of spiders hitting me. That file contains some basic rules they should follow (although it's totally up to them whether they listen or not). After seeing how many requests robots.txt got, I was surprised at how few went to sitemap.xml. I thought spiders would gobble that up, since it's written specifically for them to understand site architecture.
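
For anyone who hasn't looked at one, robots.txt is just a plain-text list of crawl rules. A typical one looks something like this (the paths here are invented for the example):

    User-agent: *
    Disallow: /drafts/

    Sitemap: http://example.com/sitemap.xml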

The second thing that struck me was how many images were requested. I try to do a lot of optimization, keeping file sizes low and implementing different forms of client-side caching, but I still have a lot of photos being downloaded. They are easily the largest and slowest part of my blog (and the upcoming waterfall site). Since they account for almost a full third of my server requests, I'll need to put a larger focus on optimizing them in the near future.
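
Most of that client-side caching is just far-future expiry headers on image responses. Assuming an Apache setup with mod_expires (a sketch of the approach, not necessarily my exact config), the .htaccess rules look roughly like this:

    # Sketch: cache images on the client for a month (assumes Apache with mod_expires).
    <IfModule mod_expires.c>
        ExpiresActive On
        ExpiresByType image/jpeg "access plus 1 month"
        ExpiresByType image/png  "access plus 1 month"
        ExpiresByType image/gif  "access plus 1 month"
    </IfModule>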

    // spiders vs users
    59% spiders
    38% users
     3% unknown

The important caveat with this breakdown is that a user may cause multiple hits. On a typical page load they could request anywhere from 1 to 15 different resources from my server (depending on how much they have cached), while a spider will usually create a single hit per visit. Also, I based this breakdown on user agent, which may be spoofed either way.
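
The split itself came from a crude user-agent check. A sketch of the kind of matching involved (the token list here is only a sample, not the full set):

    <?php
    // Classify a logged user agent as spider, user, or unknown (illustrative sketch).
    function visitor_type($ua) {
        if ($ua === null || trim($ua) === '') {
            return 'unknown';
        }
        // A few common bot tokens; a real list is much longer.
        $botTokens = array('bot', 'spider', 'crawl', 'slurp', 'feedfetcher', 'ezooms', 'turnitin', 'ahrefs');
        foreach ($botTokens as $token) {
            if (stripos($ua, $token) !== false) {
                return 'spider';
            }
        }
        return 'user';
    }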

    // spider breakdown
    39% Google
    21% Bing
    12% Baidu
     6% Yandex
     5% Ezooms
     4% Turnitin
     2% Ahrefs
    11% other

When I tried to narrow down the spider user agents, I was surprised at just how much Google focused on my site. 39% of the spider hits came from a Google-related bot, be it the normal web crawler, the mobile crawler, a feed reader, or a previewer. I'm sure they spoofed a few user agents for validation too, but overall a direct Google-related user agent was quite persistent throughout my logs.

There were some interesting other ones too. Baidu and Yandex are both foreign search engines that seemed fairly interested in my site (although Baidu doesn't seem very interested in returning my domain on a typical name search). Ezooms appears to be a useless spider just meandering around. Turnitin is focused on plagiarism, and Ahrefs is a tool for checking backlinks and similar SEO metrics. There were a lot of miscellaneous user agents with tiny shares that I didn't dig into as much, like Rogerbot and whatnot.

Most of these scripts are trying to be useful. They populate tools and search engines with information about my content so that users and webmasters can find what they need. Some are not as innocent, though. One thing I wanted to drill down on was the evil spiders, the creepy crawlers that try to find vulnerabilities and create havoc. What were they trying to do on my site?

    // malicious spider targets
    59% phpMyAdmin
    16% WordPress
    11% generic forum software
    14% other

There weren't many: less than half a percent of all hits were looking for obvious weaknesses. I was surprised to see how many of them targeted phpMyAdmin rather than CMSes like WordPress. They usually carried along some GET or POST data, default passwords and the like, and tried to hack their way in. Since all of my sites are custom, with no admin areas and few direct writes, and I don't run phpMyAdmin, I'm not too worried. However, I'm still vulnerable to DDoS or server-level attacks, so I might start paying more attention to these malicious spiders and start denying their IPs, maybe even at the server level.
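
Spotting them was mostly a matter of scanning the logged request URIs for paths I don't actually serve. Something along these lines (the path list and log format are illustrative) pulls out the offending IPs as candidates for a deny list:

    <?php
    // Collect IPs probing for admin tools I don't run (illustrative sketch).
    $suspicious = array('phpmyadmin', '/pma', 'wp-login.php', 'wp-admin', '/administrator');
    $offenders  = array();

    foreach (file('/path/to/hits.log') as $line) {
        $hit = json_decode($line, true);
        $uri = strtolower($hit['server']['REQUEST_URI']);
        foreach ($suspicious as $needle) {
            if (strpos($uri, $needle) !== false) {
                $ip = $hit['server']['REMOTE_ADDR'];
                $offenders[$ip] = isset($offenders[$ip]) ? $offenders[$ip] + 1 : 1;
                break;
            }
        }
    }
    arsort($offenders);
    print_r($offenders); // candidates for an iptables or .htaccess deny rule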

I'm not sure if I'll dig any deeper in the short term. It'd be nice to have a regular report with a full breakdown: most popular assets, frequency of bot hits, and all sorts of yummy sysadmin-level stats. First, though, I want to button up some more side projects, which could take most of this year. Still, it was nice to check out the logs and see what was going on, even at a general overview level.