Web Crawlers are Difficult

In my last post I explained what a web crawler is and how its logic can be broken into a few basic steps. I also dived into some of the difficulties I ran into when building my own crawler. Granted, some of those difficulties were self-imposed: I was using PHP, had a limited timeframe, and didn't spend enough time thinking about the structure before wading into the code. There were other difficulties, though, that cropped up because of the way the web works, and those are issues every web crawler must address.

When I initially built the script I decided to keep all of the logic in one codebase, separating out the processes at runtime. This did let me share some nice abstractions between sections, at the cost of cleanliness, a trade-off I'd think harder about in the future. The three main functions of a crawler - fetching, parsing, and analyzing - could be broken into three separate codebases, each optimized for its specific function.
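To make that separation concrete, here is a minimal sketch (in Python rather than the PHP I actually used, and with all names and the sample page invented) of the three stages communicating only through plain data, so each could later move into its own codebase:

```python
from dataclasses import dataclass

# Hypothetical sketch: fetch, parse, and analyze as separate components
# that pass simple data between them instead of sharing internal state.

@dataclass
class Page:
    url: str
    html: str

def fetch(url: str) -> Page:
    # Stand-in for a real HTTP fetch.
    return Page(url=url, html="<html><title>Example</title></html>")

def parse(page: Page) -> dict:
    # Stand-in for real parsing: naively extract the title.
    start = page.html.find("<title>") + len("<title>")
    end = page.html.find("</title>")
    return {"url": page.url, "title": page.html[start:end]}

def analyze(doc: dict) -> dict:
    # Stand-in for analysis: apply one simple rule.
    return {"url": doc["url"], "title_ok": 0 < len(doc["title"]) <= 60}

result = analyze(parse(fetch("https://example.com/")))
```

Because each stage only consumes the previous stage's output, any one of them could be rewritten, scaled, or moved to another machine without touching the other two.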

As I mentioned in the last post, fetching content can take a long time and eat up a lot of bandwidth, especially once you start pulling resources beyond plain HTML pages. It's easy to see that Google has poured a lot of attention into this by looking at their webmaster tools, where they place an emphasis on crawl issues and delays. While their systems are very sophisticated, with a single server capable of making numerous requests simultaneously, they still follow a pattern that optimizes what to pull and when, based on how often a page changes and what value they assign to its content. If a fetch fails (a 404 or other error response), valuable resources were wasted, something Google both accounts for and warns against.
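One simple way to express "pull based on how often the page changes" is an adaptive recrawl interval: back off when a page hasn't changed, tighten when it has. This is a sketch under my own assumptions; the bounds and multipliers are invented for illustration, not anything Google publishes:

```python
# Hypothetical adaptive recrawl scheduling: double the wait when a page
# is unchanged, halve it when it changed, clamped to sane bounds.

MIN_INTERVAL = 3600          # one hour, in seconds (assumed floor)
MAX_INTERVAL = 30 * 86400    # thirty days (assumed ceiling)

def next_interval(current: int, changed: bool) -> int:
    if changed:
        return max(MIN_INTERVAL, current // 2)
    return min(MAX_INTERVAL, current * 2)

interval = 86400                                   # start at one day
interval = next_interval(interval, changed=False)  # unchanged: back off
interval = next_interval(interval, changed=True)   # changed: recrawl sooner
```

Even this toy version captures the trade-off: stable pages cost less bandwidth over time, while fresh pages get revisited quickly.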

With all the issues involved in fetching new data, it only makes sense to cache it locally. Once the crawler has gone through the trouble of reaching out, connecting, and downloading the content, it is easiest to save that version locally so your processing can happen at its own pace. How long a page should be cached, and what weight should be given to older copies, are factors the codebase needs to make configurable. This also opens up some interesting opportunities for watching websites change over time.
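A minimal version of such a cache, assuming a simple time-to-live policy (the class and its parameters are hypothetical, not from my original crawler), might look like:

```python
import time

# Hypothetical local page cache with a configurable TTL, so analysis
# can run against saved copies instead of re-fetching.

class PageCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # url -> (fetched_at, html)

    def put(self, url: str, html: str, now: float = None) -> None:
        self._store[url] = (now if now is not None else time.time(), html)

    def get(self, url: str, now: float = None):
        entry = self._store.get(url)
        if entry is None:
            return None
        fetched_at, html = entry
        age = (now if now is not None else time.time()) - fetched_at
        # Expired entries return None, forcing a re-fetch upstream.
        return html if age <= self.ttl else None

cache = PageCache(ttl_seconds=3600)
cache.put("https://example.com/", "<html>hello</html>", now=0)
```

The `now` parameter is just there to make the expiry logic testable; in real use you'd let it default to the clock. Weighting older copies differently would hang off the same `fetched_at` timestamp.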

Just based on the fetching and caching that crawlers perform, one can draw some interesting conclusions about how to work with them better. Downtime is bad (for both crawlers and end users), crawlers should be allowed to choose their own rates, and a single bad update to a site could sit in their cache for a very long time.

Structuring your content is also important. There is already a lot to keep in mind when building out your content; just be aware that crawlers are looking for nicely formatted structure, and easy-to-understand markup helps. By keeping things simple you help the crawler out and lower the risk of it misunderstanding your content. There are few hard rules or standards for web developers to follow (mostly best practices and industry conventions), but the engineers behind web crawlers expect content to follow some general guidelines.
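To see why clean markup matters, consider how little code a crawler needs to pull headings out of well-structured HTML. This sketch uses only Python's standard-library parser, and the sample page is made up:

```python
from html.parser import HTMLParser

# Hypothetical heading extractor: with tidy, properly nested markup,
# a handful of callbacks recovers the page's outline.

class HeadingExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # heading tag we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current:
            self.headings.append((self._current, data.strip()))

html = "<h1>Web Crawlers</h1><p>Intro</p><h2>Fetching</h2>"
parser = HeadingExtractor()
parser.feed(html)
```

With mismatched or improperly nested tags, this kind of simple extraction starts guessing, which is exactly the misunderstanding you want to avoid.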

One of the most pivotal pieces of functionality in my web crawler involved assigning value to content. I got off easy: I could run defined rules with adjustable thresholds… the title was too long, a duplicate h1 existed, and so on. It would be easy to give different websites a score based on passing or failing these rules. Search engine crawlers have to be much more nuanced. They don't exist to run general validation rules but to connect end users with the content they want. After building my crawler I have a much higher respect for the relative value assigned to content, and I'll be keeping a closer eye on what the big players (cough cough Google) see as valuable content. No, I'm not talking about link farming.
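The pass/fail rules I'm describing can be sketched in a few lines; the rule names and thresholds below are invented to illustrate the shape, not lifted from my actual code:

```python
# Hypothetical rule-based scoring: each rule inspects a parsed page,
# and the score is simply the fraction of rules that pass.

def title_not_too_long(page: dict) -> bool:
    return len(page.get("title", "")) <= 60  # assumed threshold

def single_h1(page: dict) -> bool:
    return page.get("h1_count", 0) == 1

RULES = [title_not_too_long, single_h1]

def score(page: dict) -> float:
    passed = sum(1 for rule in RULES if rule(page))
    return passed / len(RULES)

page = {"title": "Web Crawlers are Difficult", "h1_count": 2}
```

Adding a rule is just appending a function to the list, which is roughly the "add and edit rules with minimal rewriting" property I wanted. A search engine's relevance scoring is a different beast entirely.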

I stepped away from this project a long time ago. It was invaluable in helping me understand some of the rudimentary functions of web crawlers, and, as a web developer, in figuring out better ways of structuring pages and content. The project was also much more complicated than I thought it would be. Sure, the crawler works and will spit out detailed reports for a website, and I built it to make adding and editing rules possible with minimal rewriting. Still, my first attempt is bulky and slow. Someday I'd like to circle back, if only to have an internal testing tool for my own websites, but for now I'm fine with letting the beast rest for a few more months.