Building a Web Crawler

A web crawler is a script that goes out and crawls the web. Pretty surprising, eh? There are many crawlers out there, including the famous GoogleBot that attempts to index every site it bumps into and nefarious spambots that look for and exploit vulnerabilities. The logic behind them seems simple: pull and process each page they come across. It wasn't until I decided to build my own crawler that I understood some of the complexities these scripts have to deal with.

My project started with a few basic specs. There was a cap on time, perhaps the most severe and obvious limitation on complexity. The idea was to provide statistics on page structure, running a series of tests that looked at traditional SEO best practices (keyword density, meta usage, anchor text, etc). These rules would be flexible in their scoring and usage. Reports needed to be rendered and saved so someone could see their different scores change as they made updates to their site. To some extent, I was building SEOmoz's rogerbot.
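As a rough illustration of what "flexible in their scoring and usage" might look like in PHP, each check could sit behind a common interface. The names and thresholds below are hypothetical, not the project's actual code.

```php
<?php
// Illustrative sketch only: hiding each check behind a common interface
// keeps the scoring and usage flexible. The names (SeoRule,
// TitleLengthRule) and the 10-70 character window are hypothetical.
interface SeoRule
{
    /** Label shown on the saved report. */
    public function name();

    /** Score a single parsed page from 0 to 100. */
    public function score(array $page);
}

class TitleLengthRule implements SeoRule
{
    public function name()
    {
        return 'Title length';
    }

    public function score(array $page)
    {
        $length = isset($page['title']) ? strlen($page['title']) : 0;
        return ($length >= 10 && $length <= 70) ? 100 : 0;
    }
}
```

A report run would then loop over every rule, store the scores, and let someone compare them against earlier runs as they update their site.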

Outside of the limit on time, I was also limited by my programming skill. As much as I like to delve into other languages, I am most familiar with PHP, so it made sense to start there. PHP does have some crazy overhead and is single-threaded by design, so I knew this script was going to be slow and cumbersome. Then I thought about it some more and realized that the process could be broken up into several base steps, which would help with that.

Fetch the Web Page

Response times are highly variable. Sub-second responses are what all the cool kids are aiming for, but there are a lot of not-cool-kid websites out there. I estimated the request time could be up to 30 seconds, which means that in a ten-minute run I would only crawl 20 pages of a site. My blog is somewhere over 400 pages, so in a worst-case scenario it would take well over three hours to crawl it. And that's not even counting the parsing and rule executions. It was obvious that this step should run separately and that the other processes should exist completely outside of it.
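Here is a minimal sketch of what that isolated fetch step could look like, assuming cURL with hard timeouts and a fetched/ directory for the raw HTML. The timeout values and file layout are my assumptions, not the project's actual settings.

```php
<?php
// Sketch of an isolated fetch step: grab one URL, enforce timeouts so a
// slow host can't eat the whole run, and persist the raw HTML so a
// separate process can parse it later. Timeouts and the fetched/
// directory are assumptions.
function fetch_page($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);     // give up on slow connects
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);            // the 30-second worst case
    curl_setopt($ch, CURLOPT_USERAGENT, 'MyCrawler/0.1 (proof of concept)');

    $html = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    return ($html !== false && $code < 400) ? $html : null;
}

$url  = 'http://example.com/';
$html = fetch_page($url);
if ($html !== null) {
    // One file per page; hashing the URL keeps the filename safe.
    file_put_contents('fetched/' . md5($url) . '.html', $html);
}
```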

Run Page-Wide Rules

Once the script had fetched a few chunks of content I could start parsing them one at a time. Some of the scripts would check for general best practices (title length, h1 usage, alt tags for images), others would run more complex rules (keyword density, URL structure), and yet others would pull links out and hand them back to the crawler to go fetch new pages.
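As one example of the more complex rules, a rough keyword density check can get by on plain string handling. The 1-4% band below is an illustrative threshold, not something from the original spec.

```php
<?php
// Rough sketch of a keyword density check over a fetched page.
// Density is approximated as keyword occurrences per total words;
// the 1-4% "acceptable" band is illustrative only.
function keyword_density($html, $keyword)
{
    $text  = strtolower(strip_tags($html));
    $words = str_word_count($text);
    if ($words === 0) {
        return 0.0;
    }
    $hits = substr_count($text, strtolower($keyword));
    return ($hits / $words) * 100;
}

$html    = file_get_contents('fetched/' . md5('http://example.com/') . '.html');
$density = keyword_density($html, 'web crawler');
echo round($density, 2) . "% - " . (($density >= 1 && $density <= 4) ? 'ok' : 'check') . "\n";
```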

One of the most frustrating things about this step is trying to figure out how to parse the content effectively. Ideally, HTML would be written in such a way that you could throw the content straight at SimpleXML, but many websites have unclosed tags or orphan nodes that throw errors (I did not go the DOM extension route, although in retrospect that may have been a valid way to go). It is difficult to write a script to pull out chunks of content because there are so many ways for a web developer to structure that content.
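For the record, the DOM extension route mentioned above can be coaxed into swallowing broken markup by silencing libxml errors, and the repaired tree can even be handed back to SimpleXML. A sketch of that alternative:

```php
<?php
// The DOM extension route, sketched: libxml can be told to collect
// errors instead of throwing them, so unclosed tags and orphan nodes
// don't kill the parse.
$html = '<title>Broken page</title><h1>Heading<p>No closing tags';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);          // tolerates tag soup, unlike simplexml_load_string()
libxml_clear_errors();

// Query the repaired tree directly...
$xpath = new DOMXPath($dom);
$title = $xpath->query('//title')->item(0);
echo $title ? trim($title->textContent) . "\n" : "No title found\n";

// ...or hand it to SimpleXML for the familiar syntax.
$xml = simplexml_import_dom($dom);
```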

Run Site-Wide Rules

Some rules depend on the entire site, like duplicate titles or duplicate page content. Once all of the pages had been downloaded and parsed into a more flexible format, I could start checking for these items. Unlike the first few steps, which could run simultaneously, this one had to wait for the very end.
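A duplicate-title check is a good example of a rule that only works at this stage. The sketch below assumes the per-page step hands over an array keyed by URL, which is my guess at a data shape, not the project's actual format.

```php
<?php
// Sketch of a site-wide rule: flag titles shared by more than one URL.
// The $pages shape (url => parsed data) is an assumed hand-off format.
function duplicate_titles(array $pages)
{
    $byTitle = array();
    foreach ($pages as $url => $page) {
        $title = isset($page['title']) ? trim($page['title']) : '';
        $byTitle[$title][] = $url;
    }
    // Keep only titles that appear on two or more pages.
    return array_filter($byTitle, function ($urls) {
        return count($urls) > 1;
    });
}

$pages = array(
    '/about'   => array('title' => 'My Blog'),
    '/contact' => array('title' => 'My Blog'),
    '/post-1'  => array('title' => 'Building a Web Crawler'),
);
print_r(duplicate_titles($pages));   // flags /about and /contact
```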

There are some additional steps that would be nice to look at as well: HTTP responses, CSS optimization, a breakdown of response time (DNS lookup vs actual send), sitemap.xml comparisons, and so much more. As I mentioned early on, though, I did have a time limit and was aiming more for a proof of concept at this stage.
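If the response-time breakdown ever makes the cut, cURL already exposes most of the pieces; a quick sketch of pulling DNS lookup versus transfer times out of curl_getinfo():

```php
<?php
// Sketch: curl_getinfo() already reports the timing breakdown mentioned
// above - DNS lookup, connect, time to first byte, and total time.
$ch = curl_init('http://example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);

$info = curl_getinfo($ch);
curl_close($ch);

printf(
    "DNS: %.3fs  connect: %.3fs  first byte: %.3fs  total: %.3fs\n",
    $info['namelookup_time'],
    $info['connect_time'],
    $info['starttransfer_time'],
    $info['total_time']
);
```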

In the next post I'll talk more about the structure of the crawler and some of the roadblocks I ran into during the build and testing phases. There were plenty of frustrations with this project, enough to make me hesitant to wade back in a year later and try to salvage the work.