Archiving (Scraping) a Site

Last week I wrote a little post and script to archive a directory. This can be really helpful if you're looking to back up some legacy code, especially a deeply nested old website, but what if the files only live online? How do you reach out to the internets and archive (scrape) a front-end?

There are a few steps needed to accomplish this. Each step has its own logic and its own PHP extension dependencies, so I found it helpful to break the script up into obvious stages. This also makes it easier for users to jump in and modify pieces if they need to, since my initial version makes a lot of assumptions.

Grabbing a Resource

Reaching out and downloading a resource from a website isn't that hard, regardless of whether the resource is an HTML page, a JPG image, or a CSS asset. I leaned on the old cURL library for this step. There are plenty of ways to manipulate the headers sent in case you need to worry about scrape blockers (like I did when I was playing with LinkedIn), plus cURL has helper functions to get the Content-Type of the response, which will be helpful for the next step.

  // first, we scrape and save locally
  $curl_handle = curl_init();
  curl_setopt($curl_handle, CURLOPT_HEADER, false);
  curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, true);

  // $link_array is a list of the links to scrape, start with domain
  $link_array[] = $domain;
  for ($i = 0; $i < count($link_array); $i++) {
      curl_setopt($curl_handle, CURLOPT_URL, $link_array[$i]);
      $curl_result = curl_exec($curl_handle);
      $curl_header = curl_getinfo($curl_handle, CURLINFO_CONTENT_TYPE);
  }
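The header manipulation I mentioned is just a couple more curl_setopt() calls on the same handle. Here's a quick sketch; the user-agent string and header values are placeholder assumptions for illustration, not what my script actually sends:

```php
<?php
// stand-alone for illustration; in the script these calls would go
// right after the curl_init() above, on the same handle
$curl_handle = curl_init();

// pretend to be a normal browser in case the site blocks obvious scrapers
// (this user-agent string is just an example value)
curl_setopt($curl_handle, CURLOPT_USERAGENT,
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36');
curl_setopt($curl_handle, CURLOPT_HTTPHEADER, array(
    'Accept: text/html,application/xhtml+xml,*/*;q=0.8',
    'Accept-Language: en-US,en;q=0.5',
));
// follow redirects so moved pages still come back with content
curl_setopt($curl_handle, CURLOPT_FOLLOWLOCATION, true);
```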

Parsing the Resource

Once you get the first page, how do you get more? Well, there are a few ways to crawl a site. You can depend on sitemap.xml (if it exists), parsing the XML to get a list of all the HTML pages. You could pre-populate $link_array if you know the full structure of the site. Or you could parse each response and look for more links. Like a crawler.
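For what it's worth, the sitemap route is only a few lines with SimpleXML. A sketch, using an inline example sitemap (my script crawls instead, so none of this is in it, and a real sitemap would be fetched with cURL first):

```php
<?php
// seed $link_array from a sitemap.xml instead of crawling
// (inline example sitemap in the standard format)
$sitemap_xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about.html</loc></url>
</urlset>
XML;

$link_array = array();
$sitemap = simplexml_load_string($sitemap_xml);
// each <url> node holds one page location in its <loc> child
foreach ($sitemap->url as $url_node) {
    $link_array[] = (string) $url_node->loc;
}
```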

Here is where my assumptions start to leak in. I only wanted HTML, CSS, and images from the site, not JavaScript or other linked assets. For parsing the HTML I leaned on DOMDocument, a helpful PHP class that has some shortcomings when it comes to improperly formed documents. I assumed that HTML pages would be linked in 'a' tags, CSS files in 'link' tags, and images in 'img' tags. Here's a snippet that uses DOMDocument to grab these links and append them to the array of links.

  // only run if content-type is html
  $document = new DOMDocument();
  @($document->loadHTML($curl_result)); // darn you invalid html

  // grab all normal 'a' links
  $a_node_list = $document->getElementsByTagName('a');
  foreach ($a_node_list as $a_node) {
      $link = $a_node->attributes->getNamedItem('href')->nodeValue;
      $link = get_scrapeable_link($link, $link_array[$i]);
      if (should_add_to_scrape_list($link, $link_array))
          $link_array[] = $link;
  }

  // grab css file links
  $link_node_list = $document->getElementsByTagName('link');
  foreach ($link_node_list as $link_node) {
      $link = $link_node->attributes->getNamedItem('href')->nodeValue;
      $link = get_scrapeable_link($link, $link_array[$i]);
      if (should_add_to_scrape_list($link, $link_array))
          $link_array[] = $link;
  }

  // grab image links
  $image_node_list = $document->getElementsByTagName('img');
  foreach ($image_node_list as $image_node) {
      $link = $image_node->attributes->getNamedItem('src')->nodeValue;
      $link = get_scrapeable_link($link, $link_array[$i]);
      if (should_add_to_scrape_list($link, $link_array))
          $link_array[] = $link;
  }

DOMDocument has a few handy methods to lean on, allowing me to target specific nodes and their attributes without complicated regular expressions. I did abstract out a few pieces of logic, like the cleanup of the URLs (to get rid of GET parameters and anchor fragments, plus figure out relative link logic) and the check to make sure we actually want to crawl the link (we only want to scrape internal links).
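For reference, here's roughly what those two helpers could look like. The function names match the calls above, but these bodies are my own guesses at the logic, not the script's actual implementation:

```php
<?php
// hypothetical sketch: clean a raw href/src into a scrapeable absolute URL
function get_scrapeable_link($link, $current_url) {
    // get rid of GET parameters and anchor fragments
    $link = preg_replace('/[?#].*$/', '', $link);
    if ($link === '') return '';
    // already absolute? nothing to do
    if (preg_match('#^https?://#i', $link)) return $link;
    $parts = parse_url($current_url);
    $base = $parts['scheme'] . '://' . $parts['host'];
    // root-relative links hang off the host
    if ($link[0] === '/') return $base . $link;
    // path-relative links resolve against the current page's directory
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $base . $dir . $link;
}

// hypothetical sketch: only queue internal links we haven't seen yet
function should_add_to_scrape_list($link, $link_array) {
    global $domain; // the domain being scraped
    return $link !== ''
        && strpos($link, $domain) === 0
        && !in_array($link, $link_array);
}
```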

This is great, but what about image links in the CSS? DOMDocument doesn't do CSS. For this I used a regular expression. There are CSS parsers out there that other people have built, but they seemed too hefty for such a simple task.

  // only run if content-type is css
  preg_match_all('/url\([\'"]?(.+?)[\'"]?\)/i', $curl_result, $matches);
  foreach ($matches[1] as $link) {
      $link = get_scrapeable_link($link, $link_array[$i]);
      if (should_add_to_scrape_list($link, $link_array))
          $link_array[] = $link;
  }

So, assuming the Content-Type headers are accurate, I now had a loop that would go through an HTML page, grab all the links to other pages, images, and stylesheets, and then continue to loop through those and parse them until everything (well, everything linked) was grabbed from a single domain. I could have saved it all directly into the archive object. Instead I saved it to a temp directory. Between juggling cURL requests and an Archive object, I was concerned about memory usage. Saving each resource locally (after parsing out the wanted links) seemed like a good way to shelve it until I wanted to archive everything.

  // original loop
  for ($i = 0; $i < count($link_array); $i++) {
      // curl execution step here
      // content-type detection and parsing here

      // now, figure out what to name the file locally and save
      $local_path = $link_array[$i];
      if (substr($link_array[$i], -1) == '/')
          $local_path .= 'index.html';
      $local_path = str_replace($domain, '', $local_path);
      $local_path_list = explode('/', $local_path);
      $local_file = array_pop($local_path_list);

      $path = $directory_path;
      foreach ($local_path_list as $local_path_piece) {
          $path .= $local_path_piece . DIRECTORY_SEPARATOR;
          if (!is_dir($path))
              mkdir($path);
      }

      $file_handle = fopen($path . $local_file, 'w');
      fwrite($file_handle, $curl_result);
      fclose($file_handle);
  }

There are two main pieces to keep in mind here. First, many websites simplify their URLs by not referencing an exact file (like my websites do). In those cases, I saved the response as an 'index.html' within the URL structure. And that's the second piece - I wanted to maintain the directory structure, which meant mapping the URL structure onto a directory structure.


Things got easy from this point. Using a script similar to my archiver, all I had to do was loop through the temp directory, add the files to the archive object, and save. Since the files were saved in a temp directory, I also needed to delete that directory afterwards. One catch - you need to save the archive before deleting the files it references, or else the archive won't capture them. So, again, take the task one small step at a time.
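That stage could be sketched with ZipArchive and the SPL directory iterators. The demo setup, paths, and archive name below are assumptions for illustration (the real logic lives in my archiver script), but the ordering - close the zip before deleting - is the part that matters:

```php
<?php
// demo setup: a throwaway temp tree standing in for the scraper's output
// ($directory_path and $archive_name are assumed names, not the script's)
$directory_path = 'scrape_temp';
$archive_name = 'site_archive.zip';
mkdir($directory_path . '/css', 0777, true);
file_put_contents($directory_path . '/index.html', '<html></html>');
file_put_contents($directory_path . '/css/site.css', 'body {}');

// add every file under the temp directory to the archive
$archive = new ZipArchive();
$archive->open($archive_name, ZipArchive::CREATE | ZipArchive::OVERWRITE);
$files = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($directory_path, FilesystemIterator::SKIP_DOTS)
);
foreach ($files as $file) {
    // store each file under its path relative to the temp directory
    $local_name = substr($file->getPathname(), strlen($directory_path) + 1);
    $archive->addFile($file->getPathname(), $local_name);
}
// close() is what actually writes the zip - it MUST happen before the deletes
$archive->close();

// now it's safe to remove the temp tree (children first, then their parents)
$items = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($directory_path, FilesystemIterator::SKIP_DOTS),
    RecursiveIteratorIterator::CHILD_FIRST
);
foreach ($items as $item) {
    $item->isDir() ? rmdir($item->getPathname()) : unlink($item->getPathname());
}
rmdir($directory_path);
```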

And that's it! The final script can be found on my GitHub account (scraper). It's still very beta, since it's hard to predict how web pages are structured. I did test it on a handful of sites and it seems to handle basic structures well, and tweaking the script to handle edge cases shouldn't be too tough.