Working with the Twitter Archive

Over the last few months I've been having a lot of fun with the Twitters. It all started simply enough, just upgrading my lifestream to pull from an authenticated feed. Now I'm playing with a few Twitter bots, looking into the streaming API, and even working on a PHP library with David Kryzaniak. Deciding to 'go back to the basics', I recently downloaded an archive of my total tweet history with the plan to complete my lifestream all the way back to March of 2009.

Right now my lifestream goes back to March 2010, so I'm only missing a year. However, I'm missing some other pieces too: retweet count, favorite count (a recent addition to the statuses response), and a more elegant 'media' entity display (actually show images inline on my lifestream instead of an external link).

So, the first step was to download the archive. Twitter made archives available last December (twitter archive blog post). The link to download the archive is available on the settings page. It was easy and there was almost no wait for them to package up my zip.

Once I downloaded and extracted the files, though, I was surprised to see all the extra junk. They didn't structure this to only contain tweet data. This was a fully working front-end site, with CSS, images, and JavaScript. The statuses were locked up in JavaScript objects (JSON-formatted, but still not ready for parsing as-is). This wasn't going to be as straightforward as I expected. So I built a quick little script to chew through my archives and dump them into a temp MySQL table.
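To give a rough idea of what that kind of script has to do, here is a minimal Python sketch (not the script I actually ran). It assumes each archive file is a single JavaScript assignment whose right-hand side is a valid JSON array of tweets. The variable name in the sample and the field names `id_str`, `created_at`, and `text` are illustrative assumptions, so check them against your own files.

```python
import json


def parse_archive_js(text):
    """Return the list of tweet dicts from the contents of one archive .js file.

    Assumes the file looks like `<some variable> = [ ...JSON... ]`, so
    everything up to the first '=' can be thrown away.
    """
    _, _, payload = text.partition("=")
    # Drop surrounding whitespace and an optional trailing semicolon.
    payload = payload.strip().rstrip(";")
    return json.loads(payload)


# Made-up sample standing in for one archive file (assumed layout).
sample = (
    "var tweets_2013_01 = \n"
    '[ {"id_str": "1", "created_at": "2013-01-05 12:00:00 +0000", '
    '"text": "hello"} ]'
)

tweets = parse_archive_js(sample)

# Flatten to tuples that are ready to insert into a temp table.
rows = [(t["id_str"], t["created_at"], t["text"]) for t in tweets]
```

From there, `rows` can be handed to whatever database driver you use for a bulk insert.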

After everything was parsed out it was time to see what was there. I had about 4700 tweets (although a good chunk of them were already in the system from my cron jobs). Then I hit my second disappointment - the archives didn't include some standard fields like retweet count or favorite count.

The next step is an ugly one. I'm going to have to loop through every individual tweet and make a separate API call for each one to pull retweets and favorites. Sure, it'd be more efficient to pull a list of tweets and limit it with max_id or since_id, but now I'm a bit frustrated at Twitter. After that I'll have both a full collection and all of the metadata attached to each tweet.
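If you'd rather not make one call per tweet, a batched version might look something like the sketch below. This is a different approach from the per-tweet loop described above, and it assumes the API offers some way to look up several statuses by ID in one request. `fetch_batch` is a hypothetical callable that you would write around your own API client, and the field names `retweet_count` and `favorite_count` are assumptions about the response.

```python
def chunked(ids, size=100):
    """Yield consecutive slices of `ids`, each at most `size` long."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def collect_counts(tweet_ids, fetch_batch, batch_size=100):
    """Map each tweet ID to a (retweet_count, favorite_count) tuple.

    `fetch_batch` is a hypothetical function: it takes a list of IDs and
    returns a list of status dicts for them.
    """
    counts = {}
    for batch in chunked(tweet_ids, batch_size):
        for status in fetch_batch(batch):
            counts[status["id_str"]] = (
                status.get("retweet_count", 0),
                status.get("favorite_count", 0),
            )
    return counts


def fake_fetch(batch):
    """Stand-in for a real API call so the sketch runs offline."""
    return [
        {"id_str": tweet_id, "retweet_count": 1, "favorite_count": 2}
        for tweet_id in batch
    ]


ids = [str(n) for n in range(250)]
counts = collect_counts(ids, fake_fetch)
```

With 250 IDs and a batch size of 100, this makes three calls to `fetch_batch` instead of 250. A real version would also need to respect rate limits between calls.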

I guess I understand why Twitter structured the archives this way, for more visual and marketing people, but in the end it caused me a good amount of extra work. If you're interested in pulling your tweets and doing something with them, whether it's just parsing out the fields or something more involved like my lifestream, be ready for some extra work!