Check your Assumptions

NOTE: yes, I’m getting around to this 2 weeks after the fact. I have long-lived drafts, okay?

If you haven’t been keeping up with tech news in a particularly frantic way, you might not have noticed that Google is changing its privacy policy. On one hand, the change results in a somewhat longer policy, but without the fragmentation across all of Google’s services, so one need not read 5 different policies to use 5 different Google services. On the other hand, it means that data is shareable across application boundaries: your bountiful web searches can be used to customize which YouTube videos you’re shown, and vice versa.

The EFF, being who they are, issued an article alerting us to this state of affairs and how to ameliorate it: by deleting our web history. Looking it over, though, I noticed that this is a ton of data. These records go all the way back to 2005; in other words, it’s a treasure trove of personal data. And you ought to know that I’m somewhat obsessed with personal metrics, and I’m not going to straight-up delete perfectly good data before I’ve gotten my hands on it.

The problem is that Google didn’t offer a nice way to download the data. There’s an RSS feed, but I couldn’t just generate a single feed with all 29k searches in it: instead, I had to step through 1000 items at a time, and who would download 30 XML files by hand? Obviously, this was a job for a lazy-in-some-ways programmer.
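The paging itself is just arithmetic: start at item 0, bump the offset by the page size until you’ve covered everything. A minimal sketch of that, with the caveat that the feed URL and its `num`/`start` parameters here are my own invented placeholders, not Google’s actual API:

```javascript
// Sketch of the feed-paging arithmetic. The URL shape and the
// `num`/`start` parameter names are assumptions for illustration.
function feedUrls(totalItems, pageSize) {
  var urls = [];
  // One URL per page: offsets 0, pageSize, 2*pageSize, ...
  for (var start = 0; start < totalItems; start += pageSize) {
    urls.push('https://www.google.com/history/?output=rss' +
              '&num=' + pageSize + '&start=' + start);
  }
  return urls;
}
```

With 29k searches and 1000 items per page, that’s 29 URLs to walk through, which is exactly the sort of loop you’d rather have a script do.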

There was some prior work, but I threw it out for dumb reasons (CSV?? Flash?? It’s like we’re back in the stone ages!) (also, don’t tell curl I didn’t choose it). Plus, I wanted to learn how Chrome extensions work, and just thinking it through, it seemed like I could inject JavaScript into the history page, use AJAX calls to fetch the successive RSS feed files (the injected JavaScript would side-step the same-origin policy, since it runs on the page’s own origin), and save them using the HTML5 FileSystem API. But would this all work? Just because you can work it out in your head doesn’t mean it will work, especially with technologies you’re not familiar with.
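The injection half of that plan is what Chrome calls a content script: you declare, in the extension’s manifest, which pages get your JavaScript injected into them. A rough sketch of what such a manifest could look like (the history URL pattern, file name, and permissions below are my assumptions, not what I actually shipped):

```json
{
  "name": "Search History Saver",
  "version": "0.1",
  "manifest_version": 2,
  "permissions": ["unlimitedStorage"],
  "content_scripts": [{
    "matches": ["https://www.google.com/history/*"],
    "js": ["grab-history.js"]
  }]
}
```

Once `grab-history.js` is running inside the history page, its XHR requests count as same-origin, which is the whole trick.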

After reading up on it, though, it still seemed feasible, so I spun up a git repo and started writing a Chrome extension. Then I ran into a small snag: how would I make sure I didn’t download and save more files than needed? So I tried to see whether asking for an RSS feed too far into the past would just return a 404. Instead, I found that Google had somehow squashed my 29k searches into 5 XML files, instead of 30.

Oh.

Now, downloading 5 files does not sound like much work; neither does 30, in the grand scheme of things, but 30 somehow manages to be more fearsome with the mere addition of a brotherly digit. Cut down to size, I simply downloaded and saved the XML files containing essentially my entire search history, packed into a little over a megabyte of data. It was also 30 minutes until midnight, and tarrying wouldn’t do me any good (of course, the privacy policy might have already shifted when midnight appeared over the International Date Line, but never mind…) (also, I couldn’t find where Chromium actually, you know, saved the FileSystem data).

However, it might have balanced out for the best: I learned how to write basic Chrome extensions, and I got my data down, safe and sound. I suppose there’s a moral in here somewhere… oh! It’s in the title! Check your assumptions, fail fast, etc. etc.

Methinks I have to work on my fable telling.