RedMonk

Using Cribs to Make Sense of Your Web Traffic

“In cryptanalysis, a crib is a sample of known plaintext, or suspected plaintext; the term originated at Bletchley Park, the British codebreaking operation during World War II (WWII).” – Wikipedia

If you read the headline quickly and assumed that I’d synthesized a way to apply MTV Cribs in a fashion that made your web traffic intelligible, you’re going to be disappointed. My apologies. I’m referring, as you’ve undoubtedly gathered from the quote above, to a very different kind of crib. One which is less flashy than multi-million dollar mansions and blinged out stretch Humvees, but still interesting. At least to me.

In the world of codebreaking, a crib is a shortcut to trying to crack a given cipher – it’s cheating, essentially. Faced with a difficult encryption scheme that defied conventional analysis, codebreakers would guess at message contents in an effort to shortcut the decryption process. In World War II, as an example, one might reasonably guess that messages originating from U-boats operating in the Atlantic might contain words like “destroyer” or “convoy” or “sunk.” By comparing the anticipated content against intercepted encrypted transmissions, Bletchley Park could dramatically reduce the time needed to break codes.

This practice only works if you can successfully anticipate the message content, however. If the other side is aware of the usefulness of cribs as an attack technique – as the Germans most certainly were – it’s inevitable that they’ll educate their personnel on the subject. Instruct them to avoid common words and phrases, thus neutralizing the potential for having their cipher attacked via a crib. Which is where seeding comes in.

As the Wikipedia article discusses, when the usual cribs failed due to strict adherence to radio protocols on the part of the Germans, Bletchley Park would have the Royal Navy “seed” a particular area with mines – thus guaranteeing that transmissions containing the name of that mined bay or whatever would be subsequently sent. They manufactured themselves, in other words, a crib.

While the above is hopefully an adequate if cursory dissertation on the basics of cribs as they pertain to codebreaking, it is not an explanation of how they relate – if at all – to web log analysis. Web log analysis, after all, is a slightly less earnest – not to mention complex – endeavor than codebreaking.

But consider for a moment the trail that we all leave when we visit a webpage, say when I visit your blog. Assuming that you have a webserver package capable of logging the visit and presenting it in some readable fashion, I implicitly deliver to you a.) my IP address and b.) my pageview history. Taken individually, these are not terribly helpful. In combination, however, my IP and history can tell you quite a bit.

IPs can be fairly easily traced back to a general point of origin. If I visit your site, as an example, you might not be able to trace it to me as an individual, but you’d know that someone originating from a Comcast IP in Denver visited your site and looked at a particular page or pages.

Still fairly cryptic. But what if you have a crib? Say, you mention my name in your article, and a few hours later a user from a Comcast IP in Denver arrives at your piece from a Technorati search for “stephen o’grady.” Or maybe you IM’d me a random link that a Comcast IP from Denver visited in approximately the right timeframe. You still can’t be entirely sure that it’s me, but given that your usually anonymous visitor is a.) visiting from Denver and b.) searching for articles about me, it’s fairly likely that you’ve correctly ascertained my identity.

And what if you saw – from that same IP – a del.icio.us URI beginning http://del.icio.us/sogrady?...? Post-modernists might still claim that you can’t really know that it’s me because you can’t really know anything, but the rest of us would probably acknowledge the obvious: that’s me. My IP – which was once opaque to you – has been decrypted into plaintext: my name.
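
For the curious, here is a minimal sketch of what that matching might look like in Python. It assumes your webserver writes the common "combined" log format, and everything specific in it is made up for illustration: the function name find_crib_hits, the sample IPs (drawn from the reserved documentation ranges), the timestamps, the page paths, and the shape of the Technorati search URL. It simply flags any IP whose requested path or referrer contains one of your cribs.

```python
import re
from collections import defaultdict
from urllib.parse import unquote_plus

# Combined log format (assumed):
# ip ident user [time] "request" status bytes "referrer" "user-agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def find_crib_hits(lines, cribs):
    """Map each IP to the (time, crib, field) tuples where a crib showed up."""
    hits = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        for field in ("path", "referrer"):
            # Decode %27 and '+' so search terms compare as plain text.
            text = unquote_plus(match.group(field)).lower()
            for crib in cribs:
                if crib.lower() in text:
                    hits[match.group("ip")].append(
                        (match.group("time"), crib, field)
                    )
    return dict(hits)


# Hypothetical sample data, not taken from any real log.
SAMPLE_LOG = [
    '203.0.113.7 - - [12/Mar/2007:14:02:11 -0600] "GET /some-post HTTP/1.1" '
    '200 5120 "http://technorati.com/search/stephen+o%27grady" "Mozilla/5.0"',
    '203.0.113.7 - - [12/Mar/2007:14:05:40 -0600] "GET /about HTTP/1.1" '
    '200 2048 "http://del.icio.us/sogrady?..." "Mozilla/5.0"',
    '198.51.100.9 - - [12/Mar/2007:14:06:02 -0600] "GET /some-post HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0"',
]

CRIBS = ["stephen o'grady", "del.icio.us/sogrady"]

if __name__ == "__main__":
    for ip, found in find_crib_hits(SAMPLE_LOG, CRIBS).items():
        print(ip, found)
```

On the sample data only the first IP is flagged, once for each crib. A link you seeded yourself, say one sent over IM, could be handled the same way by adding its path to the list of cribs. None of this proves identity, of course; it just narrows the field in the way described above.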

As it happens, I’m a MyBlogLog user so if you’ve got their little bit of JavaScript enabled on your page I’ll save you the effort and tell you that I stopped by – and even give you a poorly cropped picture of me as a bonus – but the identification procedure above requires no permission and not a whole lot of intelligence.

I’m not inclined to debate the questions of whether this is good or bad, because ultimately it’s irrelevant. Whether you like it or not, if someone’s paying attention you’re announcing yourself every time you visit a webpage – even in situations where your IP is not static but rather dynamically assigned. This will be news to some of you, and obvious to others, but I thought it worth mentioning. Largely because I’m beginning to wonder if some of the sharper folks out there haven’t caught on to the value of cribs, and begun to employ them strategically to identify visitors and gain better intelligence on who’s visiting and what they’re reading.

This technique is likely to have implications for you, whether you’re looking to see if someone you’ve written about has read your piece, or whether you’re concerned about the tracks you’ve been leaving all

Categories: Trends & Observations.

  • Guy Creese

    Or a web site could send a newsletter or e-mail to you and include a web bug or personalized landing page so that when you visit the site via the included hyperlink, the system knows who you are (e.g., Prospect 44075 or Customer 12890), even though you don’t explicitly identify yourself during the visit. While a company could code such a solution, vendors such as Eloqua and Manticore Technology have been offering such pre-built packages for a number of years now.

    The method you mention is especially easy when the site readership is specialized and the potential readers are known to you. For example, I went to a small college and I’m the webmaster for my college class (www.williams75.org). Given that I know the majority of my 420 classmates, where they live, and have a lot of person-specific content (pictures, news, citations of articles), I can pretty much tell who visited when based on what they browse and where they come from, even though I don’t demand they log in.