What’s a Document?


One of the most interesting byproducts of the transition, fully underway around the world, from binary document formats to XML-based alternatives is the ability to treat the asset as a container of items rather than a discrete item itself. Both ODF and OOXML allow applications to manipulate, at a minute, granular level, the contents of assets that were previously opaque, even as their respective proponents would doubtless argue their respective superiority at that particular game.

For those of you – and there are one or two at least, I’m sure – who are not office format wonks, here’s the English translation of the above: the files that you today produce in Excel, PowerPoint, or Word can now be carved up, dynamically reassembled and presented. Annual reports can contain continually updating economic data, mortgage applications real-time interest rates, or – nearer and dearer to my heart – baseball scouting reports, moving performance data.
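For the wonks who want to see what "container of items" means in practice, here is a minimal sketch in Python. Both ODF and OOXML files are ZIP packages holding XML parts, so the standard library's `zipfile` module is enough to open one up. Everything else here is an assumption for illustration: the package is a toy built in memory, its `content.xml` is far simpler than a real spreadsheet's, and the helper names (`build_toy_package`, `list_parts`, `replace_part`) are hypothetical rather than anything from a real office suite.

```python
import io
import zipfile


def build_toy_package() -> bytes:
    """Build an in-memory ZIP that loosely imitates an ODF-style package.

    The part names mirror ODF conventions, but the XML bodies are
    illustrative stand-ins, not valid spreadsheet markup.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as pkg:
        pkg.writestr("mimetype", "application/vnd.oasis.opendocument.spreadsheet")
        pkg.writestr("content.xml", "<doc><cell name='rate'>5.25</cell></doc>")
        pkg.writestr("styles.xml", "<styles/>")
    return buf.getvalue()


def list_parts(package: bytes) -> list:
    """Return the names of the items stored inside the package."""
    with zipfile.ZipFile(io.BytesIO(package)) as pkg:
        return pkg.namelist()


def replace_part(package: bytes, name: str, new_body: str) -> bytes:
    """Copy the package, swapping the body of one part and keeping the rest."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package)) as src, zipfile.ZipFile(out, "w") as dst:
        for item in src.infolist():
            if item.filename == name:
                body = new_body.encode("utf-8")
            else:
                body = src.read(item.filename)
            dst.writestr(item.filename, body)
    return out.getvalue()


package = build_toy_package()
print(list_parts(package))

# Swap in a fresh "interest rate" without touching the other parts.
updated = replace_part(package, "content.xml", "<doc><cell name='rate'>4.75</cell></doc>")
```

The point is less the code than the shape of it: the "document" is a bundle of addressable pieces, and any one of them can be refreshed from an outside source while the rest stay put.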

Documents today can have, as IBM’s Doug Heintzman noted last Wednesday at IBM’s annual analyst event, more in common with a web page than the document you or I might have authored a few years – or a year – ago. Parts of it might be static, parts of it might be dynamic, but each of those parts might arrive from separate, external sources of record. The days of static documentation are drawing to a close, thanks to innovation – finally – in an area that should have seen it years ago.

While we at RedMonk are so far out on the bleeding edge that we can’t even see the mainstream when it comes to our own work habits (though not our coverage, hopefully), it’s nevertheless worth noting that I really don’t create documents at this point. Customer, expense and other operational spreadsheets are kept in Google Docs, and frankly they’re more webpage – even database – than they are spreadsheet at this point. At no point in their lifecycle, generally, are they transmitted as ODF, OOXML, or PDF: I can’t honestly remember the last time I exported one for the purposes of sending. When we need to collaborate with an external party, we simply share the asset. Even the pieces I author for this space are documents only in a nominal sense. Each is composed in emacs, then pasted to WordPress. There, it is reforged as an entirely different asset, pulling in pictures, videos, or other embedded assets, all while collecting comments, trackbacks, and revisions to become something new and distinct.

Is that a document? I’d argue not.

The closest I come to creating documents, at least in the traditional sense, is in Impress – the OpenOffice.org PowerPoint alternative. This I use to create the presentations I deliver at conferences, customer events and the like. The presentations tend to be discrete, unevolving assets that I “share” simply by posting them to the web. We do reuse presentations (occasionally) and slides (frequently) within RedMonk, but for the most part presentations are not living documents in the way that a customer spreadsheet is.

But that’s the exception to the rule, which is living assets, and it’s driven primarily by technical limitations. Limitations that I hope are removed. Soon.

For us then, settling on the definition of a “document” is problematic, because it reflects a lifecycle and a lifespan that are, at best, antiquated. Much, if not most, of our output is collaborative, rather than singularly authored, and most of it has a life expectancy far beyond any of the Word documents I authored in my capacity as a systems integrator. Particularly the content that lives on the web. A document, for me, has become a snapshot of the real, living asset, rather than an asset in and of itself. If our Google Docs spreadsheet is the Platonic ideal, the ODF capture of it is merely the shadow on the wall.

Which begs the question: are we creating documents, really, anymore? What does document mean in a networked, composable, and programmatically manipulable age? Or perhaps your natural inclination might be – like mine – to view the above as splitting hairs, a pointless, unresolvable debate of semantics.

Whatever my natural inclination might be towards such questions, however, my considered opinion is that the question matters. Maybe a lot.

Not to me, personally. First, because as mentioned, I live on the cutting edge and I’m not terribly relevant relative to the average office user of today, or maybe three to four years out. But more because I’m in a position to realize how documents are evolving, and what they might be capable of if we can get creative. The terminology is not going to have much bearing on what I think of a given technology.

Not everyone is so lucky, however.

As I see it, the danger in continuing to call the content we’ll be creating – using a rapidly evolving set of tools – over the next few years “documents” is that it will stunt the imagination. An example: when I was approached, years ago, about attending the ODF Summit, I had to explain in detail why I believed that messaging (email) and collaboration (wiki) vendors should be included in the discussion. So tight was the focus on an “office productivity” format, it was non-obvious even to some ODF experts that wikis might, at some point, become consumers and producers of ODF.

The term document, in my view, is a legacy term, and as such, it brings with it preconceived notions of what a document is, should be, and can be. My concern, then, is that these preconceived notions end up predetermining the perceptions of what the assets are capable of.

To be sure, we should not – must not – try to reframe the traditional definition of a document. For those mainstream folks that will make up the bulk of the user population for the foreseeable future, their definition of what a document is is set, and it would be folly to try and change this.

But neither should we let that definition carry forward, tainting more capable formats with the legacy of its limited capabilities. No, we need a new definition or term, I believe. Something more accurately descriptive, and yet non-threatening. Database? Too intimidating, too misleading. Web page? Likewise. Container? I don’t love it.

So I don’t have the replacement term worked out yet: sue me. That doesn’t change the fact, in my opinion, that we’ll need one.

And if the format advocates have their way, probably soon.


  1. I’ve been investigating the term “document” for the last few months, as I’ve been writing the forthcoming O’Reilly book on CouchDB, which is ostensibly a document database.

    While its use of the term is inspired by Lotus Notes (which itself produces the “post-documents” that you are writing about here) I think the term is quite helpful in the technical context. In distinguishing between what tools like CouchDB, Notes, and Google Docs manage, and the normalized data patterns we deal with in the relational database world, the notion of a document as a relatively stand-alone piece of data holds some water.

    As you mention – documents, like web pages, are getting more intertwined with dynamic data sources, but they are still much closer to their archaic paper and PDF counterparts, than the rows and tables of a SQL store. The singular feature we’ve picked out, in writing the book, is that documents are meant to (more or less) make sense on their own. Documents carry at least some of their own context. This is why documents (like web pages) can travel well. Taking them away from their original location does not strip the meaning from them. This does not hold for a SQL row.

  2. One aspect of ‘legacy-style’ documents that you did not touch on is the common need for accountability and tracking as the content develops. It’s not unusual to need to be able to see a history: who changed what and when, and what did the content look like after they had done so.

    Different users and organizations have different requirements in this space. Those who place very high priority on such tracking will inevitably continue to use static document formats into the foreseeable future, just as legal folks today will send around fully rasterised PDFs of important documents in the belief that those are more reliable and trustworthy.

  3. For our internal programming projects, nearly all of our “documentation” these days is wiki pages. Previously (3 years ago) it was nearly all Word docs. Easy to see which has a better chance of being updated as requirements change.

    We produce lots of “documentation”, but little of it in discrete “documents”.

  4. Let us not allow our applications to define our language.

    Regular folk are going to need a word that refers to a particular collection of words that are gathered to a particular purpose, and word is “Document”. Don’t take that away just because applications are changing.

    And, what is changing? The file and data formats. So, say “file and data formats”.

    The whole mistake was assuming that the terms “document” and “document file” was synonymous. There not.

    So, our data formats for documents are changing and will surely end up as some distributed collection of dynamic content. What you’re looking for is a new word for a collection that represents such a distributed dynamic document.

    So, these “DynaDocs” or “DistDocs” or “CollectoDocs” are just a new data format. Let’s not have another Buzz-Word Bandwagon that looks down upon those of us who continue to use words according to their traditional meaning.

  5. Well, I wrote that too fast. Please forgive the various grammatical errors and such.

  6. a document only has one role- attestation. a document is a dead thing that captures an agreement at a point in time. a document is a mere snapshot.

    the idea of “active documents” leaves me cold.

  7. […] be able to work on the content in place on the blog; we’re still struggling to understand what a document is in an age of web pages, remember. But it can’t be debated that the embedded mechanisms do […]

  8. […] becoming obsolete in a variety of settings. Here’s how I’ve described the transition in the past: Documents today can have, as IBM’s Doug Heintzman noted last Wednesday at IBM’s annual analyst […]
