XML and Office Formats: There’s More to it Than Accessibility


While speaking with the folks from Scalix this afternoon, I happened to reiterate a point I’ve been discussing more and more of late: the programmability of the Open Document Format (ODF). The context for mentioning it was the fact that Scalix has the ability to dynamically transform certain types of documents – RTF in the example I saw – on the fly to other formats depending on how and where they are viewed.

It’s precisely that type of transform that’s being lost in the shuffle, in my opinion, in the fascination with the politics [1] behind the State of Massachusetts’ decision to mandate ODF as a standard. What’s brand new with ODF and its Microsoft counterpart, the Office Open XML Formats, is the ability to manipulate documents programmatically. With these formats, we could see far more intelligent messaging, workflow, and similar systems, because they’ll be able to deconstruct documents at very granular levels.

To many of you, no doubt, that’s just a bunch of meaningless technical babble, so here’s an example that may make it more real: you know how organizations have been bitten in the past by sending out documents that, unbeknownst to them, included all sorts of unflattering or embarrassing comments and markup? With the previous generation of binary office formats, there wasn’t much you could do, because the documents were opaque to any application besides the ones that produced them. The new XML-based formats, on the other hand, are essentially zip containers with a bunch of XML components in them. What if, before you emailed someone external to the system, the messaging server could deconstruct the document into its various pieces, remove any comments or markup, then reconstruct the document without touching the content? That’s the kind of document manipulation made possible by the transition from binary to XML formats.
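To make the comment-stripping scenario concrete, here is a minimal sketch in Python of what such a server-side step might look like for an ODF text document. It assumes that comments live in `office:annotation` elements inside the package's `content.xml`, and it ignores tracked changes entirely. The function names `strip_annotations` and `sanitize_document` are hypothetical, chosen only for illustration. Note also that Python's ElementTree may rewrite namespace prefixes it has not been told about, so a production tool would likely want a prefix-preserving parser.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs used by ODF for office-level and text-level elements.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
ANNOTATION_TAG = f"{{{OFFICE_NS}}}annotation"

# Keep the familiar prefixes for these two namespaces when serializing.
ET.register_namespace("office", OFFICE_NS)
ET.register_namespace("text", TEXT_NS)


def _remove_keep_tail(parent, child):
    """Remove child from parent, keeping the text that follows the child."""
    index = list(parent).index(child)
    tail = child.tail or ""
    if index == 0:
        parent.text = (parent.text or "") + tail
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(child)


def strip_annotations(xml_bytes):
    """Return content.xml bytes with every office:annotation removed."""
    root = ET.fromstring(xml_bytes)
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == ANNOTATION_TAG:
                _remove_keep_tail(parent, child)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def sanitize_document(data):
    """Unpack an ODF package, clean content.xml, and repack everything else."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, zipfile.ZipFile(out, "w") as dst:
        # Entries are copied in their original order, so "mimetype" stays first.
        for info in src.infolist():
            payload = src.read(info.filename)
            if info.filename == "content.xml":
                payload = strip_annotations(payload)
            entry = zipfile.ZipInfo(info.filename, date_time=info.date_time)
            entry.compress_type = info.compress_type
            dst.writestr(entry, payload)
    return out.getvalue()
```

A messaging server could, under these assumptions, call `sanitize_document(attachment_bytes)` on each outbound attachment and send the returned bytes instead. Every part of the package other than `content.xml` is copied through byte for byte, which is the "without touching the content" property described above.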

That’s just one example, of course; there are likely hundreds if not thousands of other such scenarios buried within corporate workflow and routing procedures. This is, in part, what ODF advocates such as Gary Edwards and Sam Hiser mean when they talk about the format as a “Universal Transformation Layer.” What’s been interesting, however, is that in the probably half a dozen conversations I’ve had with messaging/workflow/etc providers over the past couple of weeks, I’ve heard a lot of enthusiasm for the ODF itself, but little in the way of plans – NDA or otherwise – to fully leverage the XML nature of the format.

Sooner rather than later, however, I expect one or more of the messaging/workflow/etc providers to actively embrace the ODF – and it would be foolish to believe that Microsoft doesn’t have plans in this regard for its MSXML format. The interesting question for many vendors will be – how might the new XML formats overlap – or not – with electronic forms?

Either way, we should expect to see a lot more pieces like this. While the promised longevity of XML-based documents may be exciting to governments and libraries, developers are likely to be far more interested in what the formats allow them to do now that they couldn’t do before than in whether or not the documents will be readable in 50 years. Fortunately for them, the implications of the new formats are profound.

[1] The term is used here in both the figurative and literal senses. One commenter I spoke with the other day suggested that the ODF might play a role in the much anticipated bid for the Presidency by current MA Governor Mitt Romney.


  1. Even the COM interfaces only allow you to get at what the COM API developer wants you to — which might not be everything.

    And besides, COM is Hard, and XML is Easy, right? 😉

  2. Anthony: i should have added that qualifier i suppose, though my contention would be with Fraxas, i.e. that the transformations now made possible – on non-COM platforms, to boot – are far more extensive.

    on the XML front, couldn't agree more. i'll be very interested to see what kind of XML is generated in Office 12. my experience with past exports to XML or even HTML has been poor to quite poor.

    Fraxas: great point, Fraxas. not so sure about the XML is easy bit 😉

  3. You’ve been able to do this for a while using COM.

    The XML Word generates is particularly nasty for parsing. We’ve used InfoPath from MS Office which is cleaner but more targeted at form generation.

  4. Poor to quite poor? MSXML breaks the most basic promise of XML. There is no transformation capability short of reverse engineering the software dependent binary key that can be found in the header of every MSXML file. MSXML might be wonderful for Microsoft software stacks, but it's useless to the rest of the world.

    The binary key makes MSXML interoperability-averse, and nearly useless in an SOA – ESB environment.

    One of the things that makes ODF such an important innovation is that it is entirely software independent. This also means that the "intelligence" or information about the file is also contained and carried with the file. With software dependencies, critical information about the file remains within application and platform bound software constructs.

    Yes, you can insert a binary blob anywhere you please in ODF. XML is after all meant to be extensible. But the moment you do that, you threaten the transformational qualities of the file format. To preserve the fidelity of your transformation qualities, information about these extensions should be placed in the metadata.xml container of the file. This enables applications, developers and information managers to figure out what to do with the extended aspects of your file.

    The expansion of the ODF metadata structure is now a key issue before the ODF TC. The realization is that everyone's life would be much easier, and the quality of transformation fidelity kept as clean as possible, if commonly used extensions and methods were standardized in ODF. By expanding the metadata capabilities in a structured and standardized way, proprietary file formats will have access to a new, easy to adapt, measure of interoperability with other file formats, and with ODF. This interoperability is gained by simply grafting on the ODF metadata layer.

    By slapping a hard-coded software dependency into every MSXML header, Microsoft breaks the most fundamental promise of XML, and does so with 100% of their implementations. With ODF, compromises to transformation fidelity are the exception. With MSXML, these zero tolerance compromises are the rule. What are the odds, though, that Microsoft would at least consider a common metadata model they could share with ODF, and important information stacks like those provided by Adobe, Lotus Notes, and WordPerfect?

    ODF was designed to serve three purposes. One is that of a structured XML file format for Office Productivity environments.

    The second is perhaps far more important. ODF was also designed to be used as a universal transformation layer. It's very useful as a means of shuttling information between legacy systems, desktops, and emerging Web 2.0 systems. In this respect, ODF is targeted to become the life blood of every SOA effort.

    The third is that ODF has a better than excellent shot at becoming the Open Internet successor to the HTML – XHTML legacy. Simply put, ODF is a wrapper for the Open XML technologies pouring out of the W3C. Unlike XHTML, which expands from the rather confined and limited browser space, ODF comes into the Open Internet net space carrying the load of desktop productivity environments and legacy information systems.

    What we are watching unfold at breathtaking speed is the emergence and recognition that ODF has what it takes to become the Open Internet's universal information layer. One able to handle the complex demands of connecting and moving legacy information systems and productivity constructs into the compound content – presentation – metadata layers of next generation collaborative computing.

    Why ODF? Well, for one thing it's got exactly the right blood lines. The bridge to the past has been built. ODF as a universal transformation layer does work, and works as promised. ODF is self contained and entirely application independent, with an expanding intelligence and awareness metadata model geared to run the life of the Open Internet and beyond. ODF as a desktop productivity layer has five years of real world road testing and improvement, and is now making its way into many different applications and platforms. ODF development and processing libraries are now starting to show up. (The upcoming KDE – KOffice leap to Windows will rock the world in terms of developer tools and processing libraries). ODF is managed by a multi vendor open standards group, and is now seeking recognition from multiple standards organizations. The ODF copyright is guaranteed to be open forever and without patent encumbrances, permissions or other restrictions.

    And the bridge from XML to RDF and the Semantic Web is also well underway.

    If anyone else out there has a candidate that can match these blood lines, speak now or forever hold your peace. If and when the premier AJAX engine providers such as Google, JotSpot, Zimbra, and Amazon embrace ODF at the engine level, this race is going to be over before it even gets started. Which is to say; don't look at the desktop. Look at the Open Internet. ODF has arrived. Knock knock Google.


  5. Stephen: It is because of what you talk right now that I said earlier that Oracle's support for ODF was important (Oracle being a big representative of Document Management solutions).
    Gary: I had that fear with Microsoft's XML format. Do you have any additional info about Microsoft's format and that "Software Dependent Binary Key"?
