tecosystems

Heterogeneous Data Layers? Check. What’s Next?


For about a year now, I’ve been making the rounds with vendors arguing a very simple point: that the traditional approach to persistence – cram everything into a relational database – was not appropriate for a substantial number of development challenges. I’m certainly not claiming credit for this insight; in many respects, I was merely the messenger for the developers that I was working with, who were desperately canvassing the field of specialized data storage and retrieval products in search of something – anything – that would meet their needs.

In the end, the majority of the relational critics did de-emphasize the relational store in favor of alternative or hybrid approaches; some even wrote their own data engines from scratch. It’s not that they – or I – would contend that the relational database doesn’t have a place of importance; it does, and will continue to. The argument, rather, was that the relational store shouldn’t be the only means of storing and manipulating data. The data layer, in other words, was bound to become more heterogeneous than it is today. Seems obvious, I know. But trust me, it wasn’t to lots of folks – and there are many who still don’t buy it.

But the problem was acute enough, and felt by enough high-profile developers, that I expected a response sooner or later. Probably later, if the relational database vendors had anything to do with it. And while there’ve been glimmers here and there that indicate that others perceive the nature of the problem and the opportunity it represents – Oracle picking up Sleepycat, for example – there’s been no sea change from an architectural perspective.

Is that poised to change, however? The folks from O’Reilly – whom I’ve discussed this problem with previously, in an interchange with Nat – seem to think so. Tim O’Reilly just ran a terrific series of database-focused entries entitled “database war stories”; you can go get them here (1 (Second Life), 2 (Bloglines / memeorandum), 3 (Flickr), 4 (NASA), 5 (craigslist)). While it’s difficult to draw any general conclusions from the feedback, as the approaches taken differ significantly, it’s clear that heterogeneity is the rule, not the exception.
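To make “heterogeneous” concrete, here is a minimal sketch of one simple and common arrangement: bulk bytes on the file system, with searchable metadata in a small relational store. Everything in it is an assumption for illustration only: the HybridStore class, the table layout, and the choice of SQLite and SHA-256 content addressing are hypothetical, and none of it is taken from the war stories linked above.

```python
import hashlib
import sqlite3
from pathlib import Path
from typing import List


class HybridStore:
    """Hypothetical two-part data layer: files for bulk data, SQLite for metadata."""

    def __init__(self, root: Path):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        # The relational side only holds what we want to query on.
        self.db = sqlite3.connect(str(self.root / "metadata.db"))
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS items ("
            "digest TEXT PRIMARY KEY, name TEXT NOT NULL, size INTEGER NOT NULL)"
        )

    def put(self, name: str, data: bytes) -> str:
        """Write the payload to a file and record its metadata; return its key."""
        digest = hashlib.sha256(data).hexdigest()
        (self.root / digest).write_bytes(data)
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT OR REPLACE INTO items VALUES (?, ?, ?)",
                (digest, name, len(data)),
            )
        return digest

    def get(self, digest: str) -> bytes:
        """Read the payload straight from the file system."""
        return (self.root / digest).read_bytes()

    def find(self, name: str) -> List[str]:
        """Look up payload keys by name using the relational side."""
        rows = self.db.execute(
            "SELECT digest FROM items WHERE name = ? ORDER BY digest", (name,)
        )
        return [row[0] for row in rows]
```

The point of the split is that each store does what it is good at: the database never sees the large payloads, and the file system never has to answer a query. Note that the file write and the metadata insert are not atomic together, which is exactly the kind of trade-off a hybrid layer forces you to think about.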

While all of the above is interesting, none of it really surprised me, because those stories are similar to ones I’ve heard before. Based on stories like the above, I think we’ll be seeing more and more hybrid data layers employed to address less traditional workloads, but the question that’s now occupying me is what’s next? Most of the approaches described above – leveraging distributed file systems, flat files, and so on – are not revolutionary, but harken back to mainframe era approaches. Is there room for anything new? I think so, and GData might be the first indication of what that “new” thing looks like.

Late in ’04, Adam Bosworth penned a piece called “Where Have All the Good Databases Gone?” He began it by saying:

About five years ago I started to notice an odd thing. The products that the database vendors were building had less and less to do with what the customers wanted. This is not just an artifact of talking to enterprise customers while at BEA. Google itself (and I’d bet a lot Yahoo too) have similar needs to the ones Federal Express or Morgan Stanley or Ford or others described, quite eloquently to me. So, what is this growing disconnect?

In that entry, Bosworth goes on to describe the three major problems not being adequately solved by traditional commercial databases: 1) Dynamic schema, 2) Dynamic partitioning, and 3) Modern indexing.

Why is this relevant? Because like Dare Obasanjo and Jeremy Zawodny, it seems to me that GData is almost certainly the product of Bosworth’s efforts. While the above entry details some of the problems, it doesn’t go into much depth as to what the solution to those problems might look like. Fortunately, he went into significant detail during a talk he gave at the MySQL Users Conference last year, but to my way of thinking it is this presentation (PowerPoint warning) that actually spells out his vision for a wire protocol, based on standards, that is:

  • Massively Scalable
  • Fully federated
  • Completely loosely coupled
  • Easy to implement
  • Extend existing web protocols/formats

In case you’re following along at home, GData would appear to be a good start towards those goals. What comes next? It’s too early to say, but when I begin to think about repositories and engines with GData APIs, as Zawodny proposes for MySQL and the Lucene folks might build, things get very interesting.

Why? Because it could turn the web into a massive, writable, repository. It could enable a whole new generation of repository technologies.
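As a rough illustration of what “writable” means here: GData builds on Atom and plain HTTP, so adding an item to a repository amounts to POSTing an Atom entry to a feed URL. The sketch below only constructs such a request and does not send it. The function names and the feed URL are hypothetical, and the entry is deliberately minimal rather than a complete, validated Atom document (a real one would also carry elements such as an id and an author).

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def build_entry(title: str, content: str) -> bytes:
    """Serialize a minimal Atom entry (title and text content only)."""
    ET.register_namespace("", ATOM_NS)  # emit Atom as the default namespace
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    body = ET.SubElement(entry, f"{{{ATOM_NS}}}content", {"type": "text"})
    body.text = content
    return ET.tostring(entry, encoding="utf-8")


def build_insert_request(feed_url: str, title: str, content: str) -> urllib.request.Request:
    """Prepare (but do not send) an HTTP POST that would add an entry to a feed."""
    return urllib.request.Request(
        feed_url,
        data=build_entry(title, content),
        headers={"Content-Type": "application/atom+xml"},
        method="POST",
    )


# Hypothetical feed URL, for illustration only.
request = build_insert_request(
    "http://example.com/feeds/notes", "First note", "Hello, repository."
)
```

Nothing in that request is specific to one vendor’s backend, which is what makes the idea of a MySQL or Lucene engine speaking the same protocol plausible.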

Either way, I’m a believer that the next couple of years should be interesting to follow; the current experimentation in data layer approaches could be nothing compared to what we see when GData and similar protocols become more common.

3 comments

  1. When using a new term like “heterogeneous data layers” it is useful to define the term. All I can infer is that the term means “non-relational”.

    Abandonment of the relational model usually means either abandonment of expectations (“we don’t have enough money to do this right”, “we don’t need perfection – 80% correct will do.”) or lack of understanding of the relational model (“Normalization? We don’t need no stinkin’ normalization!”). In some applications (e.g., CAD) an OODB may be more appropriate. Even in those cases there is inevitably a follow-on requirement for an interface to a relational database, and so much of the work defining a relational database must still be done.

    I want to ask: for these non-relational “data layers” where will the semantics (meaning) of the data be defined? How will relationships between data elements be defined/enforced? Will you use pointers or unique keywords to point to the children of a parent?

    Until someone produces an artificial intelligence product that can think, has multiple domain knowledge and can negotiate meaning with other similar entities, then there will be no database “silver bullet”.

    Meanwhile we know what’s available today: relational, hierarchical, network, and object-oriented databases. I don’t see anything revolutionary around the corner. But vendors will enhance their products to address the changing markets.

  2. I read it as heterogeneous, as in employing more than one storage system in the data layer. A simple, common case is a file system for bulk storage and a database for metadata.

  3. G. Roper: thanks for the comment. apparently i wasn’t clear, but fortunately Andy’s explained my point better than i could. when i talk about heterogeneous data layers, i’m not trying to define some new term – i’m merely talking about the trend of application developers to employ multiple persistence approaches side by side, as it were. relational, as you point out, is one of them more often than not for the reasons that you mention.

    but it’s equally true that for some problems – partitioning, e.g. – relational models are often less than ideal.

    are they going away? of course not. are they still hugely relevant? you bet.

    but i think we’re beginning to see a relaxation in the “relational = the one and only acceptable approach.” when you talk about databases, incidentally, i would also be sure to include filesystems as another mechanism.

    Andy: thanks very much for clarifying; spot on.
