For a while now, I’ve been keeping tabs on the progress of Drizzle, the MySQL fork, because it could be argued that it’s the most interesting – and important – project going. That’s a bit of hyperbole, of course, because relative to, say, Linux, Drizzle is a speck, visibility-wise.
But look closer, and the significance of Drizzle becomes more apparent. In challenging long-held assumptions, whether architectural or commercial, Drizzle is likely to point the way forward for a number of software projects, both open source and not. Here’s why.
In discussing the project as it was launched, I invoked Adam Bosworth’s seminal database piece, “Where have all the good databases gone.” The macro relevance was simple: databases had, Adam argued, taken one path while customers took another, and Drizzle can be viewed – to some extent – as a response to that. Not necessarily in the way that Adam meant it – although per his comment at the time, he’s fully behind Drizzle – but certainly as a break with the direction in which MySQL was heading, a direction that largely paralleled that of its larger erstwhile competitors, DB2 and Oracle.
Drizzle, on the other hand, breaks quite fundamentally with the traditional conception of a database, on both the hardware – as we’ll see – and software fronts. Regarding the latter, it aggressively reverses the 5.0-era additions of stored procedures, triggers and so on to MySQL. According to Krow, “Stored procedures are the dodos for database technology.”
But those who would dismiss Drizzle as merely a stripped-down MySQL miss the point entirely; the project is, if anything, a fundamental rethinking of what a database should be and of the context in which it is deployed. Drizzle is emphatically more than a refactoring. It is, rather, a database being built expressly for scale-out clouds running Map/Reduce-like architectures at immense scale. How to get there is still in question, but the models discussed are these:
One direction is obvious, map/reduce, the other direction is the asynchronous queues we see in most web shops. There is little talk about this right now in the blogosphere, but there is a movement toward queueing systems. Queueing systems are a very popular topic in the hallway tracks of conferences.
Drizzle might not be precisely the database envisioned by Adam four years ago, then, but it would seem to be pretty close.
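To make the asynchronous queue model mentioned in the quote above a little more concrete, here is a minimal sketch in Python. It is not Drizzle code and does not reflect how any particular web shop does it; the function name `run_write_queue` and the in-memory `store` dictionary are hypothetical stand-ins for a real queueing system and a real database. The idea is simply that the caller enqueues a write and moves on, while a background worker applies the writes in order.

```python
import queue
import threading


def run_write_queue(writes):
    """Apply (key, value) writes through a background worker.

    `store` is a hypothetical in-memory stand-in for a database table.
    """
    pending = queue.Queue()
    store = {}

    def worker():
        # Drain the queue in FIFO order until the sentinel arrives.
        while True:
            item = pending.get()
            if item is None:  # sentinel: no more work
                pending.task_done()
                break
            key, value = item
            store[key] = value
            pending.task_done()

    thread = threading.Thread(target=worker)
    thread.start()

    # The "web request" side: enqueue and return immediately.
    for write in writes:
        pending.put(write)
    pending.put(None)

    # Wait for the worker to finish so the result can be inspected.
    pending.join()
    thread.join()
    return store
```

With a single worker the writes are applied in the order they were enqueued, so a later write to the same key wins. A real deployment would put the queue on separate infrastructure and have to deal with retries and failures, which this sketch ignores.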
Aside from its popularity, MySQL is perhaps best known for its dual license development model. Here’s how I’ve explained this model in the past:
In the first model, a single entity such as MySQL is responsible for the overwhelming majority of all development on a given codebase. Anything they don’t produce themselves, they license. Very often this is practiced in conjunction with the dual-license model; because MySQL is responsible for virtually all of the development of the core code, they own or have licensed appropriately all of the involved IP. As such, they’re free to issue commercial licenses to those who cannot or choose not to comply with the terms of the open source license – the GPL, in this case.
At the risk of oversimplifying a complex model, dual licensing trades an open development model for the right to exclusively license the asset for commercial gain. MySQL’s perhaps the best known proponent of this model, but it is hardly rare.
What’s interesting about Drizzle – which is developed by MySQL employees who are in turn employed by Sun – is that it actively rejects this model in favor of an open development paradigm. How open? This open: “Today 2/3 of our development comes from outside of the developers Sun pays to work on Drizzle. Even if we [Sun] add more developers, I expect our total percentage to decrease and not increase.”
It might still be possible to maintain a dual licensing model if copyright assignment were instituted as a precondition for the acceptance of a patch. Drizzle, however, requires no such assignment, which at once throttles up the potential volume of contributions and chokes off the ability of MySQL/Sun to commercialize the asset under exclusive terms.
And yet Sun is actively funding the development, meaning that (presumably) they see commercial opportunities therein. Linux and countless other projects demonstrate quite adequately that exclusivity is not the sole or even primary path towards monetization, but this does represent a departure – and a significant one – from MySQL’s model.
As such, it bears watching.
For Better or Worse, Forking
Lost in the discussion of Drizzle’s technical assumptions and architecture is the potential malleability of the project. While we sat on a panel together at OSCON this past July, Brian previewed some of the thinking on forking that he has since committed to the page here:
I see forks as a positive development, they show potential ways we can evolve. Not all evolutionary paths are successful, but it makes us stronger to see where they go. I expect long term for groups to make distributions around Drizzle, I don’t know that we will ever do that.
Again, this is a significant departure from the conventional wisdom regarding open development, which considers forks inevitably toxic.
Granted, the current portfolio of development tools – such as the Bazaar/Launchpad combination used to construct Drizzle – permits levels of experimentation and fragmentation that would have derailed more centrally managed codebases in the past. Allow further that this level of experimentation can be enormously beneficial, much as evolution uses proliferation and specialization as a means of natural selection.
Still, it will be interesting to see whether or not the Drizzle community can sustain a relatively diffuse level of development, particularly if projects multiply and diverge faster than the core development community can adapt.
The core hardware assumptions for Drizzle are both aggressive and not. In no particular order, a few of the baseline assumptions:
- N > 1
I don’t have much to say about the 64-bit assumption except that I agree. Likewise, designing towards multi-core seems like a no-brainer given that the laptop I’m writing this from has two cores, though the point about thinking bigger is well taken:
Right now adoption is at the 16 core point, which means that if you are developing software today, you need to be thinking about multiples of 16. I keep asking myself “how will this work with 256 cores”.
Regarding the SSDs, this is the Drizzle view:
SSD is here, but it is not here in the sizes needed. What I expect us to do is make use of SSD as a secondary cache, and not look at it as the primary at rest storage. I see a lot of databases sitting in the 20gig to 100gig range. The Library of Congress is 26 terabytes. I expect more scale up so systems will be growing faster in size. SSD is the new hard drive, and fixed disks are tape.
Again, not much to question here, but I wonder if there aren’t opportunities to leverage Flash drives more directly in conjunction with other media, as the Fishworks guys are doing with their hybrid storage pools (see Mike Shapiro and me discussing that here – just ignore the buses going by). Storage infrastructure is intrinsically different from database infrastructure, it’s true, but the opportunity to pool different storage media using a ZFS-like filesystem to maximize collective performance might still be relevant even with the higher I/O needs.
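For readers who want the “SSD as a secondary cache” idea in something more tangible than prose, here is a rough sketch in Python. It is an illustration under my own assumptions, not a description of Drizzle’s design or of the Fishworks hybrid storage pools: the class name `TieredCache`, the slot counts, and the plain dictionaries standing in for RAM, SSD and disk are all hypothetical. The point is only the shape of the lookup: check the fastest tier first, fall back to the next, and demote evicted entries one tier down rather than discarding them.

```python
from collections import OrderedDict


class TieredCache:
    """Two cache tiers (RAM, then SSD) in front of a slow backing store.

    All three tiers are plain in-memory mappings here; the names only
    indicate which medium each tier is meant to represent.
    """

    def __init__(self, ram_slots, ssd_slots, disk):
        self.ram = OrderedDict()   # fastest, smallest tier (LRU order)
        self.ssd = OrderedDict()   # secondary cache (LRU order)
        self.ram_slots = ram_slots
        self.ssd_slots = ssd_slots
        self.disk = disk           # primary at-rest storage
        self.disk_reads = 0        # counts how often the slow tier is hit

    def get(self, key):
        if key in self.ram:
            self.ram.move_to_end(key)      # mark as most recently used
            return self.ram[key]
        if key in self.ssd:
            value = self.ssd.pop(key)      # promote from SSD back to RAM
        else:
            value = self.disk[key]         # slow path
            self.disk_reads += 1
        self._put_ram(key, value)
        return value

    def _put_ram(self, key, value):
        self.ram[key] = value
        if len(self.ram) > self.ram_slots:
            # Demote the least recently used RAM entry to the SSD tier.
            old_key, old_value = self.ram.popitem(last=False)
            self.ssd[old_key] = old_value
            if len(self.ssd) > self.ssd_slots:
                self.ssd.popitem(last=False)  # drop the oldest SSD entry
```

In this sketch an entry pushed out of RAM can still be served from the SSD tier without touching disk, which is the whole appeal of treating SSD as a cache rather than as primary storage.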
Last on the hardware side is this: “I do not assume Drizzle will live on a single machine.” I’m assuming Joe Gregorio agrees; I know I do.
Storage Engine Implications
When it’s not rethinking the feature set, the design assumptions, or the development model, Drizzle is also triggering a reappraisal of the various storage engine options. The verdict? Of a short list of InnoDB, Maria, Falcon, and PBXT, source trees will be built around only PBXT and InnoDB. For now, anyway.
Disclosure: Sun is a RedMonk customer, as was MySQL prior to their acquisition.