NoSQL Live was a very different show from the others I’ve been to in recent months. It had very little in common with, for example, HadoopWorld, where the audience was largely already intimately familiar with the technology and value proposition. The NoSQL Live audience, by contrast, to judge from the questions, was mostly there to learn. With many of the usual suspects from the NoSQL world in attendance, along with substantial representation from projects like Cassandra, HBase, memcached, Riak, Voldemort and so on, the show certainly did not lack for subject matter expertise.
But the number of those generally unfamiliar with NoSQL was as surprising as it was gratifying. Gratifying because it serves as a proxy for interest: besides the experts, there were a substantial number of people there looking to get up to speed on the space. Which they certainly had an opportunity to do.
Adam Marcus, a graduate student at MIT’s Computer Science and Artificial Intelligence Laboratory, did a much better job than I could have taking notes on the show here, so I won’t rehash that. Instead, three quick takeaways from the show.
The NoSQL Term is a Problem
In his remarks to open the show, Dwight Merriman – the CEO at 10gen (the company behind MongoDB) – asserted that while the term “NoSQL” had problems, for better or for worse, the name had stuck. Which might indeed be true, and if so the projects may as well make the best of the situation, as he suggested. But if that’s true, the so-called NoSQL projects – all of them – are going to have problems.
Witness Merriman’s definition of NoSQL: no joins and light transactional semantics. Even were we to accept that definition – and even that is problematic as the support varies from project to project – we still have issues. Clearly column databases are differentiated from graph databases, just as both are differentiated from key value stores and document databases.
Currently, however, they are all referred to – marketed, even – under the blanket NoSQL. Hence some of the confusion heard from users yesterday, who were struggling to grasp why all these NoSQL tools had seemingly nothing in common with one another.
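To make the distinction concrete, here is a rough Python sketch of how the same made-up record might look under each of the four models. Everything in it (the user, the field names, the layout) is invented for illustration; plain dicts stand in for the stores, and none of it reflects any particular project's API or storage format.

```python
# Illustrative only: one hypothetical "user" record expressed in four data models.
# Plain dicts stand in for the stores; no real database API is used here.

# Key-value store: an opaque value looked up by a single key.
key_value = {"user:42": b'{"name": "Ada", "follows": [7]}'}

# Document store: the value is a structured document with nested fields.
document = {"_id": 42, "name": "Ada", "follows": [7], "address": {"city": "Boston"}}

# Column (column-family) store: row key -> column family -> column -> value.
column_family = {
    "42": {
        "profile": {"name": "Ada", "city": "Boston"},
        "follows": {"7": ""},  # the column name carries the data; the value is empty
    }
}

# Graph store: nodes plus explicit, typed edges between them.
graph = {
    "nodes": {42: {"name": "Ada"}, 7: {"name": "Grace"}},
    "edges": [(42, "FOLLOWS", 7)],
}


def followed_by(g, node_id):
    """Return the ids that node_id follows by scanning FOLLOWS edges."""
    return [dst for src, rel, dst in g["edges"] if src == node_id and rel == "FOLLOWS"]
```

The same question ("whom does user 42 follow?") is a key lookup plus client-side parsing in the first model, a field access in the second, a column scan in the third and an edge traversal in the fourth, which is roughly why the tools feel so unrelated to newcomers.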
The good news is that there is – as evidenced by this and other NoSQL events – substantial interest and traction in data storage software that is not a relational database. The bad news is that the naming is likely to become a serious problem, if it isn’t already.
Consider slide 13 of Tim Anglade’s excellent presentation embedded above. If he’s correct, and we’re just this side of Gartner’s trough of disillusionment – and I believe that’s a reasonable assertion – the NoSQL term is going to be one of the reasons for the fall. Most of the current NoSQL adopters are sufficiently up to date on developments in the data persistence space that the name is not much of an issue. The next wave of adopters is guaranteed to be less familiar with the distinctions between the project approaches and more frustrated by the inherent educational challenges therein.
I know quibbling over a name seems inane to a great many technologists out there, but you’d be surprised at how much difference a name makes in this industry. Remember what the mere application of the term Ajax did to discussion of that technical approach? Now consider if Ajax had attempted to encompass that and native client side development. That’s what NoSQL is doing at present, and it’s a problem.
MySQL is a Target
Mark Callaghan recently said:
I think that MySQL+memcached is still the default choice and I don’t think it is going away in the high-scale market.
Eric Day, a Drizzle developer, likewise told me when I spoke to him on Monday that the project is complementary to many NoSQL efforts. Clearly his new employer (and yes, more on that later) believes that, significantly contributing as it does to both NoSQL (Cassandra) and SQL (Drizzle) projects.
I think they’re right. The maturity of the MySQL ecosystem and its basic ubiquity will not easily be thrown over, if ever. That said, the commentary from Twitter’s Ryan King was really eye-opening.
It’s no secret that Twitter has been moving slowly towards Cassandra and away from MySQL. This is from an interview that King did previously with myNoSQL, describing the motivations:
We have a lot of data, the growth factor in that data is huge and the rate of growth is accelerating.
We have a system in place based on shared mysql + memcache but its quickly becoming prohibitively costly (in terms of manpower) to operate. We need a system that can grow in a more automated fashion and be highly available.
After yesterday, we can add some numbers to that. Twitter’s Cassandra infrastructure is at 45 nodes, which is handling – in parallel with the MySQL/memcached infrastructure – some 600 to 700 tweets (i.e. writes) per second (roughly 50M per day), with massive spikes (like for SXSW, for example), and nine or ten billion rows.
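As a quick sanity check on those figures, here is the arithmetic in Python. It is only a sketch: it assumes the write rate is constant across the day, which the spikes mentioned above say it is not, and the function names are mine.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def writes_per_day(rate_per_second):
    """Daily total for a constant per-second write rate."""
    return rate_per_second * SECONDS_PER_DAY


def average_rate(daily_total):
    """Average per-second rate implied by a daily total."""
    return daily_total / SECONDS_PER_DAY


low = writes_per_day(600)        # 51,840,000
high = writes_per_day(700)       # 60,480,000
avg = average_rate(50_000_000)   # about 579 writes per second
```

So a sustained 600 to 700 writes per second lands at roughly 52M to 60M per day, and 50M per day averages out to a little under 600 per second. The quoted numbers hang together as round figures.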
The MySQL infrastructure – largely thanks to a massive memcached presence, according to what we heard yesterday – was still handling this load. But much of the real pain apparently comes in manageability. The MySQL cluster could, in the words of King, “never be taken down,” as the restarts were too painful. The Cassandra nodes, meanwhile, are rebooted regularly with rolling restarts.
What does this mean? Nothing, yet. Twitter is a traffic outlier, operating at a scale that 99% of MySQL or NoSQL users will never see. But the number of high-traffic properties that are leaving MySQL-based infrastructure for NoSQL alternatives is worth monitoring, just as their original takeup of MySQL was.
NoSQL and the Cloud
One of the panels yesterday was on the subject of NoSQL in the Cloud. The panelists were Benjamin Day (consultant with Azure experience), Jonathan Ellis (Rackspace, Cassandra lead), Adam Kocoloski (Cloudant, a CouchDB vendor), and Daniel Rinehart (Allurent, a startup using AWS’ SimpleDB).
Predictably, opinions varied on the suitability of NoSQL technologies for the cloud. The vendors offering or leveraging NoSQL services in the cloud – Allurent/Cloudant/Rackspace – were more or less positive on the concept. Ellis, meanwhile, was less enthusiastic, urging workload-based deployment: elastic, transient needs to the cloud; general, sustained workloads to the datacenter.
What I was surprised to hear little about, except from a questioner, is the question of operations. For many cloud users, questions of workload or the suitability of a given technology such as NoSQL for the cloud come second. The primary concern is operational costs, or the lack thereof. Put simply: the cloud takes important operational elements and makes them someone else’s problem. This may be even more compelling with NoSQL, because the relative immaturity of the projects means that they are often suboptimally packaged. Being able to spin up a prebuilt image on AWS or Rackspace is likely to be significantly preferable to the alternative of hand-assembling all of the necessary pieces of your NoSQL infrastructure.
This is why I find services like Bradford Stephens’ Drawn to Scale interesting (again, more on them later): being able to offload the operational costs – both in dollars and learning curve – of software that can more efficiently attack large or unstructured datasets is likely to be an interesting proposition.
Whether NoSQL technologies are ideally suited to multi-tenant cloud environments, then, seems to be beside the point: they will be used there – heavily – regardless. If they’re not well suited to that, from a customer’s perspective, that’s the provider’s problem.
Anyway, thanks to the folks from 10gen, Cloudant, Hashrocket, O’Reilly, GigaOm, myNoSQL et al who put on the conference. Well worth the trip down.