“As MapReduce has grown in popularity, a stack for big data systems has emerged, comprising layers of Storage, MapReduce and Query (SMAQ). SMAQ systems are typically open source, distributed, and run on commodity hardware.
In the same way the commodity LAMP stack of Linux, Apache, MySQL and PHP changed the landscape of web applications, SMAQ systems are bringing commodity big data processing to a broad audience. SMAQ systems underpin a new era of innovative data-driven products and services, in the same way that LAMP was a critical enabler for Web 2.0.”
– Edd Dumbill, “The SMAQ stack for big data”
It is not clear to me that we will have, at any point in the future, a LAMP equivalent for big data. The above notwithstanding, I think Edd’s excellent piece tacitly acknowledges this, as the components of his acronym are abstractions rather than projects.
The interest in a moniker for Big Data is understandable. LAMP, like Ajax, was nothing more or less than an enormously useful shorthand for talking about a greater-than-the-sum-of-its-parts aggregation of software that was ubiquitous within certain contexts, but cumbersome to discuss individually. If anything, the space we clumsily refer to today as Big Data has more moving pieces than did web infrastructure.
For better or for worse, the web stack was comprehensible: operating system, relational database, web server, dynamic language. This simplicity was born out of its purpose, which was serving web applications. True, implementations at scale drew in additional software, such as caching (memcached). But while the web applications that grew out of the LAMP stack were myriad in form, their infrastructure had much in common, both with other web applications and with the application design patterns that preceded them. Yes, the application infrastructure of dot-com-era startups was distinct from that of the Web 2.0 firms who inherited the scorched earth they left behind, but how different, really? A venture-backed startup in 2000 might have picked Solaris instead of Linux, BEA instead of Apache, Oracle instead of MySQL, and Java instead of PHP. What of it? How we built web applications hadn’t changed nearly as much as we thought it had.
At least until Big Data arrived.
Big Data, problematic as that term may be, is a fundamentally different animal. Sufficiently different, in fact, that it is the cause of the first material changes to the database market since the invention of the relational database. It took longer than we expected, but the data renaissance is here, and it’s real. The majority of the non-relational database technologies will wither and die and/or be absorbed, of course: there is insufficient oxygen in the market to support a dozen-plus entrants in each of the key-value store, document database, graph database, columnar database, and distributed filesystem categories. It is likely, however, that we will have one or more survivors in each because the data demands it.
Previous software stacks have been application oriented, and applications were built in more or less the same way. Today’s Big Data stacks are oriented around the data first, application second. As any Hadoop user can confirm, Hadoop’s strength at the present time is in efficiently attacking medium to large datasets, not developer accessibility or application design. What this means in practical terms is that stacks will be tailored to datatype and workload. Given the differences in datatypes – and workload context – this in turn means that there are going to be a number of different stacks.
If you look at web properties such as Facebook, LinkedIn and Twitter, this is evident. Portions of their Big Data workload are serviced by Hadoop implementations, while others are attacked by tools such as Cassandra, Voldemort or even the venerable MySQL/memcached combination. Google, likewise, uses Pregel for certain tasks, MapReduce for others, and Percolator as a complement to the latter. As we swing away from general-purpose software and hardware towards more specialized offerings, the relevance of standardized stacks will continue to decrease.
It is possible that we’ll see standardization of componentry around specific projects like Hadoop – although even that seems unlikely with the rampant proliferation of query, import and other ecosystem projects – but I do not expect to see a standard stack of software used to tackle generic Big Data problems, because there really aren’t many generic Big Data problems. Inconvenient as that might be from a vocabulary perspective.