What follows is something like a live blog, based on comments from Matthew Wall and Simon Willison of The Guardian at the NoSQL EU conference in London today.
Wall kicked off the talk with a question about NoSQL: is it a good name for the phenomenon? He says not really, pointing out the absurdity of calling SQLite and MySQL “old world databases” as opposed to “new world” key-value stores.
[This point resonates strongly with RedMonk thinking. Stephen and I have both been wary of reductionist approaches to defining NoSQL - we feel Hadoop style Big Data for example should be thought of as a related trend]
Where is The Guardian today? It’s a modern, information-driven web site built around tags and feeds.
“It’s a traditional three-tier web app, with a large Oracle database at the center of the world. People might have thought we’re cooler than that, but we’re not.”
“The Guardian took the decision to stick with the traditional relational model 5 years ago. The kind of tools we’re beginning to use weren’t as mature back then. A key reason for sticking with Oracle was the maturity of the surrounding tools ecosystem (performance management and optimisation, backup) and the available skills.
SQL has worked well for the paper. SQL is great. We can do cool stuff with it. At scale.”
Searching one tag is ok, but what about two? What does it do to the database?
“Related content” was 40% of the Guardian’s app load, so the team used a search engine instead. The search engine approach, using Apache Solr, worked well, but scale issues were still likely to become a problem.
Willison suggested the Guardian stick a massive memcached in front instead.
It worked. But what about throwing more resources at Oracle instead?
“We wanted to avoid Oracle RAC because it’s really expensive, but we want to scale out”.
[Oracle RAC is the database giant's clustering technology.]
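The memcached-in-front idea mentioned above is commonly implemented as a cache-aside lookup: check the cache first, and only hit the database on a miss. The sketch below is a minimal illustration, not the Guardian’s code. `FakeCache` is a hypothetical in-memory stand-in for a memcached client, and `slow_related_content` is an invented placeholder for an expensive database query.

```python
import time


class FakeCache:
    """Hypothetical in-memory stand-in for a memcached client."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Entry has expired: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)


db_calls = 0  # counts how often the "database" is actually queried


def slow_related_content(tag):
    """Invented placeholder for an expensive relational query."""
    global db_calls
    db_calls += 1
    return [f"story about {tag} #{i}" for i in range(3)]


def related_content(cache, tag, ttl_seconds=60):
    """Cache-aside: serve from the cache, fall back to the database on a miss."""
    key = f"related:{tag}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    result = slow_related_content(tag)
    cache.set(key, result, ttl_seconds)
    return result


cache = FakeCache()
first = related_content(cache, "environment")   # miss: queries the database
second = related_content(cache, "environment")  # hit: served from the cache
```

The second call never reaches the database, which is the whole point when one component accounts for a large share of the load.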
The Guardian’s business drivers: linked data and social networks. There is all sorts of information out there, and we need to engage with it. We can’t just broadcast the news…
The Guardian’s editor called for the organisation to Mutualise the News.
“We’re changing the platform because of the business change. New technologies: we have a real need to use them… blurring the line between journalists and readers.”
“Journalism is becoming the curation of all the world’s information”.
[Note: Google's automated curation seems to be winning at this point... which explains why the Guardian is responding in the way it is.]
What happens with API access, which drives, for example, tag proliferation, which in turn dramatically increases load on the database?
“Apache Solr is like a database, it works like one for us”
Fields can be multi-value. One piece of content with five tags can be stored in one field. Most important is that Solr offers the ability to facet the content. Apply it *like* a tag…
For example, take an editor’s star rating. We can facet on that for free, and just jump to all the three-star albums. Facets can be combined much more quickly than in a relational database.
With Solr we can perform complex queries, filter by facets.
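As a rough illustration of what such a faceted, filtered query looks like on the wire, the sketch below builds a Solr select URL using the standard `q`, `fq`, `facet` and `facet.field` request parameters. The host, the core name `content`, and the field names `tag` and `star_rating` are assumptions made for the example; the talk did not show the Guardian’s actual schema.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_facet_query(base_url, text, filters, facet_fields, rows=10):
    """Build a Solr /select URL with filter queries and field facets."""
    params = [
        ("q", text),
        ("rows", str(rows)),
        ("wt", "json"),
        ("facet", "true"),
    ]
    # Each filter becomes its own fq parameter, so filters combine as AND.
    for field, value in filters.items():
        params.append(("fq", f'{field}:"{value}"'))
    # Ask Solr to return value counts for each of these fields.
    for field in facet_fields:
        params.append(("facet.field", field))
    return f"{base_url}/select?{urlencode(params)}"


url = build_facet_query(
    "http://localhost:8983/solr/content",  # hypothetical host and core
    "*:*",
    {"tag": "music/albums", "star_rating": "3"},  # hypothetical fields
    ["tag", "star_rating"],
)

# Decode the query string again to inspect what would be sent.
sent = parse_qs(urlsplit(url).query)
print(sent["fq"])
```

Nothing is sent over the network here; the URL is only constructed and decoded so the parameters can be inspected.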
“On our data set, most queries are about the same cost. No transactions.”
With Solr, schema design is very important: the schemas are more flexible and fuzzy than relational ones.
This is about getting data out of the system: powering the Guardian’s iPad app, site components and editors’ tools off the API, with far more to follow. But what about getting data in?
The Guardian has also built a simple REST/HTTP framework, used for example for sucking in live football scores, i.e. apps that don’t affect the data store.
At this point the talk sped up dramatically. Willison talks a lot faster than Wall. Never mind the high-level stuff: if you’re a real dork I recommend you go straight to the source and check out Simon’s slides from the Redis workshop at NoSQL EU.
NoSQL for journalism
“I am working at the Guardian because I am interested in the opportunity to build rapid prototypes that go live: apps that live for two or three days. My interest is how NoSQL can help support journalism.”
Rapid prototyping needs things that scale down as well as up and that handle massive spikes (if you’re on the front page). The quickest way to do lookups was to use Redis.
Version 1 of the Guardian’s Investigate Your MP’s Expenses app was not Redis-enabled.
The initial application generated 468k rows, randomised, every time someone hit the button!
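To see why randomising every row per click hurts, compare it with picking a single random member directly. The sketch below uses a plain Python list as a stand-in for the data; Redis offers a comparable primitive in its `SRANDMEMBER` command, which returns a random member of a set. The post does not say exactly how the later version used Redis, so treat this as an illustration of the general idea, with invented names and a much smaller data set.

```python
import random

# Stand-in for the page ids; the real app reportedly had about 468k rows.
page_ids = list(range(1, 1001))


def pick_by_sorting(ids, rng):
    """Give every row a random sort key, sort, take the first.

    This touches all rows on every click.
    """
    return sorted(ids, key=lambda _: rng.random())[0]


def pick_directly(ids, rng):
    """Pick one member at random without touching the others."""
    return ids[rng.randrange(len(ids))]


rng = random.Random(42)
a = pick_by_sorting(page_ids, rng)
b = pick_directly(page_ids, rng)
```

Both functions return a valid random page, but the first does work proportional to the whole table on each request, while the second does a single lookup.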
Guardian Zeitgeist, meanwhile, doesn’t use Redis. The app attempts to highlight stories on the Guardian that are interesting, judged by the amount of conversation about them on social networks. It looks for peaks, i.e. a page on the Guardian’s Environment section that gets more traffic than normal.
So use message queues and cron jobs: pull data, put it on a task queue, then calculate hotness. Feed the results into Big Table, running on Google AppEngine, which is not great at complex queries but good at simple select and sort.
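The talk did not spell out how hotness is calculated. One simple way to capture “more traffic than normal” is to divide a page’s recent hits by its own historical average, so a quiet page that suddenly spikes outranks a busy page having an ordinary day. The formula, the function name and the sample figures below are all assumptions made for illustration.

```python
from statistics import mean


def hotness(recent_hits, history):
    """Ratio of recent traffic to the page's own historical average."""
    baseline = mean(history) if history else 0
    if baseline == 0:
        return 0.0  # no history to compare against
    return recent_hits / baseline


# Invented sample data: one spiking page, one busy-but-normal page.
pages = {
    "/environment/some-story": {"recent": 900, "history": [100, 120, 80]},
    "/football/match-report": {"recent": 5000, "history": [4800, 5200, 5000]},
}

# Rank pages by hotness, highest first: a simple sort on one value.
ranked = sorted(
    pages,
    key=lambda p: hotness(pages[p]["recent"], pages[p]["history"]),
    reverse=True,
)
```

Because the final step is only a sort on one precomputed number, it fits a store that is good at simple select and sort and poor at complex queries.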
“Using Big Table as a dumping ground for data you can sort by 1 or 2 columns when you need to”
Talking of dumping grounds… Guardian employees were effectively creating data sets that, if they didn’t make it into the paper as infographics, weren’t used. Raw numbers were being collected and cleaned up. Today the underlying data will be in a Google Docs spreadsheet, and made accessible on the Guardian website accordingly.
Guardian Datablog is a bunch of Google Docs spreadsheets. Retrieve data as CSV, XLS or JSON. Click “Make a copy” and run your own.
“We want to keep publishing arbitrary data sets, for example ‘output school league tables’ or ‘volcano information’. We want something schema-free.”
Our first option is CouchDB. Create a schema-free database, then index it in Solr.
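The “store schema-free, then index” pattern can be sketched without either product. Below, plain dictionaries stand in for schema-free documents, and a small inverted index stands in for the search engine. The document ids, fields and values are invented, and a real setup would use CouchDB and Solr rather than these in-memory structures.

```python
from collections import defaultdict

# Documents with different shapes live side by side: no shared schema.
documents = {
    "school-1": {"type": "school", "name": "Example School", "rank": 12},
    "volcano-1": {"type": "volcano", "name": "Example Volcano", "country": "Iceland"},
}


def build_index(docs):
    """Map each (field, value) pair to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for field, value in doc.items():
            index[(field, str(value).lower())].add(doc_id)
    return index


index = build_index(documents)
volcanoes = index[("type", "volcano")]
in_iceland = index[("country", "iceland")]
```

The store never needs to know which fields exist; the index discovers them as documents arrive, which is what makes arbitrary data sets cheap to publish.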
We have changed from the relational database being at the center of the world to a mix of datastores and models.
Disclosure: Oracle is not a client. VMware, which is, recently acquired Redis.

