What follows is something like a live blog, based on comments from Matthew Wall and Simon Willison of The Guardian at the NoSQL EU conference in London today.
Wall kicked off the talk with a question about NoSQL: is it a good name for the phenomenon? He says not really, pointing out the absurdity of calling SQLite and MySQL “old world databases” as opposed to “new world” key-value stores.
[This point resonates strongly with RedMonk thinking. Stephen and I have both been wary of reductionist approaches to defining NoSQL – we feel Hadoop style Big Data for example should be thought of as a related trend]
Where is The Guardian today? It’s a modern, information-driven web site driven by tags and feeds.
“It’s a traditional three-tier web app, with a large Oracle database at the center of the world. People might have thought we’re cooler than that, but we’re not.”
“The Guardian took the decision to stick with the traditional relational model 5 years ago. The kind of tools we’re beginning to use weren’t as mature back then. A key reason for sticking with Oracle was the maturity of the surrounding tools ecosystem (performance management and optimisation, backup) and available skills.
SQL has worked well for the paper. SQL is great. We can do cool stuff with it. At scale.”
Searching one tag is ok, but what about two? What does it do to the database?
“Related content” was 40% of the Guardian’s app load, so the team used a search engine instead. The search engine approach, using Apache Solr, worked well, but scale issues were still likely to become a problem.
Willison suggested the Guardian stick a massive memcached in front instead.
It worked. But what about throwing more resource at Oracle instead?
“We wanted to avoid Oracle RAC because it’s really expensive, but we want to scale out”.
[Oracle RAC is the database giant’s clustering technology.]
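The memcached idea above is the classic cache-aside pattern: check the cache first, and only hit the expensive backend on a miss. Here is a minimal sketch of that pattern, not the Guardian’s actual code. `TTLCache` is a hypothetical in-process stand-in for memcached so the example runs without a server, and `related_content` is a made-up placeholder for the expensive query.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Hypothetical in-process stand-in for memcached: get/set with expiry."""

    def __init__(self) -> None:
        self._store: Dict[str, Tuple[float, object]] = {}

    def get(self, key: str) -> Optional[object]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            # Expired entries behave like misses.
            del self._store[key]
            return None
        return value

    def set(self, key: str, value: object, ttl_seconds: float) -> None:
        self._store[key] = (time.monotonic() + ttl_seconds, value)


def cached_lookup(cache: TTLCache, key: str,
                  loader: Callable[[], object],
                  ttl_seconds: float = 60.0) -> object:
    """Cache-aside: try the cache, fall back to the slow backend on a miss."""
    value = cache.get(key)
    if value is None:
        value = loader()
        cache.set(key, value, ttl_seconds)
    return value


backend_calls = []


def related_content(tag: str) -> list:
    # Made-up placeholder for an expensive database or search query.
    backend_calls.append(tag)
    return [f"article about {tag} #{n}" for n in range(3)]


cache = TTLCache()
first = cached_lookup(cache, "related:environment",
                      lambda: related_content("environment"))
second = cached_lookup(cache, "related:environment",
                       lambda: related_content("environment"))
```

The second lookup is served from the cache, so the backend is only called once. The trade-off is staleness: readers may see results up to `ttl_seconds` old.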
The Guardian’s Business Drivers: linked data, social networks. There is all sorts of information out there. We need to engage with them. We can’t just broadcast the news…
The Guardian’s editor called for the organisation to Mutualise the News.
“We’re changing the platform because of the business change. New technologies: we have a real need to use them… blurring the line between journalists and readers.”
“Journalism is becoming the curation of all the world’s information”.
[Note: Google’s automated curation seems to be winning at this point… which explains why the Guardian is responding in the way it is.]
What happens with API access, which drives, for example, tag proliferation, which in turn dramatically increases load on the database?
“Apache Solr is like a database, it works like one for us”
Fields can be multi-value. One piece of content with five tags can be stored in one field. Most important is that Solr offers the ability to facet the content: apply it *like* a tag…
For example: an editor’s star rating. We can facet on that for free, and just jump to all the three-star albums. Facets can be combined much more quickly than in a relational database.
With Solr we can perform complex queries, filter by facets.
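To make the faceting point concrete, here is a rough sketch of how a faceted, filtered Solr request might be put together. The parameters `q`, `fq`, `facet`, `facet.field`, `rows` and `wt` are standard Solr query parameters. The base URL and the field names `tag` and `star_rating` are assumptions for illustration, not the Guardian’s real schema.

```python
from urllib.parse import parse_qs, urlencode


def build_facet_query(base_url, text, filters, facet_fields, rows=10):
    """Build a Solr select URL that filters on some fields and facets on others."""
    params = [
        ("q", text),
        ("rows", str(rows)),
        ("wt", "json"),
        ("facet", "true"),
    ]
    # Each fq narrows the result set; several fq parameters are ANDed together.
    for field, value in filters:
        params.append(("fq", f'{field}:"{value}"'))
    # Each facet.field asks Solr for value counts on that field.
    for field in facet_fields:
        params.append(("facet.field", field))
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical endpoint and field names.
url = build_facet_query(
    "http://localhost:8983/solr",
    "*:*",
    filters=[("tag", "music"), ("star_rating", "3")],
    facet_fields=["tag", "star_rating"],
)

# Decode the query string again to show what was sent.
decoded = parse_qs(url.split("?", 1)[1])
print(decoded["fq"])
```

Combining facets is just a matter of stacking filters, which is roughly the “jump to all the three-star albums” case described above.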
“On our data set, most queries are about the same cost. No transactions.”
With Solr, schema design is very important: the schemas are more flexible and fuzzy than relational ones.
This is about getting data out of the system: powering the Guardian’s iPad app, site components, editors’ tools off the API, with far more to follow. But what about getting data in?
The Guardian has also built a simple REST/HTTP framework for apps that don’t affect the data store, for example for sucking in live football scores.
At this point the talk speeded up dramatically. Willison talks a lot faster than Wall. Never mind the high level stuff: if you’re a real dork I recommend you go straight to the source and check out Simon’s slides from the Redis workshop at NoSQL EU.
NoSQL for journalism
“I am working at the Guardian because I am interested in the opportunity to build rapid prototypes that go live: apps that live for two or three days. My interest is how NoSQL can help support journalism.”
Rapid prototyping needs things that scale down as well as up and handle massive spikes (if you’re on the front page). The quickest way to do lookups was to use Redis.
Version 1 of the Guardian’s Investigate Your MP’s Expenses app was not Redis enabled.
The initial application generated 468k rows, randomised, every time someone hit the button!
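My notes don’t spell out the exact query, so treat this as an assumption: one reading is that every click shuffled the whole table just to pick a single row. The sketch below contrasts that with picking one member directly, which is the kind of lookup Redis offers through its `SRANDMEMBER` command on a set. Plain Python lists stand in for both the table and the Redis set here, and the page IDs are made up.

```python
import random


def pick_by_shuffling(rows, rng):
    """Give every row a random sort key, sort, take the first: O(n log n) per click."""
    return sorted(rows, key=lambda _row: rng.random())[0]


def pick_directly(rows, rng):
    """Pick one row by random index: constant time per click."""
    return rows[rng.randrange(len(rows))]


rng = random.Random(2010)

# Made-up page IDs, sized to match the 468k rows mentioned in the talk.
page_ids = list(range(1, 468_001))

# The shuffle approach is only run on a small slice here to keep the demo quick.
slow_choice = pick_by_shuffling(page_ids[:1000], rng)
fast_choice = pick_directly(page_ids, rng)
print(slow_choice, fast_choice)
```

Both functions return a uniformly random row. The difference is how much work each click costs, which matters a lot when the app is linked from the front page.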
Guardian Zeitgeist, meanwhile, doesn’t use Redis. The app attempts to highlight stories on the Guardian that are interesting, judged by the amount of conversation about them on social networks. It looks for peaks, i.e. a page in the Guardian’s Environment section that gets more traffic than normal.
So use message queues and cron jobs: pull data, put it on a task queue, then calculate hotness. Feed the results into Bigtable, running on Google App Engine, which is not great at complex queries, but good at simple select and sort.
“Using Bigtable as a dumping ground for data you can sort by 1 or 2 columns when you need to.”
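The talk didn’t give a formula for hotness, so the following is only one plausible sketch: score each page by how far its current traffic sits above its own recent average, then do the simple “select and sort” on that one column. The paths and hit counts are invented for illustration.

```python
from statistics import mean


def hotness(current_hits, history):
    """Ratio of current traffic to the page's own recent average."""
    baseline = mean(history) if history else 0.0
    if baseline == 0:
        # No usable history: fall back to the raw hit count as the score.
        return float(current_hits)
    return current_hits / baseline


# Invented data: path -> (hits in the latest window, hits in earlier windows).
pages = {
    "/environment/story-a": (900, [100, 120, 80]),
    "/football/story-b": (5000, [4800, 5200, 5000]),
    "/science/story-c": (40, [50, 45, 55]),
}

# Precompute one score per page, then sort by that single column.
scores = [(path, hotness(now, past)) for path, (now, past) in pages.items()]
ranked = sorted(scores, key=lambda item: item[1], reverse=True)

for path, score in ranked:
    print(f"{score:5.2f}  {path}")
```

The football page has by far the most traffic, but the environment page ranks first because it is running at nine times its normal level. Precomputing the score in a background job keeps the read side down to a sort on one column.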
Talking of dumping grounds… Guardian employees were effectively creating data sets that, if they didn’t make it into the paper as infographics, weren’t used. Raw numbers were being collected and cleaned up. Today the underlying data will be in a Google Docs spreadsheet, and made accessible on the Guardian website accordingly.
Guardian Datablog: a bunch of Google Docs spreadsheets. Retrieve data as CSV, XLS, JSON. Click “Make a copy” and run your own.
“We want to keep publishing arbitrary data sets, for example ‘output school league tables’ or ‘volcano information’. We want something schema free.”
Our first option is CouchDB. Create a schema-free database, then index in Solr.
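How the schema-free documents would get into Solr wasn’t covered, so this is a guess at the shape of that step rather than the Guardian’s pipeline. The sketch flattens arbitrary JSON-style documents into typed field names that a search index could accept. The suffixes (`_s`, `_i`, `_f`, `_b`, `_ss`) are a naming convention assumed here, loosely modelled on Solr dynamic fields, and the sample documents are made up.

```python
def to_index_fields(doc):
    """Map a schema-free document to suffix-typed field names for indexing."""
    fields = {}
    for key, value in doc.items():
        if key.startswith("_"):
            # CouchDB keeps metadata in underscore fields; keep only the id.
            if key == "_id":
                fields["id"] = value
            continue
        # bool must be checked before int, because bool is a subclass of int.
        if isinstance(value, bool):
            fields[key + "_b"] = value
        elif isinstance(value, int):
            fields[key + "_i"] = value
        elif isinstance(value, float):
            fields[key + "_f"] = value
        elif isinstance(value, list):
            fields[key + "_ss"] = [str(item) for item in value]
        else:
            fields[key + "_s"] = str(value)
    return fields


# Made-up documents with different shapes, as a schema-free store allows.
docs = [
    {"_id": "school-1", "_rev": "1-abc", "name": "Example School",
     "pupils": 420, "score": 71.5},
    {"_id": "volcano-1", "name": "Example Volcano",
     "erupting": True, "countries": ["Iceland"]},
]

indexed = [to_index_fields(doc) for doc in docs]
for entry in indexed:
    print(entry)
```

The point is that two documents with nothing in common beyond an id can land in the same index without anyone agreeing a schema up front.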
We have changed from the relational database being at the center of the world to a mix of datastores and models.
Disclosure: Oracle is not a client. VMware, which is, recently acquired Redis.
monkchips says:
April 20, 2010 at 11:38 am
James Governor’s Monkchips » The Guardian: NoSQL EU. Don’t Melt The Database http://bit.ly/93dlPC #nosqleu cc : @matwal @simonw
This comment was originally posted on Twitter
monkchips says:
April 27, 2010 at 3:26 pm
The Guardian: #NoSQLEU. Don’t Melt The Database http://monk.ly/d3nBCe updated with link to @simonw’s much praised Redis Workshop.
This comment was originally posted on Twitter