Jumping aboard the NOSQL train

NOSQL databases have become pretty trendy lately – especially with the news that big players like Facebook and Twitter are using the technology to scale into arbitrary dimensions. And there are already a number of players on the court: projects like CouchDB and MongoDB are becoming more and more popular. In fact, jumping aboard the hype train seems to be worth the risk: scaling where only the sky is the limit, transparent failure tolerance, easy-to-use schema-free document storage and a nice mapping to RESTful and cloudy concepts carry the promise of previously unknown efficiency, both for the programmer and for the horsepower of today’s servers.

So you decide that you’re the adventurous kind of coder and you definitely want to jump on the train and use a NOSQL database system to make the world better. You fire up your favourite search engine to find the best one out there, and there you are: a not-so-impressive number of available products, every project of course claims to have the fastest DB, some are written in obscure languages, and many of them have features and buzzwords whose purpose you can only guess: key/value stores, document stores, MapReduce, eventual consistency, auto-sharding, versioning… So here we are at the very core of this blog post: I want to give you a quick primer in NOSQL-speak and kick-start you into making the right choice for implementing your visions.

So let’s begin with the 1337 speak…

  • key/value vs. document store: Think of a key/value store as a really, really big and fast distributed hash map – nothing more, nothing less. It can’t do much and nothing complex, but what it does, it does really, really fast. Document stores, on the other hand, are more like a layer above: at their core they do pretty much the same thing, but they are aware of what they store. They force more structure onto your data so they can understand its content, and in return they let you query your data via that structured content. That doesn’t mean you’re specifying a schema as you would in an RDBMS – it means you have a defined format for putting values into your store (e.g. JSON documents).
  • MapReduce: Chances are you have heard about MapReduce. You know that it’s this fancy functional programming thing that makes some of Google’s apps scale so well, but you’re uncertain how that concept fits into the database context. The general idea of MapReduce is heavy parallelism and arbitrary scaling. You start by writing a little “map” function (say in JavaScript) that works on one data item (say a JSON document) at a time and somehow transforms it. This transformation can be to skip it, to produce a number of new docs, to count some words, or simply to rewrite a single property of your JSON object – the important thing is that you only see one document at a time in the “map” phase. Next you write a little “reduce” function. This time you see all the resulting documents that came out of the “map” step. Think aggregation. You can, for example, sum up some numeric fields or find the maximum value – or you can do nothing at all if you only want to use the map step’s filtering mechanism. Depending on the reduce function, this step can be executed in a distributed fashion as well – but that doesn’t always have to be the case, and I wouldn’t count on the document store of your choice to even try it. So what I am saying is that MapReduce is the SELECT/WHERE/GROUP BY of document stores: you do filtering, you do projection, you do aggregation. But you do it as distributed as possible.
  • Eventual consistency: This must sound really bad to you if you’re working for a debit card company – and it should. It’s exactly what it sounds like: you don’t commit, and you don’t get the promise of immediate visibility across all nodes. But when you are Facebook or Twitter you really don’t care that much – if Sally’s poke takes a minute to pop up on Kara’s profile, the world won’t stop spinning. No – when you’re Facebook you’re happy when you can handle the load that your users produce and everything pops up everywhere, eventually. This idea is more “cloudy” than ACID. Think of your smartphone as a cluster node: you can’t guarantee that local changes immediately propagate to your home server at commit time if you have really bad reception in downtown nowhere, so you don’t even try to promise that. Instead you promise to sync eventually, once the reception gets better. Scale this concept to a server farm and you start to realize how to achieve fault tolerance and really fast updates in such an environment. And chances are your updates propagate pretty quickly… So if you’re not booking debit card bills you’ll probably be happy with eventual consistency.
  • Auto-sharding: With a huge number of nodes it becomes difficult to decide which documents to put where – the administrative overhead (think load balancing and fault tolerance) would most likely eat all your coding time. Auto-sharding takes that burden away and lets the DBMS take care of it. In the key/value or document store concept this is much easier than with an RDBMS, since processing inherently takes place in a distributed fashion anyway. So you can in fact re-balance the data storage on a per-document basis as you please – and introduce an arbitrary amount of redundancy for fail-safety. There’s also no notion of master or slave in most parts of the NOSQL world – everything is peer-to-peer (again more “cloudy”). The good thing is that most (if not all) distributed stores out there take care of auto-sharding for you.
  • Versioning: That’s another one that is exactly what you think it might be. Well – almost. It’s something not all of the new datastores out there are doing right now, and in those that do, versioning comes in two flavors. Some only tag stored values/documents with a revision number, while others actually keep all revisions of a document they have seen. While the second option can be a benefit in a scenario where you really do some sort of versioning, the first one is another key ingredient of fault tolerance and replication in the cloud environment. Since every item has a unique revision identifier, one can determine which document changes have already been propagated and even detect conflicts during sync operations. Using revision handling to resolve conflicts is nothing new – Multiversion Concurrency Control (MVCC) is also used in PostgreSQL, for example. How merge conflicts are handled is product-dependent though.
  • REST: I bet most readers know about this one since it has been buzzing around for quite some time now, but I include it for the sake of completeness. REST (REpresentational State Transfer) is a paradigm that postulates easy-peasy programming once you get statefulness out of your code. REST, CRUD and HTTP go nicely together most of the time – and they tend to play nicely with key/value and document stores. In fact they are such close friends that a sizable number of these stores directly feature an HTTP REST interface or even rely on it exclusively (CouchDB). And there aren’t many good reasons not to. Virtually any programming environment can access it, HTTP is easy to grasp, and there are more than enough tools out there that make your life easier once you use HTTP as your primary networking language. And once you are RESTful (which also means there’s no such thing as transactions) you can send your requests to arbitrary nodes for load balancing – there’s no master after all.
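
To make the MapReduce item above a bit more tangible, here is a minimal single-process sketch in Python. It is not the API of any particular store: the sample documents and the names `map_doc`, `reduce_counts` and `run_map_reduce` are made up for illustration, and grouping the mapped values by key before reducing is an assumption that mirrors the GROUP BY analogy. A real store would run the map step on many nodes in parallel; this only shows the shape of the two functions.

```python
from collections import defaultdict

# Hypothetical sample documents; in a real store these would be JSON documents.
DOCS = [
    {"_id": "1", "user": "sally", "text": "poke poke"},
    {"_id": "2", "user": "kara", "text": "hello world"},
    {"_id": "3", "user": "sally", "text": "hello again"},
    {"_id": "4", "text": "no user field, so map skips it"},
]


def map_doc(doc):
    """Map step: sees exactly one document, yields zero or more (key, value) pairs."""
    if "user" not in doc:
        return  # filtering: emit nothing for this document
    # projection: keep only the user and the word count of the text
    yield doc["user"], len(doc["text"].split())


def reduce_counts(key, values):
    """Reduce step: aggregates all values that were emitted for one key."""
    return sum(values)


def run_map_reduce(docs, map_fn, reduce_fn):
    """Toy driver: map every document, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:  # each call to map_fn is independent of all the others
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


if __name__ == "__main__":
    # Words written per user, ignoring documents without a user.
    print(run_map_reduce(DOCS, map_doc, reduce_counts))
```

Because `map_doc` never looks at more than one document, the loop in `run_map_reduce` could be split across as many machines as you like; only the grouping and the reduce need to see results from more than one document.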

Now that some basic terms have been established you should already have an idea of how and why those tools scale so well: the database concept as we know it from our all-in-wonder SQL beasts is dismantled until there’s not much more left than a rather simplistic way of assigning documents to identifiers, and then new concepts are built on top of that which don’t destroy parallelism and performance. Of course there are casualties, but also opportunities to see a DBMS from a totally new perspective. Once you get over the strange feeling of having no schema and no common query language and start to embrace the new tools you have at hand, there’s a good chance that you and the NOSQL database of your choice will become really good friends.
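
As a rough illustration of how little is left at the core, here is a toy in-memory store in Python that does nothing but assign documents to identifiers and tag each one with a revision number, in the spirit of the first versioning flavor described above. The class `TinyStore`, the `expected_rev` parameter and `ConflictError` are hypothetical names for this sketch, and rejecting stale writes outright is just one simple conflict policy assumed here; actual products handle conflicts in their own ways.

```python
class ConflictError(Exception):
    """Raised when a write is based on an outdated revision."""


class TinyStore:
    """Toy key/value store: doc_id -> (revision, document). Illustration only."""

    def __init__(self):
        self._data = {}

    def get(self, doc_id):
        """Return the (revision, document) pair stored under doc_id."""
        return self._data[doc_id]

    def put(self, doc_id, document, expected_rev=None):
        """Store a document if the caller has seen the latest revision.

        expected_rev must be None for a new document, or the revision the
        caller last read for an existing one.
        """
        current_rev = self._data[doc_id][0] if doc_id in self._data else None
        if current_rev != expected_rev:
            raise ConflictError(
                f"{doc_id}: write based on rev {expected_rev}, store has rev {current_rev}"
            )
        new_rev = (current_rev or 0) + 1
        self._data[doc_id] = (new_rev, dict(document))
        return new_rev


if __name__ == "__main__":
    store = TinyStore()
    rev1 = store.put("sally", {"pokes": 1})                     # new document
    rev2 = store.put("sally", {"pokes": 2}, expected_rev=rev1)  # normal update
    try:
        store.put("sally", {"pokes": 5}, expected_rev=rev1)     # stale revision
    except ConflictError as err:
        print("conflict detected:", err)
```

The revision check is the only coordination in the whole thing: there is no schema, no query language and no transaction, which is exactly why this kind of core is easy to spread across many nodes.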

This concludes this blog post, and I hope I was able to give you the kick-start into NOSQL-speak I promised. Be sure to check back for the next post on NOSQL that I’ll be authoring together with Matt: it’ll be a real showdown of a number of key/value and document stores with all the feature comparison you could possibly wish for. We did some research into some of the fairly good comparisons out there, but they tend to become outdated pretty quickly in such a blazingly fast-evolving field, so we ended up doing our own evaluation. Right now we’re in the middle of the battle for the first prize: we had a tie in the first round and now we’re really taking the two finalists apart to see what they can do – and the crystal ball says it’ll be a close one… Stay tuned!

Happy distributing,


Maximilian Hainlein

I've been working for crealytics as Social Media and Marketing Manager since 2011. My motto: "It's better to be the needle than the haystack."