Why beat the SQL drum?

Renen Watermeyer's picture

A few days ago I was asked, in our freenode forum (#knowledgetree), where the KnowledgeTree team stood on the various no-Sql database options that are currently so trendy.

My immediate reaction was: well they’re not there yet...

Then, last night, some friends asked me why I even continue to beat the SQL drum at all. This took me by some surprise (and precipitated a passionate and heated argument!).

Every time I look at the no-Sql space it becomes a little more defined. Clearly it’s time to re-evaluate and unpack this topic in a little more detail. And KnowledgeTree provides a great context for this analysis.

Several of the no-Sql databases look like they could be great fits for KnowledgeTree. Hell, they even call one class of these databases “Document Stores.” In particular, MongoDb and CouchDb stand out. They store large files very efficiently and they are reasonably quick (certainly MongoDb has a great reputation).

Then there are also Graph databases (Neo4J, etc.,) which treat relationships as first class objects. And, tags and folders are relationships of a sort. Should they be considered as potential contenders?

To explore this problem a little more concretely, I’m going to build my analysis around some of the obvious functionality required by KnowledgeTree: document stores.

But first, the bigger picture

Before we start, I should dwell briefly on the no-Sql database types I don’t address head on: graph databases, Jackrabbit and key-value stores.

Graph databases: While they feel to me less mainstream than their more prominent brethren, they are very intriguing. For example, Neo4J’s Solr integration might make it a very good fit for KnowledgeTree’s context. I am going to gloss over them in this discussion and will (hopefully!) return to them in a future post.

Similarly, I have spent some time with Apache Jackrabbit (and anticipate spending much more). But, when people talk to me about no-Sql, I don’t normally think of Jackrabbit. And, it warrants it’s own discussion.

I also gloss over key-value stores. This is not intentional and they do deserve significant consideration. But, while you can accomplish an enormous amount with a domain-key-value tuple, I think that there is more merit to the more structured approach of the document databases. Hence the bias reflected below.

Our Problem Domain: KnowledgeTree Documents

A document consists of the document itself, along with a set of properties (“metadata”), tags, and, importantly, one or more relationships to a hierarchically arranged folder structure. So, from a database perspective we should understand how we are going to treat the following:

  • Documents and filesFolders
  • A security modelTags
  • Properties and metadata

Documents and files

There is a certain amount of consensus that suggests that our current approach, storing documents as files in a file system (rather than in a database), is on the mark (see http://stackoverflow.com/questions/3748/storing-images-in-db-yea-or-nay. They are very different, but I also note that Google, Facebook and Flickr do not, generally, store files in databases).

Keeping our documents out of the database has brought us the opportunity to refactor our model into a number of storage drivers (hash based, path based and more recently, S3 based).

However, several of the no-Sql variants provide the capacity to store files. MongoDB stands out with its GridFS. But, I’m torn. It’s quite possible to shove files into MySql. But, even Windows works pretty well as a file store. I’m not convinced that MongoDB would outperform Linux.

What storing files in a no-Sql database might assist you with is creating a distributed, scalable architecture. But, if that’s your game, creating another storage driver to leverage the advantages of these stores may well prove a simpler and more proven bet. It is, after all, why we created our S3 driver.

Certainly, from a file handling perspective (and very generally), it seems that the no-Sql solutions can do everything that Sql implementations can. Some no-Sql products hold out the possibility of storing blobs in distributed, fault-tolerant databases. This is enticing – but, probably overkill for the bulk of the contexts we support.

And, the flexibility that we have garnered by keeping our documents out of the database remains compelling.

I’m not sure that I can award this point to either argument. So, round one: nil-nil.

Folders

In every SQL store I’ve worked with, hierarchies have proved very hard to implement (although, I note that the bigger providers generally do include hierarchical data types).

Either, you can have a clean, simple model – and terrible performance, or you can have a messier model with better read performance, but appalling update performance. Or, you can have a complex set of caches and materialized views that get you the best of both reading and writing. But, with spectacular complexity (Google “adjacency list model,” “nested set model” or “materialized paths” if you want to dig deeper).

But, with the possible exception of Graph databases, I’m not sure that the no-Sql contenders are materially better. Witness this thread exploring the options for document-oriented databases: http://seancribbs.com/tech/2009/09/28/modeling-a-tree-in-a-document-database. Clearly, modeling trees in no-Sql is fraught with many of the same pitfalls that you run into using traditional Sql!

Key-value stores don’t seem (to me anyway) to bring much to the party either. Clearly, you can implement a hierarchy in a key-value store. But, it is my sense is that the model is going to be complicated and traversing parent-child relationships is likely to be hard and slow.

I’m going to declare round 2 two a tie. Nil-nil.

Security

Perhaps I’m utopian, but I really would like the database infrastructure to provide real, usable, leverageable security for me. In Sql terms, I am after row level, user driven security – I want to be able to pass my client-tier credentials through to the database (and, no, triggers don’t cut it – you shouldn’t be able to read data from tables where you don’t have the right permissions). I’m not completely sure how I would use this given conventional approaches to connection pooling. But, hey, it could be damn useful.

The Mongo security model appears deliberately sparse. By default, it doesn’t provide any kind of authentication model (even to the database). Which is sensible: odds are the application business logic will require the developer to implement his or her own model anyway. Never mind that the limitations of connection pooling generally force you into your own space.

My sense is that the bulk of the other no-Sql products adopt the MongoDB approach: SimpleDB, Cassandra etc.

But then there’s CouchDb! CouchDb has a model that is extremely tantalizing (see Security and Validation, http://CouchDb.apache.org/docs/overview.html) with customizable, row or object level security. The possibilities are definitely exciting.

CouchDb wins this prize. 1, CouchDb, zero to the rest! :-)

Tags

Interestingly, my sense is that, with the exception of graph based databases, tagging resolves to an implementation issue: which schema or model is most appropriate.

Although, a little thinking makes me conclude that keeping tags consistent in a key-value store could be damn tricky. Tags can be added easily enough. But, asking for a list of everything tagged “interesting”, “simple” and “document management” could be hard – you are going to have to write some application level code to empower this.

I’ve made a mental note to return to this issue and to actually write some of that code.

For now, it’s still nil-nil in the main, with CouchDb ahead on the left and key-value stores trailing at -1 because of the challenges tagging seems to throw out!

Properties and Meta-Data

This must fall to the schema-less models of the no-Sql contenders. Sure you can create key-value pair like structures in Sql. But, row-size limitations and abysmal indexing options make them poor substitutes for the implementation offered by both the key-value and document stores, with the document stores edging out the key-value stores from convenience and simplicity perspectives.

To my mind, this is a clear win for the no-Sql crew. 1-0 (and, 2-0 to CouchDb!).

So why beat the SQL drum?

2-0. I’m not completely surprised. To quote MongoDb.com, “databases are specializing and ‘one size fits all’ no longer applies”. We are moving into a different world and, using the heuristics we have just examined, it was unlikely that the Sql model would have proved the best fit. The message is clear: it is no longer acceptable to narrowly constrain the database choice.

But, for almost all of what we do, the my-Sql, file system combination does a pretty good job. Certainly, where we have run into performance constraints, they are of our own making and are surmountable.

So, what would it take to warrant a complete rethink? To catalyse a shift to another platform entirely? I think three things:

  1. Broader acceptance in the developer community, and, specifically, within our community. I need to be confident that the drivers are robust, that there is adequate documentation around the platforms, and that the understanding and common wisdom around the products is entrenched.
  2. I need to better understand the choices. Specifically, writing this post highlighted that I need to take a closer look at Neo4J and its connections to Solr. In fact, I need a much more certain understanding of graph databases in general
  3. Pressure from the community, or specific product requirements that necessitate the adoption of some kind of no-Sql product

It seems, though, that I might have to pack my drum away!

Watch this space.