Running RavenDB on Azure

Microsoft’s Azure platform provides excellent data storage facilities in the form of the Windows Azure Storage service, with Table, Blob and Queue stores, and SQL Azure, which is a near-complete SQL Server-as-a-service offering. But one thing it doesn’t provide is a “document database”, in the NoSQL sense of the term.

I saw Captain Codeman’s excellent post on running MongoDB on Azure with CloudDrive, and wondered if Ayende’s new RavenDB database could be run in a similar fashion; specifically, on a Worker role providing persistence for a Web role.

The short answer was, with the current build, no. RavenDB uses the .NET HttpListener class internally, and apparently that class will not work on worker roles, which are restricted to listening on TCP only.

I’m not one to give up that easily, though. I’d already downloaded the source for Raven so I could step-debug through a few other blockers (to do with config, mainly), and I decided to take a look at the HTTP stack. Turns out Ayende, ever the software craftsman, has abstracted his HTTP classes and provided interfaces for them. I forked the project and hacked together implementations of those interfaces built on the TcpListener, with my own HttpContext, HttpRequest and HttpResponse types. It’s still not perfect, but I have an instance running at

My fork of RavenDB is at – the Samples solution contains an Azure Cloud Service and a Worker Role project. My additions to the core code are mainly within the Server.Abstractions namespace if you want to poke around.


My technique for getting Raven running on a worker role differs from the methods commonly used for third-party software. Generally those rely on running the application as a separate process and getting it to listen on a specified TCP port. With Raven this is unnecessary, since it consists of .NET assemblies which can be referenced directly from the worker project, so that's how I did it.


The role uses an Azure CloudDrive for Raven’s data files. A CloudDrive is a VHD disk image that is held in Blob storage, and can be mounted as a drive within Azure instances.

Mounting a CloudDrive requires some fairly straightforward, boilerplate code:

```csharp
private void MountCloudDrive()
{
    var localCache = RoleEnvironment.GetLocalResource("RavenCache");
    CloudDrive.InitializeCache(localCache.RootPath.TrimEnd('\\'), localCache.MaximumSizeInMegabytes);

    var ravenDataStorageAccount = CloudStorageAccount.Parse(
        RoleEnvironment.GetConfigurationSettingValue("StorageAccount"));
    var blobClient = ravenDataStorageAccount.CreateCloudBlobClient();
    var ravenDrives = blobClient.GetContainerReference("ravendrives");
    ravenDrives.CreateIfNotExist();
    var vhdUrl = ravenDrives.GetPageBlobReference("RavenData.vhd").Uri.ToString();

    _ravenDataDrive = ravenDataStorageAccount.CreateCloudDrive(vhdUrl);

    try
    {
        _ravenDataDrive.Create(localCache.MaximumSizeInMegabytes);
    }
    catch (CloudDriveException)
    {
        // This exception is thrown if the drive already exists, which is fine.
    }

    _ravenDrivePath = _ravenDataDrive.Mount(localCache.MaximumSizeInMegabytes, DriveMountOptions.Force);
}
```

(This code has been trimmed for size; the actual code involves more exception handling and logging.)

Once the drive is mounted, we can start the server:

```csharp
private void StartTheServer()
{
    var ravenConfiguration = new RavenConfiguration
    {
        AnonymousUserAccessMode = AnonymousUserAccessMode.All,
        Port = _endPoint.Port,
        ListenerProtocol = ListenerProtocol.Tcp,
        DataDirectory = _ravenDrivePath
    };

    _documentDatabase = new DocumentDatabase(ravenConfiguration);
    _documentDatabase.SpinBackgroundWorkers();

    _ravenHttpServer = new HttpServer(ravenConfiguration, _documentDatabase);
    _ravenHttpServer.Start();
}
```

Again, trimmed for size.

A few points on the configuration properties:

  • the Port is obtained from the Azure endpoint, which specifies the internal port the server should listen on, rather than the external endpoint which will be visible to clients;
  • I added a new enum, ListenerProtocol, which tells the server whether to use the Http or Tcp stack;
  • AnonymousUserAccessMode is set to All. My intended use for this project will only expose the server internally, to other Azure roles, so I have not implemented authentication on the TCP HTTP classes yet;
  • the DataDirectory is set to the path returned by the CloudDrive Mount operation.

I have to sign a contribution agreement, and do some more extensive testing, but I hope that Ayende is going to pull my TCP changes into the RavenDB trunk so that this deployment model is supported by the official releases. I’ll keep you posted.


(Yes, again.)

There’s an awful lot of buzz around new-fangled "NoSQL" or "non-relational" databases at the moment. I’m hearing a lot of people asking why they’d choose one of these storage models over the relational databases they know and (mostly) like. This post aims to shed some light on the choices available, and maybe to give some guidance if you’re trying to make a decision.

Let’s define some terms

I find it always helps to make sure you’re talking about the same thing if you think you might be arguing.

Relational database management system (RDBMS)

When I say "relational database", I’m referring to an application or service which stores information in tables. Tables are divided into columns, and columns have a type, such as integer, date, text etc. The data held in the tables is divided into rows, and each row contains a value (possibly including NULL) for each of the columns. The RDBMS can maintain and enforce "master-detail" relationships between rows in different tables using primary and foreign keys.

The big commercial relational database systems are Microsoft SQL Server and Oracle; the main open-source alternatives are MySQL, PostgreSQL and SQLite.

New-fangled data stores (NoSQL)

I’m going to use the term NoSQL to cover everything else, which is a massive generalisation and therefore inevitably inaccurate (several new-fangled data stores actually support a SQL-like syntax). However, in the interests of a bit of clarity, I’ll try to outline some of the distinct types here, although the distinctions can be a bit vague.

Key/Value Stores are the most basic NoSQL databases, and are essentially hashes with disk persistence. In some cases the value is a single binary or text object; in others it is a tuple with named properties, sometimes with searching supported against those properties. These solutions are schema-less: entities within a collection, domain or table may have completely different properties. Microsoft’s Azure Table Storage and Amazon’s SimpleDB are both cloud-based, pay-as-you-go Key/Value stores. Open-source implementations include BerkeleyDB, MemcacheDB, Redis and TokyoCabinet.
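The "hash with disk persistence" idea is small enough to sketch. This toy Python class is entirely illustrative (it is not the API of any of the stores named above): a schema-less in-memory hash written through to a JSON file.

```python
import json
import os

class TinyKeyValueStore:
    """A toy key/value store: an in-memory hash persisted to disk as JSON."""

    def __init__(self, path):
        self.path = path
        # Reload existing data if the file is already there.
        self.data = {}
        if os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)

    def put(self, key, value):
        self.data[key] = value
        self._flush()

    def get(self, key, default=None):
        return self.data.get(key, default)

    def _flush(self):
        with open(self.path, "w") as f:
            json.dump(self.data, f)

# Entities are schema-less: two "rows" need not share any properties.
store = TinyKeyValueStore("store.json")
store.put("customer:1", {"name": "Alice", "city": "Leeds"})
store.put("customer:2", {"name": "Bob", "vip": True})
```

The real systems add replication, expiry, concurrency control and so on, but the data model really is this simple.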

Wide Column, Tabular or Column Family Stores are scalable, distributed systems that support a defined schema for each "table", but not relationships between tables. Google’s BigTable database, which is the storage solution for their AppEngine cloud platform, is a Wide Column store. Open-source alternatives include Cassandra and Hypertable.

Document Stores work with data structured using object notation, usually JSON or BSON (not to be confused with document management systems such as SharePoint or Domino). This format allows complex data structures, including arrays and hashes, to be represented within a single entity that can be retrieved in one read operation; document stores also often support complex update operations, such as adding elements to an array within a document. Some popular Document Stores are CouchDB, MongoDB and Riak.

Object Databases are superficially similar to Document Stores, but have predefined schema and inheritance hierarchies. In an Object Database, graphs are persisted using pointers between objects rather than by embedding graphs within other graphs. The most popular Object Database is db4o.

So which is best?




Right, so. Why should I use any of these new-fangled "stores" instead of a relational database?

Mu.

Beg pardon?

I said "mu". I keep hearing this question, and I submit that the answer is "mu", which means "your question cannot be answered because it depends on incorrect assumptions."

What incorrect assumption is that?

The assumption that the relational database is axiomatically the correct storage solution in all situations, and that the onus is on all other options to prove that they are better.

Is that not right then?

No, it isn’t. Have you heard of doing the simplest thing that could possibly work? Or YAGNI? It’s possible to add unnecessary features to a system without coding them yourself. The first thing you should rule out as a storage medium is the file-system.

What, text files? Isn’t that a bit 1980s?

Not at all. Let me ask you a question: what requirements or properties of your application preclude using one or more flat text files to persist your data?

I need structured data.

What’s wrong with tab-delimited, CSV or even fixed-length field files?

I need master-detail structured data.

In which case, you could use XML, YAML or even JSON. Let’s say we’re using XML, even though it’s quite the worst of those three options.
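(As an aside: master-detail in XML is just nesting. A quick sketch with Python's standard library, using made-up customer and order fields:)

```python
import xml.etree.ElementTree as ET

# One master (customer) with its detail rows (orders) nested inside it.
customer = ET.Element("customer", id="42")
ET.SubElement(customer, "name").text = "Alice"
orders = ET.SubElement(customer, "orders")
for order_id, total in [("1001", "19.99"), ("1002", "5.49")]:
    order = ET.SubElement(orders, "order", id=order_id)
    ET.SubElement(order, "total").text = total

xml_text = ET.tostring(customer, encoding="unicode")

# Reading it back: the whole master-detail graph arrives in one parse,
# no join required.
parsed = ET.fromstring(xml_text)
totals = [o.find("total").text for o in parsed.find("orders")]
```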

I need details with many masters.

You mean lookups? Why on earth would you want to use those?

To save space.

Right, because storage space is so expensive these days. You can buy a 2-terabyte hard disk and get change from £120 (or $150US). Keeping a gigabyte in the cloud’ll cost you 15 cents a month.

OK, I don’t need to save space. But I still need one copy of each piece of data, so I only have to update it in one place.

That’s fine, but does that need to be supported by your persistence layer? Are you going to re-read those lookup descriptions every time you pull a record? Add multiple joins to every SELECT statement just to read back those varchar(40) descriptions?

As opposed to…?

As opposed to reading those lookups into memory once the first time they’re needed, and maybe refreshing them in background every now and then if you’re in a multi-user smart-client system and they’re particularly dynamic, which they’re probably not. How often does somebody add a new country to the system? Or a new type of customer, or method of payment, or whatever?
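That read-once-and-cache approach needs only a few lines. A hypothetical sketch, where `load_countries_from_disk` stands in for however your application actually reads the lookup:

```python
import time

_CACHE = {}
_LOADED_AT = {}
REFRESH_SECONDS = 300  # re-read lookups at most every five minutes

def load_countries_from_disk():
    # Stand-in for the real read (a file, a service call, etc.).
    return {"GB": "United Kingdom", "FR": "France"}

def get_lookup(name, loader, refresh=REFRESH_SECONDS):
    """Return a cached lookup table, re-reading it only when it's stale."""
    now = time.time()
    if name not in _CACHE or now - _LOADED_AT[name] > refresh:
        _CACHE[name] = loader()
        _LOADED_AT[name] = now
    return _CACHE[name]

countries = get_lookup("countries", load_countries_from_disk)
```

Every call within the refresh window hands back the same in-memory dictionary; the disk is touched once.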

OK. I accept that lookups are not a valid reason to use a relational database.

Good. So, next requirement or property?

I need many-to-many relationships. For example, in my library application, each customer can borrow many books, and each book (over time) can be borrowed by many customers.

Well, we’re still using XML, so you can have a list of child elements in your customer record detailing the books they’ve borrowed, and a list of child elements in the book record detailing the customers who have borrowed it.
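The dual embedding described above might look like this in code (a sketch with invented field names, using Python dicts in place of XML records):

```python
def borrow(customer, book, date):
    """Record a loan by embedding a summary of each record in the other."""
    # The customer record gets the book details it needs for display...
    customer["loans"].append(
        {"book_id": book["id"], "title": book["title"],
         "author": book["author"], "borrowed": date})
    # ...and the book record gets the customer details it needs.
    book["loans"].append(
        {"customer_id": customer["id"], "name": customer["name"],
         "borrowed": date})

alice = {"id": 1, "name": "Alice", "loans": []}
dune = {"id": 7, "title": "Dune", "author": "Frank Herbert", "loans": []}
borrow(alice, dune, "2010-05-01")
```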

So when somebody borrows a book, I have to update two records? That’s bad, isn’t it?

I don’t know; it’s your application. Do both records have to get updated in the same split second? Is it likely that another user will do a read that could get inaccurate information while the two updates are happening? Or is it enough that the customer and book records should be eventually consistent (for a given value of eventually)?

But then I’d have to read a bunch of files individually to get the details of those books, or the names of those customers.

Not at all.

Well how would you display sensible data on the screen? Are you going to cache the entire book catalogue and customer list in memory as well?

It’s an option. But no, I would probably just decide which details about the book I need in the customer views, and which details about the customer I need in the book views, and then I’d include those details in the child records.

You’d duplicate the data?

Oh, yes.

[Shocked gasps from enthralled onlookers]

You’re mad.

Not a bit of it. Books don’t change their titles, or authors. Customers don’t change their names very often; certainly not so often that it’s beyond reason to bulk-update a bunch of book records when they do.

So you just duplicate the data?

Absolutely. Saves a whole bunch of extra disk reads, and those join tables you keep using in many-to-many relationships, well, let’s be honest: they’re a kludge, and they’re one of the most common causes of performance issues due to volume of data.

I remain unconvinced.

OK. Let’s take your library example. Let’s say your library is in a particularly literate part of the world and has 10,000 active customers, and 100,000 books. And let’s say that each customer borrows two books a week. After one year, your many-to-many join table has 1,000,000 rows in it. Meanwhile, each of my customer records has 100 book record keys, titles, author names, publishers, and loan start- and end-dates, and each of my book records has, on average, 10 customer record keys, names, maybe addresses, and loan start- and end-dates. In storage terms, let’s say you’ve used 32 bytes per row (two 32-bit keys and two 64-bit date/times, plus overhead), so that’s 32MB, plus the same again for indexing. Meanwhile, let’s just hazard a guess and say I’ve used a whole gigabyte with all my duplicated data.
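The back-of-envelope numbers above check out (a 52-week year rounds to the "million rows" and "100 loans per customer" quoted):

```python
customers = 10_000
books_per_week = 2
weeks_per_year = 52

# Rows in the relational many-to-many join table after one year.
loans_per_year = customers * books_per_week * weeks_per_year

# Storage for the join table: two 32-bit keys plus two 64-bit
# date/times, padded out to roughly 32 bytes per row.
bytes_per_row = 32
join_table_mb = loans_per_year * bytes_per_row / 1_000_000

# Embedded equivalents: loan entries per customer and per book.
loans_per_customer = books_per_week * weeks_per_year
loans_per_book = loans_per_year / 100_000
```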

Exactly. Look at all that storage you’re wasting.

My goodness, you’re right! That’s 0.05% of a £120 hard-disk I’ve just frittered away! Or nearly $2 a year for the cloud storage.

There’s no need to be sarcastic.

I wasn’t; I was being ironic. Anyway, my point is, pulling a customer record and displaying the complete list of books they’ve borrowed for me is a single read operation; for you it’s a three-table join with one of those tables containing a million records for each year your software’s been in use.


Actually, scratch that.


Yes, I was wrong.

I thought you were.

I forgot that you’ll have normalised out your book catalogue into a book table with a many-to-many join to an author table, and another many-to-many join to a publisher table, so you’ve actually got seven tables totalling a few million rows that you need to reference to get the same information that I can get in a single read operation. How’s your EXPLAIN output looking, by the way?

I’m not sure. What does DEPENDENT SUBQUERY mean?

No idea, but it sounds nasty. You can fix it later. You’re supposed to be giving me reasons you need a relational database.

So I am. OK, well, I need to search the data.

Aha! Very good. Much closer. That would be a reason for a database management system, but not necessarily a relational one. There are lots of NoSQL solutions that will let you run queries against your datasets, with varying degrees of performance. I’ve written queries in MongoDB that run against massive datasets so fast it’d make your eyes bleed.

[Suddenly, a heavenly light shines down upon the pupil. Or some scales fall from his eyes.]

I’ve been such a fool! I drank the relational database kool-aid and no mistake! Thank you for showing me the error of my ways. I’m going NoSQL all the way for my next project.

What? No, wait, come back.

Why? I’m convinced. You made your case.

No, I haven’t. I’ve just stopped you from being one kind of wrong and you’ve immediately started being another, completely different, but equally bad kind of wrong.

No need to be rude.

Sorry. But flat text files and NoSQL databases aren’t the solution to every persistence problem. There are perfectly good reasons to use a SQL database.

Such as?

You might need the whole ACID thing: Atomicity, Consistency, Isolation and Durability. If your system deals with financial data, or medical records, or other sensitive information, you’re going to want a robust transaction system.


And if you’ve got a lot of concurrent users who might try to update the same bit of data at the same time, like an entry in a Stock table in an e-commerce system, then isolation and locking could be very handy.

And NoSQL systems can’t do ACID and locking and so on?

Oh, they can, one way or another. But a lot of the time you end up writing extra code in your data access layer to do stuff that a decent RDBMS will do for you. And if you’ve got a room-full of developers who are dead good with SQL Server or Entity Framework or whatever, why force them to learn a new technology?

I would have a hard time justifying the time and money required for our developers to learn the new skills.

Then again, in the cloud the NoSQL storage can be much, much cheaper than the RDBMS. For example, Windows Azure storage costs $0.15 per gigabyte per month, whereas a 1GB SQL Azure instance is $9.99 per month; 10GB is $99.99 per month.

So Windows Azure storage is cheaper than SQL Azure storage?

Not necessarily. With Windows Azure Storage you also pay for transactions, at the rate of $0.01 for every ten thousand. SQL Azure doesn’t charge for transactions. So if you’re storing less than a gigabyte but running more than ten million transactions per month, SQL Azure would be cheaper. Or if you’re storing less than ten gigabytes but running more than a hundred million transactions-
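The break-even arithmetic, at the prices quoted, works out like this:

```python
WAS_STORAGE_PER_GB = 0.15   # Windows Azure Storage: $ per GB per month
WAS_PER_10K_TXNS = 0.01     # plus $ per 10,000 transactions
SQL_AZURE_1GB = 9.99        # SQL Azure: flat $ per month for 1GB
SQL_AZURE_10GB = 99.99      # and for 10GB

def was_monthly_cost(gigabytes, transactions):
    """Windows Azure Storage bill for one month."""
    return gigabytes * WAS_STORAGE_PER_GB + transactions / 10_000 * WAS_PER_10K_TXNS

# Transactions per month at which 1GB in Windows Azure Storage
# costs the same as a 1GB SQL Azure instance.
breakeven_1gb = (SQL_AZURE_1GB - 1 * WAS_STORAGE_PER_GB) / WAS_PER_10K_TXNS * 10_000
```

Which is where the "more than ten million transactions per month" figure comes from: the break-even point is a shade under ten million.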

Stop! I’m all confused. Please, just tell me, what should I use?

You want a definite answer?



There really is one?

There really is one.

And are you going to tell me what it is?

I am.




Although I don’t think you’re going to like it*.

It doesn’t matter! I must know!


Yes, now!



You’re really not going to like it.

Tell me!

The answer is…


The answer is: It All Depends.

I want the last 5 minutes of my life back.

5 minutes? I’ve spent hours writing this.


Anyway, the point is…

…every system is different. Which persistence paradigm is best depends on the project and its requirements and properties. All I want to say is that you shouldn’t start with an RDBMS and then look for reasons not to use it; at the same time, don’t start with Project Voldemort or CouchDB just because they’re new and exciting. Approach every project with no preconceptions about what you’re going to use, or at most with "flat text files" as the default choice.

A non-exhaustive list of things to consider:

  • If constant consistency is a key requirement then you should probably use an RDBMS.
  • If horizontal scalability is important and eventual consistency is good enough, then a NoSQL solution might be better.
  • If you’re deploying an application to Windows Azure or Amazon Web Services, then you should probably compare the costs of Windows Azure Storage Services or SimpleDB to SQL Azure or Elastic RDBMS for your particular application.
  • If you’re writing an application for a mobile platform, you’re back to limited storage space, so you might return to an RDBMS and normalise for efficient use of that space.
  • If you’re writing a small application you want to knock out quickly using Entity Framework or NHibernate or ActiveRecord then use SQL Server Compact or SQLite, or skip that stuff altogether and serialise object graphs to files on the hard-disk.

In fact, if you really design your architecture with an open mind (or beginner’s mind), you’ll come up with a combination of technologies and paradigms that properly meets the different parts of the requirements: maybe SQL Server 2008 R2 for stock control, invoicing and geo-spatial search operations, but Azure Table Storage for auditing and text files for logging. And why not evaluate distributed Document Stores for the awesome customer review and discussion features that you’re adding in phase two?

Most important of all, before making any decisions, take the time to do some apples-to-apples comparisons of potential solutions, with use cases, data volumes and loads that are representative of your system. I recently chose MongoDB for a personal project, but I started out comparing SQL Server, MySQL and PostgreSQL and finding them all wanting, which is how I ended up looking at NoSQL databases in the first place.

(Just for the record, the most important and most frequently-run query for the application took nearly 15 seconds on the fastest relational database (even after some violent denormalisation); MongoDB can retrieve the same information in under half a second.)

Further reading

A comprehensive – and very balanced – list of articles on NoSQL databases:


*All apologies to Douglas Adams.

