Node.js: From Couch to Mongo

This post is part of a series of articles about my recent experience building Sled using Node.js.

I started with CouchDB and ended with MongoDB.

Working with CouchDB was fantastic. It took no time to learn the REST API and jump right into building the application. I had the first version of the base API ready in 2 days, including full database integration. Couch is a document store using JSON as the document format. Combined with an HTTP REST API, it makes Couch an ideal fit for Node. That’s exactly why I picked it.

I dismissed MySQL from the very beginning for two reasons. First, I had no idea what my schema would look like, and knew I was going to change it a lot over time, as well as allow unstructured data. Second, MySQL is a blocking database by nature (database deadlocks, atomic actions, etc.), and while you can use it with Node, it really wasn’t designed for this kind of environment. Also, at the time, there was no quality driver for Node. Oh, and I really wanted to play with a NoSQL database.

I started out talking directly to the database, but within a few hours switched to using Cloudhead’s excellent cradle module. The module provides a light layer for managing the HTTP client calls, error handling, simple caching, and common macros for performing multiple database requests in a single function call. It is a very light abstraction layer (and it doesn’t abstract too much either). This worked very well for a few months. I didn’t use cradle’s built-in cache because of the expected size of my database and because I was constantly tweaking data by hand.

Once we started stress testing the server, we hit one of the most common problems with Node’s asynchronous model: the inability to control the execution order of events. We have a simple document which includes a name, place, time, and date. The properties can be modified by multiple people, and if two people change the same property, the last update wins. But when two people change different properties (one the name and the other the place), there should not be any conflict.

To update a document in Couch, you must have its current version. Typically this means going to the database, grabbing the latest version, changing it, and saving it back. The problem is that this simple process breaks into two callbacks, one for fetching the latest document and another for saving the modified version. When two requests come in at the same time, Node will queue two database fetch requests (for the same document) and then process each in order. Both fetches return the same revision of the document, so once the first update is saved, the second is rejected because the revision it carries is no longer current.
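The race is easy to reproduce without a database. The following is a minimal sketch, assuming a hypothetical in-memory `RevisionStore` that only mimics the "a save must carry the current revision" rule; it is not the CouchDB or cradle API, and the `_rev` counter here is a plain integer chosen for illustration.

```javascript
// Hypothetical in-memory stand-in for a revision-checked document store.
// NOT the CouchDB or cradle API; it only mimics the stale-revision check.
class RevisionStore {
  constructor() {
    this.docs = new Map();
  }

  // Simulated asynchronous read: resolves on a later tick with a copy.
  async get(id) {
    await new Promise((resolve) => setImmediate(resolve));
    return { ...this.docs.get(id) };
  }

  // Simulated asynchronous write that rejects a stale revision.
  async save(doc) {
    await new Promise((resolve) => setImmediate(resolve));
    const current = this.docs.get(doc._id);
    if (current && current._rev !== doc._rev) {
      throw Object.assign(new Error('conflict'), { conflict: true });
    }
    const saved = { ...doc, _rev: (doc._rev || 0) + 1 };
    this.docs.set(doc._id, saved);
    return saved;
  }
}

// The naive fetch-modify-save cycle: two asynchronous steps.
async function updateProperty(store, id, key, value) {
  const doc = await store.get(id);
  doc[key] = value;
  return store.save(doc);
}

async function demo() {
  const store = new RevisionStore();
  store.docs.set('event1', { _id: 'event1', _rev: 1, name: 'Lunch', place: 'Cafe' });

  // Two requests arrive together and touch different properties.
  const results = await Promise.allSettled([
    updateProperty(store, 'event1', 'name', 'Dinner'),
    updateProperty(store, 'event1', 'place', 'Diner'),
  ]);
  return { results, final: store.docs.get('event1') };
}

demo().then(({ results, final }) => {
  console.log(results.map((r) => r.status), final);
});
```

In this sketch both reads see revision 1, so only one of the two saves is accepted and the change to the other property is lost, even though the two requests never touched the same field.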

I came up with a few ways to address this race condition (I’m sure there are more):

  1. Retry the update as many times as the number of simultaneous updates we expect to hit the same document. So if we expect 5 people to update the same list at the same time, we should retry at least 5 times.
  2. Use a local cache to store the latest version of recently modified documents and a queue of pending updates. When the first request comes in, the change is added to the queue and the latest document is requested from the database. When the latest document is obtained, the update is applied to both the cached copy and the database. When the second request comes in, the cache can be in one of two states: have the latest document in memory, or pending. If the document is available, it is updated and saved. If not, the second update gets added to the queue. When the document is received, both updates are applied at the same time.
  3. Perform lazy database updates, batching together multiple updates into a single commit. This is performed as soon as there are updates, but it applies all the accumulated updates after it gets the latest copy.
  4. Use another database with native support for partial updates.
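As an illustration of the first option, here is a rough sketch of a retry loop. The `store.get` / `store.save` methods, the `conflict` flag on the error, and the tiny in-memory store are all assumptions made up for this example, not a real driver API.

```javascript
// Hypothetical helper: repeat the fetch-modify-save cycle when the save
// reports a conflict. The store interface is an assumption for this sketch.
async function updateWithRetry(store, id, mutate, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const doc = await store.get(id); // always start from the latest revision
    mutate(doc);
    try {
      return await store.save(doc);
    } catch (err) {
      // Non-conflict errors are real errors; so is running out of attempts.
      if (!err.conflict || attempt === maxAttempts) throw err;
    }
  }
}

// Tiny single-document in-memory store, used only to exercise the helper.
function createStore(initial) {
  let current = { ...initial };
  const tick = () => new Promise((resolve) => setImmediate(resolve));
  return {
    async get() {
      await tick();
      return { ...current };
    },
    async save(doc) {
      await tick();
      if (doc._rev !== current._rev) {
        throw Object.assign(new Error('conflict'), { conflict: true });
      }
      current = { ...doc, _rev: doc._rev + 1 };
      return current;
    },
    peek: () => current,
  };
}

async function demoRetry() {
  const store = createStore({ _id: 'event1', _rev: 1, name: 'Lunch', place: 'Cafe' });
  await Promise.all([
    updateWithRetry(store, 'event1', (doc) => { doc.name = 'Dinner'; }),
    updateWithRetry(store, 'event1', (doc) => { doc.place = 'Diner'; }),
  ]);
  return store.peek();
}

demoRetry().then((final) => console.log(final));
```

Note how `maxAttempts` bakes a load expectation into the code, and how a genuine database error has to be told apart from an expected conflict inside the `catch`.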

#1 works but is butt-ugly. It hard-codes load expectations and makes it harder to identify actual database errors. #2 could work, but I just didn’t feel comfortable with this level of complexity for something as fundamental as database updates. #3 means requests that read the document between commits will get data that can be several generations old. So I went with #4 and switched to Mongo.
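To show what a partial update buys you, here is a small sketch. `$set` is a real MongoDB update operator; `buildPartialUpdate` and `applySet` are hypothetical helpers written for this example, and `applySet` only imitates the effect of `$set` on top-level fields.

```javascript
// Build an update document that touches only the changed fields.
// `$set` is a MongoDB operator; this helper name is hypothetical.
function buildPartialUpdate(changes) {
  if (Object.keys(changes).length === 0) {
    throw new Error('nothing to update');
  }
  return { $set: { ...changes } };
}

// With a driver, each request would send only its own change, for example:
//   collection.updateOne({ _id: id }, buildPartialUpdate({ place: 'Diner' }))
// (the exact method name depends on the driver version; treat it as an assumption).

// Simplified in-memory imitation of `$set` on top-level fields only.
function applySet(doc, update) {
  return { ...doc, ...update.$set };
}

let eventDoc = { _id: 'event1', name: 'Lunch', place: 'Cafe' };
eventDoc = applySet(eventDoc, buildPartialUpdate({ name: 'Dinner' }));
eventDoc = applySet(eventDoc, buildPartialUpdate({ place: 'Diner' }));
console.log(eventDoc);
```

Neither update needs to read the document first, so there is no stale revision to collide on, and changes to different properties both survive.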

The good news is that converting the application from Couch to Mongo took two days. I used the powerful native Mongo driver from Christian Amor Kvalheim. Overall the transition wasn’t bad, but it wasn’t trivial. The main difference is how the two databases store documents. Couch stores native JSON objects, while Mongo has its own internal format (BSON), which is similar but not identical to JSON. The biggest difference is Mongo’s support for a richer set of number types and other native types.

Coming from Couch, I got pretty spoiled by working exclusively with simple JSON documents. JSON-in-JSON-out. If you can make Couch work for you, this alone will make your life much simpler. With Mongo, you have to decide how to handle numbers, dates, identifiers, and other data types. To make the conversion faster, I wrote a small layer on top of the native driver to normalize my JSON objects on the way into and out of the database, converting ids to strings and all numbers to simple JavaScript numbers.
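The outbound half of such a layer might look roughly like the sketch below. This is not my actual code: `normalize`, `FakeId`, and `FakeLong` are hypothetical names, and the function is duck-typed on the assumption that id objects expose `toHexString()` and wide number objects expose `toNumber()`. Dates are passed through untouched here, since how to handle them is a separate decision.

```javascript
// Hypothetical normalizer for documents coming out of the database.
// Assumes ids expose toHexString() and wide numbers expose toNumber().
function normalize(value) {
  if (value === null || typeof value !== 'object') return value;
  if (value instanceof Date) return value; // left as-is in this sketch
  if (typeof value.toHexString === 'function') return value.toHexString();
  if (typeof value.toNumber === 'function') return value.toNumber();
  if (Array.isArray(value)) return value.map(normalize);

  const out = {};
  for (const [key, inner] of Object.entries(value)) {
    out[key] = normalize(inner);
  }
  return out;
}

// Stand-ins for driver types, defined only so the example runs on its own.
class FakeId {
  constructor(hex) { this.hex = hex; }
  toHexString() { return this.hex; }
}
class FakeLong {
  constructor(n) { this.n = n; }
  toNumber() { return this.n; }
}

const raw = {
  _id: new FakeId('4d5e6f'),
  count: new FakeLong(42),
  history: [{ by: new FakeId('abc123'), at: new FakeLong(7) }],
  name: 'Lunch',
};
const plain = normalize(raw);
console.log(JSON.stringify(plain));
```

The inbound direction (strings back to ids) needs to know which fields hold ids, which is why this sketch only covers reads.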

I also had to add some restrictions on key names not to start with ‘$’ or include periods which are forbidden in Mongo (for a reason). We have an API allowing client developers to store a bit of data on the server for their own use. With Couch, we could let them dump a JSON object as-is (after some security sanity checks), but with Mongo, we had to filter out the forbidden keys and had to move to a key-value store instead.
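A check for those key restrictions can be sketched in a few lines. `findForbiddenKeys` is a hypothetical helper written for this example; it only reports offending keys and leaves the decision to reject or rewrite them to the caller.

```javascript
// Hypothetical validator: walk a JSON-like value and list every key that
// starts with '$' or contains a period.
function findForbiddenKeys(value, path = []) {
  if (value === null || typeof value !== 'object') return [];
  const bad = [];
  for (const [key, inner] of Object.entries(value)) {
    const here = [...path, key];
    if (key.startsWith('$') || key.includes('.')) {
      bad.push(here.join(' > '));
    }
    // Recurse into nested objects and arrays (array indexes appear as keys).
    bad.push(...findForbiddenKeys(inner, here));
  }
  return bad;
}

const clientData = {
  ok: 1,
  $where: 'x',
  nested: { 'a.b': 2, list: [{ $gt: 1 }] },
};
const forbidden = findForbiddenKeys(clientData);
console.log(forbidden);
```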

On the other hand, the database itself became much simpler, removing the need to use Couch views (which meant one less place to manage code). I’m also loving the ability to manipulate arrays within a document with very specific instructions using Mongo’s powerful array update support.
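For a flavor of those array instructions, the sketch below builds three update documents using the `$addToSet`, `$push`, and `$pull` operators. The operators are MongoDB's; `applyArrayUpdate` is a hypothetical and heavily simplified imitation (plain values on top-level fields only) included just so the effect is visible without a database.

```javascript
// Update documents using MongoDB array operators.
const addGuest = { $addToSet: { guests: 'ana' } };  // add only if not already present
const addNote = { $push: { notes: 'bring cake' } }; // always append
const removeGuest = { $pull: { guests: 'bob' } };   // remove every matching element

// Hypothetical, simplified imitation for plain values on top-level fields.
function applyArrayUpdate(doc, update) {
  const out = { ...doc };
  for (const [field, value] of Object.entries(update.$push || {})) {
    out[field] = [...(out[field] || []), value];
  }
  for (const [field, value] of Object.entries(update.$addToSet || {})) {
    const list = out[field] || [];
    out[field] = list.includes(value) ? list : [...list, value];
  }
  for (const [field, value] of Object.entries(update.$pull || {})) {
    out[field] = (out[field] || []).filter((item) => item !== value);
  }
  return out;
}

const party = [addGuest, addNote, removeGuest].reduce(
  (doc, update) => applyArrayUpdate(doc, update),
  { guests: ['ana', 'bob'], notes: [] }
);
console.log(party);
```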

Overall, if you need partial document updates as a basic database functionality, Mongo is a better fit, especially in Node. If you don’t, Couch is simpler and much faster to learn and use. I really love Couch and I wish I could use it. Mongo is pretty awesome too.

One major complaint about Mongo is the lack of decent management tools. Couch comes with the built-in Futon application, which is a must-have for any new application development. If I had been limited to the Mongo command shell in the first few days, it would have taken me twice as long to get going. The available tools for Mongo are awful. They are buggy, they crash, they corrupt data, and when they work, they require you to install Python, Ruby, or PHP on your server, something I refuse to do on my clean Node box.

11 thoughts on “Node.js: From Couch to Mongo”

  1. To update a document, you do not necessarily need to know its revision number. Knowing the “_id” is enough.

    I agree views are not that obvious, especially when you expect / are used to SQL queries.

    When I looked at MongoDB vs CouchDB a few years ago, my choice was CouchDB because it was more focused on consistency (ACID & MVCC) than on performance, as MongoDB was. I don’t know if that has changed over time. I also like CouchDB’s replication model, and CouchApps are a nice idea (even if I’m not really good at JS).

    Futon is really a must have and it’s sad there is no equivalent for MongoDB

  2. One thing I’m not clear on about your sled development using node.js. Are you developing the whole server or just adding an app under another server such as Apache? (I’m guessing it’s the latter but I wanted to check.)

    • On Windows I used MongoExplorer, which is great for reading values but will corrupt non-simple types like long numbers or booleans. On Mac I used MongoHub, which is pretty good when it works, but I never got it to stay running for more than one document update. I tried to install the various Django, PHP, and Ruby tools, but given that my server had not been configured for any of them, getting them to work (with all the required broken package managers) was just too painful.

  3. What do you think about Redis? I’m sort of a nosql noob, so the question isn’t rhetorical. What issues would you run into if you were to use Redis? Also, any tips for building a data model in nosql?
