Migrating to CouchDB 2.0

Datetime: 2016-08-23 00:49:59          Topic: CouchDB, Database

This is the eighth in a series of blog posts introducing the Apache CouchDB 2.0 release. Read parts one, two, three, four, five, six, and seven in the series.

Maybe you’ve tested one of the release candidates, or RCs, (the latest at the moment is RC4) for CouchDB 2.0 already, got excited about clustering, Mango or Fauxton – and you’re wondering how to migrate your application to actually use it. If yes, then this blog post is for you: it outlines how to migrate both application logic and actual data.

A few words of caution at the beginning: First off, release candidates can still contain surprises (even more than releases) and make lots of sense for pre-production testing, while being potentially dangerous in actual production. Secondly, nothing can replace thorough testing of functional and load scenarios before rolling out into production. Both this post and the software release have been (and are still being) created with care. Still, your application is special and might stumble over edge cases not thought of yet.

Migrating your application

There’s good news concerning the API: it’s 99% similar to CouchDB 1.x, so much of your logic has a good chance of working without modification. The HTTP endpoint at 5984 (or 15984 for running via dev/run or Docker ‘-dev’) hides the cluster in the background. The single-node HTTP endpoint for the first node is 5986 (or 15986 for running via dev/run or Docker ‘-dev’). Though not intended for normal use, this option can be useful for some cases (see below for one).

Still, there are some differences in the API (and in behavior), mostly due to clustering. Most importantly, the update seq, or sequence, (used for ?since in _changes et al.) is no longer a number but a string. If you just pass the value along without interpreting it, nothing changes for you. Still, you might need to make adjustments (e.g., for the data type you store it in).
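
To illustrate the pass-it-along approach, here is a minimal sketch in shell. The helper names extract_last_seq and changes_url are made up for this post, and the sed expression assumes last_seq sits on one line of the response. It is an illustration under those assumptions, not a robust JSON parser.

```shell
# Hypothetical helpers: treat the update sequence as an opaque token.

# Reads a _changes response on stdin and prints last_seq verbatim,
# whether it is a 1.x number or a 2.0 string.
extract_last_seq() {
  sed -n 's/.*"last_seq"[[:space:]]*:[[:space:]]*"\{0,1\}\([^",}]*\).*/\1/p'
}

# $1 = database URL, $2 = sequence token from an earlier call (optional).
# Assumes the token only contains URL-safe characters; if in doubt, let
# curl encode it: curl -G "$1/_changes" --data-urlencode "since=$2"
changes_url() {
  if [ -n "$2" ]; then
    printf '%s/_changes?since=%s\n' "$1" "$2"
  else
    printf '%s/_changes\n' "$1"
  fi
}
```

A polling loop would then store whatever extract_last_seq prints as text and hand it back to changes_url unchanged, without ever doing arithmetic or numeric comparisons on it.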

Another change: the _changes feed is no longer guaranteed to be strictly sorted. As it is now assembled from several shards in a cluster, and there can be small differences between those at any time, changes made earlier can appear later in the _changes returned, or changes from before the ?since you pass might be returned again. So, it makes sense to double-check that your application is comfortable with getting the same change (i.e., same document and revision) from not one but two (or more) calls to _changes (= treat changes idempotently). One idea to mitigate: compare _rev, the doc’s revision, in your transient version with the change coming in. It helps that the first part of _rev is still a number counting up.
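
The _rev comparison could look roughly like the following sketch. The function names rev_gen and should_apply are invented for illustration, and the rule (skip identical revisions and revisions with a lower number) is a simple heuristic of this sketch, not CouchDB's own conflict handling.

```shell
# Hypothetical helpers for treating changes idempotently.

# Prints the numeric prefix of a revision such as "3-a1b2c3" (here: 3).
rev_gen() {
  printf '%s\n' "${1%%-*}"
}

# $1 = revision the application already holds (empty if the doc is unknown)
# $2 = revision reported by _changes
# Succeeds when the incoming change looks new; fails when it is a repeat
# or older than what we already have.
should_apply() {
  [ -z "$1" ] && return 0
  [ "$1" = "$2" ] && return 1
  [ "$(rev_gen "$2")" -ge "$(rev_gen "$1")" ]
}
```

An incoming revision with the same number but a different hash passes this check on purpose: it usually hints at a conflict, so the application probably wants to fetch the document and look at it.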

As we’re clustered now, whether potentially or actually, ordering might also come up in places where you (unintentionally) relied on natural order in your application. The author’s testing efforts did bring up such issues. However, natural order is bad (or say, surprising where you least need it), and that’s in no way specific to the new CouchDB version. So, the best time to get rid of it is NOW. By the way, the same applies to temporary views (the _temp_view endpoint), which are no longer available.

When your application is comfortable with ordering specifics, you’ve already covered a major part of what’s new. Depending on your application, you’ll also have to look at these topics, some of which might be not that common yet:

  • Database names for source and target in replication (both via _replicate and via a doc in the replicator database) need to contain the full URL, not just the name of a database. When your application actively triggers replications, you’ll have to change this.
  • If you relied on assets (like .js files) served from /_utils: these are now gone, so you’ll have to package them with your application and change the paths referencing them.
  • The all_or_nothing parameter in _bulk_docs is not implemented in 2.0 at the moment. It was an indication (and not a guarantee); still, it’s not available and your application will have to go without it.
  • When your application changes _config of the CouchDB server: /_config/ is not available on the cluster, but there is /_node/<fqdn>/_config/ for your setup needs. Make sure you do it on all nodes and handle possible inconsistencies when it fails on one or several of them.
  • When your application creates and deletes databases in quick succession, propagating the deletion across the cluster might take a moment, so make sure to use unique names to avoid conflicts or errors. The /_uuids endpoint can help you with it.
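
For the per-node configuration point in the list above, a sketch along these lines might help. The names node_config_url, set_config_on_nodes and the CURL override variable are made up for this example, and the node names and the httpd/bind_address setting used in the comment are placeholders; use whatever your cluster reports.

```shell
# Hypothetical helpers: apply one config value to every node and report
# the nodes where it failed.

# $1 = base URL, $2 = node name, $3 = config section, $4 = config key
node_config_url() {
  printf '%s/_node/%s/_config/%s/%s\n' "$1" "$2" "$3" "$4"
}

# $1 = base URL, $2 = section, $3 = key, $4 = value, rest = node names.
# The value is sent as a JSON string, hence the extra quotes; values that
# themselves contain quotes are not handled in this sketch.
set_config_on_nodes() {
  base=$1 section=$2 key=$3 value=$4
  shift 4
  failed=""
  for node in "$@"; do
    url=$(node_config_url "$base" "$node" "$section" "$key")
    # CURL can point to a stand-in command for dry runs (assumption of this sketch).
    if ! ${CURL:-curl} -fsS -X PUT "$url" -d "\"$value\"" >/dev/null; then
      failed="${failed:+$failed }$node"
    fi
  done
  if [ -n "$failed" ]; then
    echo "failed on: $failed"
    return 1
  fi
}

# Example call (placeholder node names):
# set_config_on_nodes http://machine2:5984 httpd bind_address 0.0.0.0 node1@machine2 node2@machine3
```

Collecting the failed nodes instead of stopping at the first error leaves you with a list to retry, which is one way to deal with the inconsistencies mentioned above.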

Migrating your data

When you have your application ready, it’s time to migrate data. As you’re working with CouchDB, you have a world class tool right at your fingertips to help with that: Replication!

Whichever way you choose (two are outlined below), you’ll have to prepare sufficient free storage capacity. This is mainly because you have the existing (unclustered) and the new (clustered) database in parallel for a short period of time. Keep in mind that the cluster’s shards are copied several times (outlined in more detail in the CouchDB 2.0 Architecture post), so twice the currently used storage won’t be enough.
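
As a rough back-of-the-envelope sketch (the function name required_storage is made up, and the figure of three copies per shard is an assumption you should check against your cluster settings), the arithmetic looks like this. It ignores view indexes and any size difference after compaction.

```shell
# Hypothetical helper: storage needed while old and new database coexist.
# $1 = size of the existing 1.x database (any unit)
# $2 = number of copies the cluster keeps of each shard
required_storage() {
  echo $(( $1 + $1 * $2 ))
}

required_storage 10 3   # prints 40: 10 for the old database plus 3 x 10 in the cluster
```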

The easiest way to get started (and to test with actual data) is replicating your existing database from a running CouchDB 1.x to a new CouchDB 2.x. Assuming you have a 1.x database mydb on machine1 and a CouchDB 2.0 on machine2, you might try this:

curl -X PUT 'http://machine2:5984/mydb' # create a clustered new mydb on CouchDB 2.0
curl -X POST 'http://machine1:5984/_replicate' -H 'Content-type: application/json' -d '{"source": "mydb", "target": "http://machine2:5984/mydb"}' # replicate data
curl -X GET 'http://machine2:5984/mydb/_design/<somedoc>/_view/<someview>?stale=update_after' # trigger re-build index(es) of somedoc with someview; do for all to speed up first use of application

And in fact you’re done! ;-)

You can now use http://machine2:5984/mydb as endpoint for your application and give it a shot. As there’s now a different way of storing the data (the clustered way) – along with a replication to move data from one way to the other, there is (to the best of the author’s knowledge) always a switch of databases involved. Due to (repeated) replication, its impact can be reduced to a minimum. Yet there is an impact.
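
One way to picture the repeated replication is the sketch below. The helper names replication_body and replicate are made up for this post, and the cutover steps are left as comments because how you pause writes is specific to your application.

```shell
# Hypothetical helpers for a switch with repeated replication.

# $1 = source, $2 = target; prints the JSON body for POST /_replicate.
replication_body() {
  printf '{"source": "%s", "target": "%s"}\n' "$1" "$2"
}

# $1 = CouchDB that runs the replication, $2 = source, $3 = target
replicate() {
  curl -fsS -X POST "$1/_replicate" -H 'Content-Type: application/json' \
    -d "$(replication_body "$2" "$3")"
}

# Outline of the switch (uncomment and adapt):
# replicate http://machine1:5984 mydb http://machine2:5984/mydb  # bulk copy, application still on 1.x
# ...stop writes to the 1.x database...
# replicate http://machine1:5984 mydb http://machine2:5984/mydb  # short catch-up run
# ...point the application at http://machine2:5984/mydb...
```

As replication only transfers what the target does not have yet, the second run should be short, which keeps the window without writes small.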

Alternatively, if you don’t want to run both in parallel (and have even more storage capacity), you can copy the database files (those residing in the directory specified in /_config/couchdb/database_dir and ending with .couch) to the data directory of one of the nodes of the cluster (e.g., lib/node1/data). The database will then appear on the node’s local port (normally 5986 instead of 5984; other ports depend on your setup). You can fully use the database on the local port, but it’s not clustered yet. In order to get it clustered, you can use replication just like you did before:

curl -X PUT 'http://machine2:5984/mydb' # create a clustered new mydb on CouchDB 2.0
curl -X POST 'http://machine2:5984/_replicate' -H 'Content-type: application/json' -d '{"source": "http://machine2:5986/mydb", "target": "http://machine2:5984/mydb"}' # replicate data (local 2 cluster)
curl -X GET 'http://machine2:5984/mydb/_design/<somedoc>/_view/<someview>?stale=update_after' # trigger re-build index(es) of somedoc with someview; do for all to speed up first use of application

You can also move the files instead of copying them, but be triple-careful with that! As soon as there is any write (including those during replication), you will never be able to go back to using those files with 1.x. So, go with copies, plus have backups in place!

Finally, you can also replicate back to 1.6 when you need this for some reason.

As soon as you have the clustered database ready (and your application tweaked for the aforementioned changes), you should be ready to go!

Sebastian Rothbucher ( @sebredhh on Twitter) is a Web developer and CouchDB contributor. Coming from VB/Java, he now enjoys the JavaScript side of life (and spends some rainy Hamburg Saturdays hacking productivity tools and Web apps using CouchDB).

You can download the latest release candidate from http://couchdb.apache.org/release-candidate/2.0/ . Files with -RC in their name are special release candidate tags, and the files with the git hash in their name are builds off of every commit to CouchDB master.

We are inviting the community to thoroughly test their applications with CouchDB 2.0 release candidates. See the testing and setup instructions for more details.
