I recently ran into a situation where both of our Couchbase servers went down because of major outages at our hosts’ data centers. One server eventually came back up, but its state was “pending” and our app could not connect to it. We had enabled replication, but when we went to click the “Fail Over” button on the bad node, the scary data loss warning frightened us away from attempting it. Eventually, the second server came back on its own and the state of both Couchbase nodes changed to “up”.

This exercise is a test to see just how easy it is to recover from a single-node failure and an all-node failure (assuming the nodes’ hard drives are still intact).

While the Couchbase documentation does explain all of this, I found this experiment most helpful to properly understand exactly what happens when nodes go down.

Set up a test two-node Couchbase environment

If you are using CentOS 6 or RedHat these steps should work. Otherwise just follow the instructions on couchbase.com.

sudo yum update -y
sudo wget http://packages.couchbase.com/releases/2.2.0/couchbase-server-community_2.2.0_x86_64.rpm
sudo rpm --install couchbase-server-community_2.2.0_x86_64.rpm 

Make sure the server’s firewall has these TCP ports open:

11209-11211, 4369, 8091-8092, 21100-21299
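
On CentOS 6, one way to open these ports is with iptables. The sketch below only prints the commands so you can review them first. The function name couchbase_firewall_rules is made up for this example, and inserting the rule at the top of the INPUT chain is an assumption about how your firewall is laid out.

```shell
#!/bin/sh
# Prints (does not run) iptables commands that open the Couchbase TCP ports
# listed above. Review the output, then run it as root if it suits your setup.
# iptables multiport writes ranges as "first:last".
COUCHBASE_TCP_PORTS="11209:11211,4369,8091:8092,21100:21299"

couchbase_firewall_rules() {
    echo "iptables -I INPUT -p tcp -m multiport --dports $COUCHBASE_TCP_PORTS -j ACCEPT"
    # On CentOS 6 this saves the running rules so they survive a reboot.
    echo "service iptables save"
}

couchbase_firewall_rules
```

If the printed commands look right, run them as root on both nodes.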

Once Couchbase is installed, you can access the Couchbase admin console from your browser:

http://your-couchbase-server-1:8091

Set up Couchbase

Since this is the first node we will start a new cluster:
Couchbase create new cluster

Default settings are fine for our test.
Create default Couchbase bucket

Select the beer-sample bucket so we can have some data to check when the nodes recover. You can use your own bucket too, just make sure replication is enabled.
Import sample bucket

We don’t care about Couchbase notifications for our test servers.
Ignore Couchbase notifications

Set up a Couchbase administrator account.
Setup an admin user

First node setup is complete:
First Couchbase node is setup

Now we need to set up the second node.

Repeat the steps above to install Couchbase.

Once Couchbase is installed on the second server, visit that server’s administration console in your browser.

http://your-couchbase-server-2:8091

This time we will be joining an existing cluster. Enter the IP address of the first node and the administrator username and password you set during the setup of the first node.
Join an existing Couchbase cluster

The server should now be associated with your Couchbase cluster.
Server added to cluster

In order to actually use the new node with your cluster, the cluster needs to be rebalanced. Click “Server Nodes” from the top nav and then click the “Pending Rebalance” tab. Then click “Rebalance” to the right.
Rebalance the Couchbase cluster

Wait for the nodes to rebalance before proceeding.
Rebalancing Couchbase nodes

When rebalancing is complete your nodes should look similar to this:
Couchbase nodes are rebalanced and active.
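
If you would rather script this than click through the console, the same add and rebalance steps can probably be done with couchbase-cli. This is a rough sketch, not something I ran as part of this test. It assumes the CLI lives at /opt/couchbase/bin/couchbase-cli (the default for the RPM) and that the second node is added from the first node with server-add instead of joining through its own setup wizard. The hosts, credentials and the add_node_commands function name are placeholders. It only prints the commands.

```shell
#!/bin/sh
# Prints couchbase-cli commands that add a second node and rebalance.
# Nothing is executed; copy the output once the values look right.
# Assumptions: default install path, placeholder hosts and credentials.
CLI=/opt/couchbase/bin/couchbase-cli

add_node_commands() {
    cluster=$1; new_node=$2; user=$3; pass=$4
    # Register the new node with the existing cluster.
    echo "$CLI server-add -c $cluster -u $user -p $pass --server-add=$new_node --server-add-username=$user --server-add-password=$pass"
    # Spread active and replica data across both nodes.
    echo "$CLI rebalance -c $cluster -u $user -p $pass"
}

add_node_commands your-couchbase-server-1:8091 your-couchbase-server-2:8091 Administrator password
```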

Now it’s time to fail some nodes.

Single-node failure

First have a look at the buckets in your cluster. Note the number of items in the beer-sample bucket. You should see 7303 items (unless the sample bucket has changed since this post).
Couchbase cluster buckets

The item count is an easy way to see how much data is potentially available.
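
The same number can also be read from the REST API, which is handy if you want to watch it from a terminal while nodes go up and down. A minimal sketch, assuming the bucket details endpoint /pools/default/buckets/beer-sample returns an itemCount field under basicStats (as I understand it does on Couchbase 2.x). The item_count function name, host and credentials are placeholders.

```shell
#!/bin/sh
# Reads bucket JSON on stdin and prints the first "itemCount" value found.
# Assumption: the JSON comes from GET /pools/default/buckets/<bucket>.
item_count() {
    grep -o '"itemCount":[ ]*[0-9]*' | head -n 1 | tr -cd '0-9\n'
}

# Example (hypothetical host and credentials):
#   curl -s -u Administrator:password \
#     http://your-couchbase-server-1:8091/pools/default/buckets/beer-sample | item_count
```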

Ok, now it’s time to kill a node. Choose one of your Couchbase nodes (it doesn’t matter which one) and either shut it down or just stop the Couchbase server service.

sudo service couchbase-server stop

If you were viewing the “failed” node’s web administration console, you will be disconnected and should log in to the other node’s web console.

You should see one node up and one down.
Single Couchbase node failure
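
You can check the same thing without the console. The sketch below counts how many nodes the cluster considers healthy. It assumes the JSON comes from GET /pools/default on the surviving node, and that each entry in its nodes list carries a status field such as "healthy" or "unhealthy". The healthy_nodes function name, host and credentials are placeholders.

```shell
#!/bin/sh
# Reads cluster JSON on stdin and prints how many nodes report "healthy".
# Assumption: the JSON comes from GET /pools/default, where each entry in
# "nodes" has a "status" field such as "healthy" or "unhealthy".
healthy_nodes() {
    grep -o '"status":[ ]*"healthy"' | wc -l | tr -d ' '
}

# Example (hypothetical host and credentials):
#   curl -s -u Administrator:password \
#     http://your-couchbase-server-1:8091/pools/default | healthy_nodes
```

With one node stopped, this should print 1 instead of 2.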

Now have a look at your buckets. Note that the item count is now reduced by 50%. The data is still safe because it was replicated and evenly distributed across both nodes. We are seeing a reduced item count because half of the active data lived on the node that is now down, and its replica copy on the surviving node has not been made active yet.
Buckets state with one node down.

To get back access to all of our data we need to make the replica data (on our remaining node) active. This is actually really easy. Just click “Fail Over” on the down node.

You will be presented with the very scary data loss warning. I’m sure in some circumstances you will lose data but not with this simple scenario.
Confirm failover

The “down” server will be added to the “pending rebalance” tab. If you rebalance now, any data not replicated across the cluster on the “down” server will be lost. If the “down” server comes back online while it is pending rebalance you will be prompted to add the server back. If you did rebalance, the server will have to be reconfigured manually to join the cluster again.

Have a look at your buckets now. Item count should be 7303 again and it should look the same as before, except you now only have 1 node.
Cluster up with 1 node

Your Couchbase cluster should now be working (but slower and without replication).

Restart the “down” node so we can do the next test.
Couchbase should automatically detect that the previously “down” server is back and it will prompt you to add it.
Add node back

Add the node back and rebalance. Once complete your cluster should be up and running with 2 nodes.
Couchbase cluster working

Two-node failure

This is the actual situation we found ourselves in last week. Both of our nodes went down at the same time. To replicate this, stop the Couchbase service on both nodes.

Node 1:

sudo service couchbase-server stop

Node 2:

sudo service couchbase-server stop

Now start the Couchbase service on one of the nodes.

sudo service couchbase-server start
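
The web console can take a little while to answer after the service starts. If you are scripting this test, a small retry helper saves some guesswork. This is a generic sketch: wait_until is a made-up helper, and the curl check in the usage comment only confirms that something answers on port 8091, not that the node is healthy.

```shell
#!/bin/sh
# Retries a command until it succeeds or the attempts run out.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt failed.
wait_until() {
    attempts=$1
    delay=$2
    shift 2
    while [ "$attempts" -gt 0 ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        attempts=$((attempts - 1))
        # Only sleep if another attempt is coming.
        if [ "$attempts" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example: try for up to a minute (30 attempts, 2 seconds apart).
#   wait_until 30 2 curl -s -o /dev/null http://your-couchbase-server-1:8091/
```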

Log in to the web administration console for the running node. You should see something like this:
Couchbase cluster pending and down

Now look at the buckets. Yikes! Item count is 0 on beer-sample.
Cluster down, bucket item count 0

To resolve this, it’s actually the same procedure as a single node failure. The only difference is that this time no nodes are up which means none of the Couchbase data is in an active state.
Click “Fail over” on the “down” node and confirm the fail over.

The node that was “pending” should now be “up”.

Couchbase, up down

Have a look at the buckets which should show 7303 items.
All items available in bucket

The cluster should now be running, just slower and without replication since we only have 1 node.

Now restart the Couchbase service on the “down” node.

sudo service couchbase-server start

Add it back to the cluster and rebalance.
Add node

Your cluster should now be fully restored.
Couchbase fully restored
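
Since the recovery is the same in both scenarios (fail over the down node, bring it back, re-add it, rebalance), it can probably be scripted with couchbase-cli as well. I did all of the above through the web console, so treat this as an untested sketch. It assumes the default CLI path /opt/couchbase/bin/couchbase-cli and the failover, server-readd and rebalance commands from the 2.x CLI. The hosts, credentials and the recovery_commands function name are placeholders, and it only prints the commands.

```shell
#!/bin/sh
# Prints the couchbase-cli commands for the recovery sequence used above.
# Nothing is executed. Run step 1 first, start the down node, then steps 2-3.
# Assumptions: default install path, placeholder hosts and credentials.
CLI=/opt/couchbase/bin/couchbase-cli

recovery_commands() {
    cluster=$1; down_node=$2; user=$3; pass=$4
    # 1. Fail over the down node so the replica data on the survivor becomes active.
    echo "$CLI failover -c $cluster -u $user -p $pass --server-failover=$down_node"
    # 2. Once the down node is running again, add it back to the cluster.
    echo "$CLI server-readd -c $cluster -u $user -p $pass --server-add=$down_node"
    # 3. Rebalance so both nodes hold active and replica data again.
    echo "$CLI rebalance -c $cluster -u $user -p $pass"
}

recovery_commands your-couchbase-server-1:8091 your-couchbase-server-2:8091 Administrator password
```

The same warning applies as in the console: a failover throws away anything on the down node that was never replicated.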

  1. Download ElasticSearch
    wget http://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.20.6.tar.gz
  2. Extract archive
    tar -xzf elasticsearch-0.20.6.tar.gz
  3. Move extracted folder to /opt/elasticsearch
    mv elasticsearch-0.20.6 /opt/elasticsearch

  4. Set permissions
    chown -R root:root /opt/elasticsearch
  5. Change to elasticsearch directory
    cd /opt/elasticsearch

  6. Install Web GUI plugin
    bin/plugin -install mobz/elasticsearch-head
  7. Install Couchbase Transport plugin
    bin/plugin -install transport-couchbase -url http://packages.couchbase.com.s3.amazonaws.com/releases/elastic-search-adapter/1.0.0/elasticsearch-transport-couchbase-1.0.0.zip
  8. Set up a username and password for Couchbase Replication to connect to your ElasticSearch server. Change “abc123” to your desired password.
    echo "couchbase.password: abc123" >> config/elasticsearch.yml
    echo "couchbase.username: admin" >> config/elasticsearch.yml
  9. Edit the ElasticSearch configuration file (config/elasticsearch.yml) and set the following parameters
    cluster.name: NameOfYourCluster
    network.host: local ip address of this node
    node.name: "name of this node"
  10. Download a script that will allow you to run ElasticSearch as a service
    curl -L http://github.com/elasticsearch/elasticsearch-servicewrapper/tarball/master | tar -xz
  11. We only need the one script so move it over
    mv *servicewrapper*/service bin/
  12. Cleanup
    rm -Rf *servicewrapper*
  13. Install ElasticSearch as service with the new script.
    bin/service/elasticsearch install
  14. Create a symbolic link
    ln -s `readlink -f bin/service/elasticsearch` /usr/local/bin/rcelasticsearch
  15. Start the service
    service elasticsearch start
  16. Make ElasticSearch start on boot
    chkconfig elasticsearch on
  17. Set the default template for Couchbase Transport
    curl -XPUT http://localhost:9200/_template/couchbase -d @plugins/transport-couchbase/couchbase_template.json
  18. That’s it. Your ElasticSearch server is now ready to be set up as a replication endpoint for Couchbase. For instructions on how to set up the replication on your Couchbase server visit: http://blog.couchbase.com/couchbase-and-full-text-search-couchbase-transport-elastic-search
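
Before pointing Couchbase at the new server, it is worth checking that ElasticSearch is actually up. A minimal sketch, assuming the cluster health endpoint /_cluster/health on port 9200 returns JSON with a status field of green, yellow or red. The es_status function name is made up for this example.

```shell
#!/bin/sh
# Reads cluster health JSON on stdin and prints the "status" value.
# Assumption: the JSON comes from GET /_cluster/health.
es_status() {
    sed -n 's/.*"status"[ ]*:[ ]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# Example:
#   curl -s http://localhost:9200/_cluster/health | es_status
```

On a single-node setup, expect yellow rather than green once any index has replicas configured, because a replica cannot be allocated on the same node as its primary.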