RabbitMQ cluster - Avoid loss of messages

You likely want to avoid loss of messages when using RabbitMQ. This post is about clusters, but I will begin by discussing how this is done outside of a clustered setup.

Making sure messages are not lost

RabbitMQ has at-least-once delivery meaning your message will be delivered, but can be delivered multiple times. In order to make sure that the broker has received a message you need to enable publish confirms. The same applies to consumers consuming messages. Without this the messages can be lost in transit.

Every server or application can go down - this also applies to RabbitMQ. This is where persistent messages and durable queues help us not lose messages. When RabbitMQ comes back online all of your messages will still be there - this requires you to have both persistent messages and durable queues. Of course if the disk is fried and it is a single HDD the messages are lost - however this is outside the control of RabbitMQ.

In a clustered setup

Everything is all good with one node, except for one thing: you do not have high availability! You are relying on a single RabbitMQ node, should it go down nothing is working and you better get that up and running as fast as possible. This is where RabbitMQ introduces clustering and federations. Here we will focus on clustering. A RabbitMQ cluster is several nodes getting together to form a single logical broker. This is a great upside because you can connect to any of the RabbitMQ nodes in the cluster. All exchanges and bindings are on all nodes and queues can be mirrored so that they are on more than one node. Meaning if one node goes down another can take over - you have a failover as active redundancy and replication.

This is all good as we now have some protection when a node goes down. However there is a huge pitfall with clusters. According to RabbitMQ they should reside on LAN (Local Area Network) and never on WAN (Wide Area Network). The people behind RabbitMQ states that
"If you are thinking of clustering across a WAN, don't.". The problem is that in the event of a network partition, both sides of the partition keeps on running. Living out two seperate lives. There will be master queues on both sides as mirrors are promoted. Even worse when the network partition is resolved this is not automatically fixed. So the two sides keep on running - creating their own state.

Recovering from a network partition

In order to fix a network partition you must do the following:

Select the node/partition with the state that you want. The other nodes' state will be lost.
- If possible you can drain all the messages that are stuck on "bad" nodes by moving all publishers to the node/partition you wish to continue working with. If any messages remain use the shovel plugin.
Close down all other nodes - again their state will be lost.
Start them up and they will join the other node/partition - picking up that state instead of what they initially had. Which means that the network partition is now "fixed".
Restart the node/partition which had the state you wanted (Clears a partition warning).

If not apparent, I wish to state that the above may cause loss of messages!

RabbitMQ also features Autoheal. But this is basically saying you want high availability over data integrity (which you may of course want). In this mode RabbitMQ may throw away messages (just as it may when you manually rejoin).

To my knowledge RabbitMQ clusters should be hosted over LAN unless you want to deal with the above. Which is a question of data integrity. You may have a stable network but I do not feel that I can promote creating clusters over WAN when the creators of RabbitMQ say "Don't!" :)

Feel free to let me know your thoughts in the comments!