Clustered RabbitMQ on Kubernetes
Naming your rabbits
Deploying a RabbitMQ cluster inside Kubernetes poses a series of interesting problems. The very first one is what kind of names we should use so that the rabbits can see each other. Here are some examples of allowable names in different forms:
- rabbit@hostname
- rabbit@hostname.domainname
- rabbit@172.17.0.4
The Erlang distribution (which is actively used by RabbitMQ) can run in one of two naming modes: short names or long names. The rule of thumb is that it's "long" when it contains a period (.), and "short" otherwise. For the name examples above that means that the first one is a short name, and the second and the third are long names.
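For example, with the standard RabbitMQ startup scripts the mode is chosen via environment variables (the hostnames here are placeholders):

# Short name - no dots, resolvable via /etc/hosts or similar:
$ RABBITMQ_NODENAME=rabbit@rabbit-1 rabbitmq-server
# Long name - contains dots, so long-name mode must be switched on:
$ RABBITMQ_USE_LONGNAME=true \
  RABBITMQ_NODENAME=rabbit@rabbit-1.demo.svc.cluster.local \
  rabbitmq-server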
Looking at all this we see that on Kubernetes we have the following options for naming nodes:
- Use Kubernetes PetSets (later renamed StatefulSets) so that we have stable DNS names (see the DNS sketch after this list). In contrast to the normal "disposable" replicas, which can easily be dropped if they become unhealthy, a PetSet is a group of stateful pods with a stronger notion of identity.
- Use IP addresses and some sort of automatic peer discovery (such as the autocluster plugin, which automatically discovers and clusters RabbitMQ nodes).
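To give an idea of the first option: a minimal sketch, assuming a StatefulSet named rabbitmq backed by a headless service of the same name in the demo namespace; each pod then gets a stable, ordinal DNS name that works as an Erlang long name:

# Stable per-pod DNS names provided by the headless service:
#   rabbitmq-0.rabbitmq.demo.svc.cluster.local
#   rabbitmq-1.rabbitmq.demo.svc.cluster.local
# ...which map onto node names such as:
#   rabbit@rabbitmq-0.rabbitmq.demo.svc.cluster.local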
Erlang cookie
The second prerequisite for successful clustering is that all RabbitMQ nodes need to share a secret cookie. By default, RabbitMQ reads this cookie from a file (and generates that file if it's missing). Our options for making sure this cookie is the same on all nodes are as follows:
- Create the cookie file during Docker image creation. This practice isn't recommended, because knowing this cookie gives you full access to all RabbitMQ internals.
- Create the cookie file in an entrypoint script, with the secret value passed as an environment variable (see the sketch after this list). If we also need an entrypoint script for other reasons, this is as good as the next option.
- Pass the cookie to the Erlang VM through additional arguments in environment variables:
  RABBITMQ_CTL_ERL_ARGS="-setcookie <our-cookie>"
  RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="-setcookie <our-cookie>"
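As an illustration of the second option, here is a minimal entrypoint sketch; the RABBITMQ_ERLANG_COOKIE variable name is an assumption, not a requirement:

#!/bin/sh
# Hypothetical entrypoint: materialize the cookie passed via an
# environment variable into the file the Erlang VM expects.
set -e
COOKIE_FILE=/var/lib/rabbitmq/.erlang.cookie
if [ -n "$RABBITMQ_ERLANG_COOKIE" ]; then
    echo "$RABBITMQ_ERLANG_COOKIE" > "$COOKIE_FILE"
    chmod 600 "$COOKIE_FILE"   # Erlang rejects cookies readable by others
fi
exec rabbitmq-server "$@"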
Clustering gotchas
Another important thing to know about RabbitMQ clusters on Kubernetes is that when a node joins a cluster, its data is lost, no matter what. This doesn't matter in the most common case - when an empty node joins the cluster - as there is nothing to lose. But if we had two nodes that operated independently for some time and accumulated data, there is no way to join them without losses (note that restoring a cluster after a network split or node outage is just a special case of the same thing, also with data loss). For a specific workload you can invent workarounds, such as draining (manually or automatically) the nodes that are bound to be reset. But there is just no way to make such a solution robust, automatic and universal.

So our choice of automatic clustering solution is heavily influenced by what kinds of data losses we can tolerate for our concrete workloads.
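To make this concrete, here is what a manual join looks like with rabbitmqctl (the peer node name is an example); the reset step is mandatory and wipes everything the joining node had:

# Run on the node that is joining the cluster:
$ rabbitmqctl stop_app
$ rabbitmqctl reset                            # local data is gone here
$ rabbitmqctl join_cluster rabbit@172.17.0.4
$ rabbitmqctl start_app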
Cluster formation
Assuming that you've solved all the naming-related problems and can cluster your rabbits manually using rabbitmqctl, it's time to make cluster assembly automatic. But as we saw in the previous section, there is no one-size-fits-all solution to that problem.

One very specific solution is suitable only for workloads where we don't care if some data is lost along with a server that went down or got disconnected. An example of this kind of workload is RPC, where clients just retry their (preferably idempotent) requests after any error or timeout, because by the time a server recovers from a failure, the RPC request will be stale and no longer relevant anyway. Fortunately, this is exactly what happens with RPC calls performed by various components of OpenStack.
Bearing all of the above in mind, we can start designing our solution. The more stateless the better, so using IP addresses is preferable to using PetSets, and the autocluster plugin is our obvious candidate for forming a cluster from a bunch of dynamic, disposable nodes.
Going through the autocluster documentation, we end up setting the following configuration options (a combined config sketch follows the list):
- {backend, etcd}: This is an almost arbitrary choice; consul or k8s would've worked just as well. The only reason for choosing it was that it's easier to test. You can download the etcd binary, run it without any parameters, and just start forming clusters on localhost.
- {autocluster_failure, stop}: A pod that failed to join the cluster is useless for us, so it should bail out and hope that the next restart will happen in a more friendly environment.
- {cluster_cleanup, true}, {cleanup_interval, 30}, {cleanup_warn_only, false}, {etcd_ttl, 15}: The node is registered in etcd only after it has successfully joined the cluster and fully started up. This registration has a TTL that is constantly refreshed while the node is alive. If the node dies (or fails to update the TTL in any other way), it's forcefully kicked from the cluster. So even if the failed node restarts with the same IP, it will be able to join the cluster afresh.
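Put together, a sketch of the relevant part of /etc/rabbitmq/rabbitmq.config; the etcd_host and etcd_port keys are assumptions for connecting to the etcd service described below, and should be checked against the autocluster documentation for your plugin version:

$ cat > /etc/rabbitmq/rabbitmq.config <<'EOF'
[
 {autocluster, [
   {backend, etcd},
   {autocluster_failure, stop},
   {cluster_cleanup, true},
   {cleanup_interval, 30},
   {cleanup_warn_only, false},
   {etcd_ttl, 15},
   {etcd_host, "etcd"},   % assumed: name of the etcd service
   {etcd_port, 4001}      % assumed: port the etcd example below exposes
 ]}
].
EOF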
Unexpected races
If you've tried assembling a cluster several times with the config above, you may have noticed that sometimes it assembles several different unconnected clusters. This happens because the only protection against startup races in autocluster is a random delay during node startup - and with sufficiently bad timing, every node can decide that it is the first node (i.e. there are no other records in etcd) and just start in unclustered mode.

This problem led to the development of a pretty big patch for autocluster. It adds proper startup locking: a node acquires the startup lock early during startup, and releases it only after it has properly registered itself in the backend. Only the etcd backend is supported at the moment, but others can be added easily (by implementing just 2 new callbacks in the backend module).
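Conceptually, the new locking is an atomic create-if-absent with a TTL; a rough etcdctl (etcd v2 API) analogue of what the patched plugin does, with hypothetical key and value names:

# 'mk' fails if the key already exists, so only one node can hold the lock:
$ etcdctl mk /rabbitmq/startup_lock "$MY_NODE_NAME" --ttl 30
# ... join the cluster and register this node in etcd ...
$ etcdctl rm /rabbitmq/startup_lock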
Another solution to that problem is, again, Kubernetes PetSets, as they can perform startup orchestration - with only one node starting up at any given time - but they are currently considered an alpha feature, and the patch above provides this functionality for everyone, not only for Kubernetes users.
Monitoring
The only thing left is to monitor our RabbitMQ cluster so that it can run unattended. We need to monitor both each rabbit's health and whether it's properly clustered with the rest of the nodes.

You may remember the times when running rabbitmqctl list_queues or rabbitmqctl list_channels was used as a method of monitoring, but this is not ideal, because it can't distinguish between local and remote problems, and it creates significant network load. To that end, meet the new and shiny rabbitmqctl node_health_check - since RabbitMQ 3.6.4 it's the best way to check the health of any single RabbitMQ node.
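node_health_check is automation-friendly: it exits with a non-zero status when the node is unhealthy, so it can be run straight from kubectl (the pod name below is a placeholder) or wrapped into a liveness probe:

$ kubectl exec --namespace=demo rabbitmq-pod rabbitmqctl node_health_check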
Checking whether a node is properly clustered requires several checks:
- It should be clustered with the best node registered in the autocluster backend. This is the node that new nodes will attempt to join, which is currently the first alive node in alphabetical order.
- Even when the node is properly clustered with the discovery node, its data can still have diverged. Note that this check is not transitive, so we need to check the partitions list both on the current node and on the discovery node.
Both checks can be performed with a single command:

$ rabbitmqctl eval 'autocluster:cluster_health_check_report().'
Using this rabbitmqctl command we can both detect any problem with our rabbit node and stop it immediately, so that Kubernetes has a chance to do its restarting magic.
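For instance, a minimal liveness-check sketch combining both commands (assuming that cluster_health_check_report() returns ok on a healthy node - verify this against the actual output of your plugin version):

#!/bin/sh
# Exit non-zero and stop the broker on any detected problem,
# so that Kubernetes restarts the pod.
status=0
rabbitmqctl node_health_check || status=1
report=$(rabbitmqctl eval 'autocluster:cluster_health_check_report().')
[ "$report" = "ok" ] || status=1
if [ "$status" -ne 0 ]; then
    rabbitmqctl stop_app
    exit 1
fi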
Make your own RabbitMQ cluster
If you want to replicate this setup yourself, you'll need a recent version of RabbitMQ and the custom release of the autocluster plugin (as the startup locking patch is not yet accepted upstream).

You can look at how this setup was done for Fuel CCP, or use the standalone version of the same setup as a base for your own implementation.
To give you an idea of how this works, let's assume that you have cloned the second repository, and that you have a k8s namespace named `demo` with an `etcd` server running in it and accessible using the same `etcd` name. You can create this setup by running the following commands:
$ kubectl create namespace demo
$ kubectl run etcd --image=microbox/etcd --port=4001 \
    --namespace=demo -- --name etcd
$ kubectl --namespace=demo expose deployment etcd

Once that's done, to set up RabbitMQ, follow these steps:
- Build a Docker image with the proper versions of RabbitMQ and autocluster, and with all the necessary configuration parts.
$ docker build . -t rabbitmq-autocluster
- Store the Erlang cookie in Kubernetes secret storage.
$ kubectl create secret generic --namespace=demo erlang.cookie \
    --from-file=./erlang.cookie

- Create a 3-node RabbitMQ deployment. For simplicity's sake, you can use the rabbitmq.yaml file from https://github.com/binarin/rabbit-on-k8s-standalone/blob/master/rabbitmq.yaml.
$ kubectl create -f rabbitmq.yaml
- Check that clustering indeed worked.
$ FIRST_POD=$(kubectl get pods --namespace demo -l 'app=rabbitmq' \
    -o jsonpath='{.items[0].metadata.name}')
$ kubectl exec --namespace=demo $FIRST_POD rabbitmqctl \
    cluster_status
Cluster status of node 'rabbit@172.17.0.3' ...
[{nodes,[{disc,['rabbit@172.17.0.3','rabbit@172.17.0.4',
                'rabbit@172.17.0.7']}]},
 {running_nodes,['rabbit@172.17.0.4','rabbit@172.17.0.7','rabbit@172.17.0.3']},
 {cluster_name,<<"rabbit@rabbitmq-deployment-861116474-cmshz">>},
 {partitions,[]},
 {alarms,[{'rabbit@172.17.0.4',[]},
          {'rabbit@172.17.0.7',[]},
          {'rabbit@172.17.0.3',[]}]}]

The important part here is that both nodes and running_nodes contain three nodes.