Best practices for running RabbitMQ in OpenStack
Deploy RabbitMQ on dedicated nodes
With dedicated nodes, RabbitMQ is isolated from other CPU-hungry processes and can therefore sustain more load.
This isolation option is available in Mirantis OpenStack starting with version 8.0. For more information, search for ‘Detach RabbitMQ’ on the validated plugins page.
Run RabbitMQ with HiPE
HiPE stands for High Performance Erlang. When HiPE is enabled, the Erlang application is pre-compiled into machine code before being executed. Our benchmark showed that this gives RabbitMQ a performance boost of up to 30%. (If you're into that sort of thing, you can find the benchmark details here and the results are here.)
The drawback is that the application's initial start time increases considerably while the Erlang application is compiled. With HiPE, the first RabbitMQ start takes around 2 minutes.
Another, subtler drawback we have discovered is that with HiPE enabled, debugging RabbitMQ can be hard, because HiPE can garble error tracebacks and render them unreadable.
HiPE is enabled in Mirantis OpenStack starting with version 9.0.
Do not use queue mirroring for RPC queues
Our research shows that enabling queue mirroring on a 3-node cluster cuts message throughput roughly in half. You can see this effect in the publicly available test reports produced by the Mirantis Scale team.
On the other hand, RPC messages become obsolete quickly (after about 1 minute), and a lost message only causes the operations currently in progress to fail, so running RPC queues without mirroring seems to be a good tradeoff.
At Mirantis, we generally enable queue mirroring only for Ceilometer queues, where messages must be preserved. You can see how we define such a RabbitMQ policy here.
The option to turn off queue mirroring is available starting with Mirantis OpenStack (MOS) 8.0, and mirroring is turned off by default for RPC queues starting with version 9.0.
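To make the idea of a selective mirroring policy concrete, here is a minimal sketch in Python. It is not the policy Mirantis ships: the queue-name pattern and the policy name `ha-notif` are hypothetical and will need adjusting to the queue names in your deployment. The `ha-mode` and `ha-sync-mode` keys and the `rabbitmqctl set_policy` argument order are the standard ones for RabbitMQ classic mirrored queues.

```python
import json
import re
import shlex

# Hypothetical pattern: mirror only queues whose names start with
# "notifications." or "metering."; everything else (RPC queues,
# reply queues) is left unmirrored. Adjust to your own queue names.
MIRRORED_PATTERN = r"^(notifications|metering)\."

# Classic mirrored-queue policy definition: mirror to all nodes and
# synchronise new mirrors automatically.
DEFINITION = {"ha-mode": "all", "ha-sync-mode": "automatic"}


def is_mirrored(queue_name: str) -> bool:
    """Return True if the policy pattern would apply to this queue."""
    return re.search(MIRRORED_PATTERN, queue_name) is not None


def set_policy_command(name: str = "ha-notif") -> str:
    """Build the rabbitmqctl command line that would install the policy."""
    args = [
        "rabbitmqctl", "set_policy", "--apply-to", "queues",
        name, MIRRORED_PATTERN, json.dumps(DEFINITION),
    ]
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    for queue in ("notifications.info", "conductor", "reply_0a1b2c"):
        print(queue, "-> mirrored" if is_mirrored(queue) else "-> not mirrored")
    print(set_policy_command())
```

Checking the pattern against a list of real queue names (for example, from `rabbitmqctl list_queues name`) before applying it is a cheap way to confirm that RPC queues really stay unmirrored.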
Use a separate RabbitMQ cluster for Ceilometer
In general, Ceilometer doesn't send many messages through RabbitMQ. But if Ceilometer gets stuck, its queues overflow. That leads to RabbitMQ crashing, which in turn causes outages for other OpenStack services.
The ability to use a separate RabbitMQ cluster for notifications is available starting with OpenStack Mitaka (MOS 9.0), but it is not supported in MOS out of the box. The feature is not documented yet, but you can find the implementation here.
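As a rough illustration of what the split looks like in a service's configuration, the sketch below renders two transport URLs: one for RPC and one for notifications. It assumes the oslo.messaging option `transport_url` in both the `[DEFAULT]` and `[oslo_messaging_notifications]` sections; verify those names against the oslo.messaging release you run. The host names and credentials are made up.

```python
import configparser
import io


def render_messaging_config(rpc_url: str, notification_url: str) -> str:
    """Render an INI fragment that sends RPC and notifications to
    different RabbitMQ clusters."""
    # interpolation=None so that '%' in a password is kept literally.
    cfg = configparser.ConfigParser(interpolation=None)
    # RPC traffic uses the URL from [DEFAULT].
    cfg["DEFAULT"]["transport_url"] = rpc_url
    # Notifications (the Ceilometer traffic) get their own cluster.
    cfg["oslo_messaging_notifications"] = {"transport_url": notification_url}
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical hosts and credentials, for illustration only.
    print(render_messaging_config(
        "rabbit://nova:secret@rabbit-rpc.example.com:5672/",
        "rabbit://nova:secret@rabbit-notif.example.com:5672/",
    ))
```

With this layout, a backlog on the notifications cluster cannot exhaust memory on the cluster that carries RPC traffic.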
Reduce Ceilometer metrics volume
Another best practice when running RabbitMQ beneath OpenStack is to reduce the number of metrics sent, their frequency, or both. Obviously that reduces the stress put on RabbitMQ, Ceilometer and MongoDB, but it also reduces the chance of messages piling up in RabbitMQ if Ceilometer/MongoDB can't cope with their volume. In turn, messages piling up in a queue reduce overall RabbitMQ performance.
You can also mitigate the effect of messages piling up by using RabbitMQ’s lazy queues feature (available starting with RabbitMQ 3.6.0), but as of this writing, MOS does not make use of lazy queues.
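A back-of-the-envelope calculation shows how much both knobs matter. The sketch below assumes each polled resource produces one sample per meter per polling interval, which ignores batching and event-driven notifications; the resource counts, meter counts and intervals are invented for illustration.

```python
def samples_per_hour(resources: int, meters: int, interval_s: int) -> float:
    """Estimate samples produced per hour, assuming one sample per
    resource per meter per polling interval."""
    return resources * meters * 3600 / interval_s


if __name__ == "__main__":
    # Hypothetical cloud: 1000 polled resources.
    before = samples_per_hour(resources=1000, meters=20, interval_s=60)
    after = samples_per_hour(resources=1000, meters=10, interval_s=600)
    print(f"before: {before:,.0f} samples/hour")
    print(f"after:  {after:,.0f} samples/hour")
    print(f"reduction factor: {before / after:.0f}x")
```

In this made-up case, halving the meter list and polling ten times less often cuts the estimated volume by a factor of 20.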
(Carefully) consider disabling queue mirroring for Ceilometer queues
In the Mirantis OpenStack architecture, queue mirroring is the only ‘persistence’ measure used. We do not use durable queues, so do not disable queue mirroring if losing Ceilometer notifications will hurt you. For example, if notification data is used for billing, you can't afford to lose those notifications.
The ability to disable mirroring for Ceilometer queues is available in Mirantis OpenStack starting with version 8.0, but it is disabled by default.
So what do you think? Did we leave out any of your favorite tips? Let us know in the comments!