From 93cfa0134d27fa2433935e39799f65cf942bb8e8 Mon Sep 17 00:00:00 2001
From: Kevin Fenzi
Date: Thu, 12 Mar 2020 19:25:55 +0000
Subject: [PATCH] rabbitmq: adjust things to avoid messy partitions

We have been having the cluster fall over for still unknown reasons,
but this patch should at least help prevent the partitions involved:

First, we increase the net_ticktime parameter from its default of 60
to 120. Each node sends a 'tick' to the other cluster members every
net_ticktime/4 seconds, and if nothing is heard from a member within
net_ticktime (give or take 25%) that member is assumed to be down.
All these VMs are on the same net and in the same datacenter, but
perhaps heavy load from other VMs causes them to sometimes not get a
tick in time?

http://www.rabbitmq.com/nettick.html

Also, set our partitioning strategy to autoheal. Currently
(pause_minority), if some cluster member ends up in a minority
partition, it gets paused and stops processing entirely. With
autoheal it will try to figure out a 'winning' partition and restart
all the nodes that are not in that partition.

https://www.rabbitmq.com/partitions.html

Hopefully the first change will make partitions less likely and the
second will make them repair without causing massive pain to the
cluster.

Signed-off-by: Kevin Fenzi
---
 roles/rabbitmq_cluster/templates/rabbitmq.config | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/roles/rabbitmq_cluster/templates/rabbitmq.config b/roles/rabbitmq_cluster/templates/rabbitmq.config
index 5c38dbdc1f..82dd4446f1 100644
--- a/roles/rabbitmq_cluster/templates/rabbitmq.config
+++ b/roles/rabbitmq_cluster/templates/rabbitmq.config
@@ -21,7 +21,7 @@
 
     %% How to respond to cluster partitions.
     %% Documentation: https://www.rabbitmq.com/partitions.html
-    {cluster_partition_handling, pause_minority},
+    {cluster_partition_handling, autoheal},
 
     %% And some general config
     {log_levels, [{connection, none}]},
@@ -29,9 +29,7 @@
     {heartbeat, 600},
     {channel_max, 128}
   ]},
-  {kernel,
-    [
-    ]},
+  {kernel, [{net_ticktime, 120}]},
   {rabbitmq_management,
     [
       {listener, [{port, 15672},