This worked in prod, but in staging the queue names don't start with
the username, because the staging username has .stg in it. So, we need
the queues to also have .stg in their names.
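As a rough sketch of the intent (the vhost, variable names and queue
suffix handling here are illustrative, not the actual ones in the
playbook), the queue name just needs to pick up the same .stg suffix
that the staging username has so the ACLs keep matching:

    - name: declare the application queue
      community.rabbitmq.rabbitmq_queue:
        # must begin with the (staging) username for the ACLs to match
        name: "{{ app_username }}{{ '.stg' if env == 'staging' else '' }}"
        vhost: /pubsub
        durable: true
        state: present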
Signed-off-by: Kevin Fenzi <kevin@scrye.com>
Fixes https://pagure.io/fedora-infrastructure/issue/9170
Let's just have rabbitmq clean up any queues in the /bodhi vhost that
have been idle for more than a week.
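One way to express that with the community.rabbitmq modules is a vhost
policy using the 'expires' key, which deletes queues that have gone
unused for the given number of milliseconds (the policy name below is
made up):

    - name: clean up idle queues in the /bodhi vhost after a week
      community.rabbitmq.rabbitmq_policy:
        name: bodhi_idle_queue_cleanup
        vhost: /bodhi
        apply_to: queues
        pattern: ".*"
        tags:
          expires: 604800000  # 7 days in milliseconds
        state: present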
Signed-off-by: Kevin Fenzi <kevin@scrye.com>
The osci queues have changed since this playbook last completed.
The ttl has changed and the routing keys have changed.
The ansible rabbitmq module can't change these things on already created
queues because the api doesn't allow it. This makes this playbook fail
with:
"RabbitMQ RESTAPI doesn't support attribute changes for existing
queues"
So, for now, set the ttl to what it already is, and don't change the
routing keys at all. Hopefully this will get the playbook to complete,
and osci can manage at least the routing keys themselves wherever they
do that.
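Roughly, the queue task ends up pinning the ttl to the value already on
the broker, and the binding tasks are dropped. A sketch with the module
used directly (the real playbook goes through a role; the vhost and
variable names are made up):

    - name: declare the osci queues (ttl kept at the broker's current value)
      community.rabbitmq.rabbitmq_queue:
        name: "{{ osci_queue.name }}"
        vhost: /osci
        durable: true
        message_ttl: "{{ osci_queue.existing_ttl }}"  # unchanged; the REST API refuses edits
        state: present
      loop: "{{ osci_queues }}"
      loop_control:
        loop_var: osci_queue
    # no rabbitmq_binding tasks here: routing keys are left for osci to manage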
Signed-off-by: Kevin Fenzi <kevin@scrye.com>
This was hidden away in the odcs playbook in fedora infra, so I missed
that we didn't add it to the odcs role, which is where we copied the
bits for the centos odcs application. So, add it there so it creates
a centos-odcs user.
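For illustration only, assuming the missing piece is the service
account task and that the centos deployment overrides the account name
(the variable name is made up):

    - name: create the odcs service account
      ansible.builtin.user:
        name: "{{ odcs_service_user | default('odcs') }}"  # centos sets this to centos-odcs
        state: present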
Signed-off-by: Kevin Fenzi <kevin@scrye.com>
Not all OSCI queues are actively used all the time -- no need to keep
messages in them for 10 days. A 5 day TTL should be plenty of time even
for actively used queues.
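A sketch of the corresponding setting, assuming the queues are declared
with the rabbitmq_queue module (the vhost and variable names are
placeholders):

    - name: declare the osci queues with a 5 day message ttl
      community.rabbitmq.rabbitmq_queue:
        name: "{{ item }}"
        vhost: /osci
        durable: true
        message_ttl: 432000000  # 5 days in ms, down from 864000000 (10 days)
        state: present
      loop: "{{ osci_queue_names }}"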
The default loop var is 'item', but it's already being used in
rabbit/queue, so if we use it here as well it causes a clash and an
invalid binding. So, change this one to something else and see if that
fixes the issue.
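Roughly, the binding loop ends up looking like this (the exchange,
vhost and variable names are illustrative):

    - name: bind the queues to their topics
      community.rabbitmq.rabbitmq_binding:
        name: "amq.topic"
        destination: "{{ binding.queue }}"
        destination_type: queue
        routing_key: "{{ binding.routing_key }}"
        vhost: /pubsub
        state: present
      loop: "{{ bindings }}"
      loop_control:
        loop_var: binding  # not 'item': rabbit/queue already uses 'item' internally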
Signed-off-by: Kevin Fenzi <kevin@scrye.com>
We have been having the cluster fall over for still unknown reasons,
but this patch should at least help prevent that:
first we increase the net_ticktime parameter from its default of 60 to 120.
rabbitmq sends a 'tick' to the other cluster members about every quarter
of this interval, and if it hasn't heard from a member for the whole
interval it assumes that member is down. All these VMs are on the same
network and in the same datacenter, but perhaps heavy load from other
VMs causes them to sometimes not get a tick in time?
http://www.rabbitmq.com/nettick.html
Also, set our partitioning strategy to autoheal. Currently, if some
cluster member gets booted out, it gets paused and stops processing
entirely. With autoheal it will try to figure out a 'winning' partition
and restart all the nodes that are not in that partition.
https://www.rabbitmq.com/partitions.html
Hopefully the first change will make partitions less likely, and the
second will let partitions repair without causing massive pain to the
cluster.
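In terms of the role's variables, the two settings boil down to
something like this (the variable names are illustrative; the role
renders them into the kernel net_ticktime and rabbit
cluster_partition_handling entries of the rabbitmq config it templates
out):

    # group_vars for the rabbitmq cluster (sketch)
    rabbitmq_net_ticktime: 120                      # default is 60
    rabbitmq_cluster_partition_handling: autoheal   # instead of pausing the booted-out node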
Signed-off-by: Kevin Fenzi <kevin@scrye.com>
The Federation plugin uses an AMQP client that verifies that the
hostname it's connecting to is the right one. Our RabbitMQ server
TLS certificates only have the "public" name as Subject Alternative Name
and in that case apparently the client does not check the CN. Therefore
this changeset sets the client parameter to expect the "public" name in
the certificate.
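One way to spell that out, if the upstream is defined as a
federation-upstream parameter, is the server_name_indication query
parameter on the upstream URI (the hostname, vhost, upstream name and
CA path below are placeholders):

    - name: point the federation link at the public broker name
      community.rabbitmq.rabbitmq_parameter:
        component: federation-upstream
        name: pubsub-upstream
        vhost: /pubsub
        value: '{"uri": "amqps://rabbitmq.example.org:5671?server_name_indication=rabbitmq.example.org&verify=verify_peer&cacertfile=/etc/rabbitmq/ca.crt"}'
        state: present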
Signed-off-by: Aurélien Bompard <aurelien@bompard.org>