massupgrade: lots of fixes and improvements

This SOP had a bunch of old stuff in it.
This syncs it back up with reality mostly.

Proofreading/formatting welcome.

Questions also welcome. :)

Signed-off-by: Kevin Fenzi <kevin@scrye.com>
Kevin Fenzi 2023-03-27 12:43:45 -07:00 committed by zlopez
parent a7dd90e1c3
commit f03fde0f5a


@@ -38,39 +38,29 @@ Purpose:::
== Preparation
[arabic]
. Determine which host group you are going to be doing updates/reboots
on.
+
Group "A"::
servers that end users will see or note being down and anything that
depends on them.
Group "B"::
servers that contributors will see or note being down and anything
that depends on them.
Group "C"::
servers that infrastructure will notice are down, or that are redundant
enough that some can be rebooted while others take the load.
. Appoint an 'Update Leader' for the updates.
. Follow the xref:outage.adoc[Outage Infrastructure SOP] and send advance notification
to the appropriate lists. Try to schedule the update at a time when many
admins are around to help/watch for problems and when impact for the
group affected is less. Do NOT do multiple groups on the same day if
possible.
. Plan an order for rebooting the machines considering two factors:
+
____
* Location of systems on the kvm or xen hosts. [You will normally reboot
all systems on a host together]
* Impact of systems going down on other services, operations and users.
Thus since the database servers and nfs servers are the backbone of many
other systems, they and systems that are on the same xen boxes would be
rebooted before other boxes.
____
. To aid in organizing a mass upgrade/reboot with many people helping,
it may help to create a checklist of machines in a gobby document.
. Schedule downtime in nagios.
. Make doubly sure that various app owners are aware of the reboots.

Mass updates are usually applied every few months, or sooner if there are
critical bug fixes. Mass updates are done outside of freeze windows to avoid
causing any problems for Fedora releases.

The following items are all done before the actual mass update:

* Plan an outage window or windows outside of a freeze.
* File an outage ticket in the fedora-infrastructure tracker, using the outage
template. This should describe the exact time/date and what is included.
* Get the outage ticket reviewed by someone else to confirm there are no
mistakes in it.
* Send the outage announcement to the infrastructure and devel-announce lists
(for outages that affect only contributors) or to infrastructure,
devel-announce and announce (for outages that affect all users).
* Add a 'planned' outage to fedorastatus. This will show the planned outage
there for higher visibility.
* Set up a hackmd or other shared document that lists all the virthosts and
bare metal hosts that need rebooting, organized per day. This is used
to track which admin is handling which server(s).

Typically updates/reboots are done on all staging hosts on a Monday,
then all non-outage-causing hosts on Tuesday, and then finally the
outages on Wednesday.
== Staging
@@ -78,43 +68,25 @@ ____
Any updates that can be tested in staging or a pre-production
environment should be tested there first, including new kernels, updates
to core database applications / libraries, web applications, libraries,
etc. This is typically done a few days before the actual outage.
Too far in advance and things may have changed again, so it's important
to do this just before the production updates.
____
== Non outage causing hosts
Some hosts can be safely updated/rebooted without an outage because
they have multiple machines behind a load balancer, are not
visible to end users, or for other reasons. These updates are typically
done on the Tuesday of the outage week, so they are finished before the
outage on Wednesday. These hosts include the proxies and a number of
virthosts whose VMs meet these criteria.
== Special Considerations
While this may not be a complete list, here are some special things that
must be taken into account before rebooting certain systems:
=== Disable builders
Before the following machines are rebooted, all koji builders should be
disabled and all running jobs allowed to complete:
____
* db04
* nfs01
* kojipkgs02
____
Builders can be removed from koji, updated and re-added. Use:
....
koji disable-host NAME
koji enable-host NAME
....
[NOTE]
====
You must be a koji admin to run these commands.
====
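
As an illustration, the disable/update/re-enable cycle for a single builder
could look like the following sketch (the builder name is a placeholder; run
the koji commands as a koji admin and wait for the builder's running tasks to
finish before updating):

....
koji disable-host buildvm-x86-01.iad2.fedoraproject.org
# wait for any tasks still running on the builder to complete, then:
ssh buildvm-x86-01.iad2.fedoraproject.org 'yum -y update && reboot'
# once the builder is back up and healthy:
koji enable-host buildvm-x86-01.iad2.fedoraproject.org
....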
Additionally, rel-eng and builder boxes may need a special version
of rpm. Make sure to check with rel-eng on any rpm upgrades for them.
=== Post reboot action
The following machines require post-boot actions (mostly entering
@@ -122,50 +94,64 @@ passphrases). Make sure admins that have the passphrases are on hand for
the reboot:
____
* backup-2 (LUKS passphrase on boot)
* serverbeach* (requires fixing firewall rules, see below)
* backup01 (ssh agent passphrase for the backup ssh key)
* sign-vault01 (NSS passphrase for the sigul service and LUKS passphrase)
* sign-bridge01 (run 'sigul_bridge -dvv' after it comes back up, no passphrase needed)
* autosign01 (NSS passphrase for the robosignatory service and LUKS passphrase)
* buildvm-s390x-15/16/16 (need the sshfs mount of the koji volume redone)
* batcave01 (ssh agent passphrase for the ansible ssh key)
* notifs-backend01 (raise the rabbitmq consumer timeout and restart the
fmn backend and workers; see the commands after this list)
____
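
The notifs-backend01 steps referenced above, collected into one block for
convenience (run as root on notifs-backend01 once it is back up):

....
rabbitmqctl eval 'application:set_env(rabbit, consumer_timeout, 36000000).'
systemctl restart fmn-backend@1
for i in `seq 1 24`; do echo $i; systemctl restart fmn-worker@$i | cat; done
....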
Each serverbeach host needs 3 or 4 iptables rules added anytime it's
rebooted or libvirt is upgraded:
....
iptables -I FORWARD -o virbr0 -j ACCEPT
iptables -I FORWARD -i virbr0 -j ACCEPT
iptables -t nat -I POSTROUTING -s 192.168.122.3/32 -j SNAT --to-source 66.135.62.187
....
[NOTE]
====
The source is the internal guest ips, the to-source is the external ips
that map to that guest ip. If there are multiple guests, each one needs
the above SNAT rule inserted.
====
=== Schedule autoqa01 reboot
There is currently an autoqa01.c host on cnode01. Check with QA folks
before rebooting this guest/host.
=== Bastion01 and Bastion02 and openvpn server
We need one of the bastion machines to be up to provide openvpn for all
machines. Before rebooting bastion02, modify:
`manifests/nodes/bastion0*.iad2.fedoraproject.org.pp` files to start
openvpn server on bastion01, wait for all clients to re-connect, reboot
bastion02 and then revert back to it as openvpn hub.
If a reboot of bastion01 is done during an outage, nothing needs to be changed
here. However, if bastion01 will be down for an extended period of time,
openvpn can be switched to bastion02 by stopping openvpn-server@openvpn
on bastion01 and starting it on bastion02:

on bastion01: 'systemctl stop openvpn-server@openvpn'
on bastion02: 'systemctl start openvpn-server@openvpn'

The process can be reversed after the other host is back.
Clients try 01 first, then 02 if it's down. It's important
to make sure all the clients are using one machine or the
other, because if they are split, routing between machines
may be confused.

=== Special yum directives

Sometimes we will wish to exclude or otherwise modify the yum.conf on a
machine. For this purpose, all machines have an include, making them
read
http://infrastructure.fedoraproject.org/infra/hosts/FQHN/yum.conf.include
(TODO Fix link)
from the infrastructure repo. If you need to make such changes, add them
to the infrastructure repo before doing updates.
=== batcave01
batcave01 is our ansible control host. It's where you run playbooks
that have been mentioned in this SOP. However, it too needs updating
and rebooting, and you cannot use the vhost_reboot playbook for it,
since it would be rebooting its own virthost. For this host you should
go to the virthost and 'virsh shutdown' all the other VMs, then
'virsh shutdown' batcave01, then reboot the virthost manually.
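
A rough sketch of that sequence, run as root on batcave01's virthost (guest
names and the grep filter are illustrative; check 'virsh list' first and shut
guests down in whatever order makes sense):

....
# shut down every running guest except batcave01 first
for vm in $(virsh list --name | grep -v batcave01); do virsh shutdown "$vm"; done
virsh list            # wait until only batcave01 is still running
virsh shutdown batcave01
# once batcave01 shows as shut off:
reboot
....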
=== noc01 / dhcp server
noc01 is our dhcp server. Unfortunately, when rebooting the vmhost that
contains the noc01 vm, that vmhost has no dhcp server to
answer it when it boots and tries to configure networking to talk to
the tang server. To work around this you can run a simple dhcpd
on batcave01: start it there, let the vmhost with noc01 come
up, and then stop it. Ideally we would make another dhcp host
to avoid this issue at some point.

On batcave01: 'systemctl start dhcpd'
(remember to stop it after the vmhost comes back up)
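
As a short sketch of that workaround (run on batcave01; the important part is
remembering the final stop):

....
systemctl start dhcpd
# reboot the vmhost that hosts noc01 and wait for it (and noc01) to come back
systemctl stop dhcpd
....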
=== Special package management directives
Sometimes we need to exclude something from being updated.
This can be done with the package_excludes variable. Set
it and the playbooks doing updates will exclude the listed items.
This variable is set in ansible/host_vars or ansible/group_vars
for the host or group.
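
For illustration, a host_vars entry might look like the following sketch (the
host name and package patterns are hypothetical, and the exact value format
should be checked against the update playbooks that consume the variable):

....
# ansible/host_vars/db01.iad2.fedoraproject.org
package_excludes: 'kernel* mariadb*'
....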
== Update Leader
@@ -178,168 +164,29 @@ come back up from reboot or aren't working right after reboot. It's
important to avoid multiple people operating on a single machine in a
read-write manner and interfering with changes.
Usually for a mass update/reboot there will be a hackmd or similar
document that tracks which machines have already been rebooted
and who is working on which one. Please check with the leader
for a link to this document.

== Updates and Reboots via playbook

There are several playbooks related to this task:

* vhost_update.yml applies updates to a vmhost and all of its guests
* vhost_reboot.yml shuts down the VMs and reboots a vmhost
* vhost_update_reboot.yml does both of the above

For hosts outside of an outage you probably want to use these to make sure
updates are applied before reboots. Once updates are applied globally
before the outage you will want to just use the reboot playbook.
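
As an illustration, a single-virthost run from batcave01 might look like this
(the rbac-playbook wrapper, playbook path and the name of the target variable
are assumptions; check the playbook headers before running):

....
sudo rbac-playbook vhost_update_reboot.yml -e target=virthost05.iad2.fedoraproject.org
....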
== Group A reboots

Group A machines are end user critical ones. Outages here should be
planned at least a week in advance and announced to the announce list.

List of machines currently in A group (note: this is going to be
automated).

These hosts are grouped based on the virt host they reside on:
* torrent02.fedoraproject.org
* ibiblio02.fedoraproject.org
* people03.fedoraproject.org
* ibiblio03.fedoraproject.org
* collab01.fedoraproject.org
* serverbeach09.fedoraproject.org
* db05.iad2.fedoraproject.org
* virthost03.iad2.fedoraproject.org
* db01.iad2.fedoraproject.org
* virthost04.iad2.fedoraproject.org
* db-fas01.iad2.fedoraproject.org
* proxy01.iad2.fedoraproject.org
* virthost05.iad2.fedoraproject.org
* ask01.iad2.fedoraproject.org
* virthost06.iad2.fedoraproject.org
These are the rest:
* bapp02.iad2.fedoraproject.org
* bastion02.iad2.fedoraproject.org
* app05.fedoraproject.org
* backup02.fedoraproject.org
* bastion01.iad2.fedoraproject.org
* fas01.iad2.fedoraproject.org
* fas02.iad2.fedoraproject.org
* log02.iad2.fedoraproject.org
* memcached03.iad2.fedoraproject.org
* noc01.iad2.fedoraproject.org
* ns02.fedoraproject.org
* ns04.iad2.fedoraproject.org
* proxy04.fedoraproject.org
* smtp-mm03.fedoraproject.org
* batcave02.iad2.fedoraproject.org
* mm3test.fedoraproject.org
* packages02.iad2.fedoraproject.org
=== Group B reboots
This Group contains machines that contributors use. Announcements of
outages here should be at least a week in advance and sent to the
devel-announce list.
These hosts are grouped based on the virt host they reside on:
* db04.iad2.fedoraproject.org
* bvirthost01.iad2.fedoraproject.org
* nfs01.iad2.fedoraproject.org
* bvirthost02.iad2.fedoraproject.org
* pkgs01.iad2.fedoraproject.org
* bvirthost03.iad2.fedoraproject.org
* kojipkgs02.iad2.fedoraproject.org
* bvirthost04.iad2.fedoraproject.org
These are the rest:
* koji04.iad2.fedoraproject.org
* releng03.iad2.fedoraproject.org
* releng04.iad2.fedoraproject.org
=== Group C reboots
Group C are machines that infrastructure uses, or can be rebooted in
such a way as to continue to provide services to others via multiple
machines. Outages here should be announced on the infrastructure list.
Group C hosts that have proxy servers on them:
* proxy02.fedoraproject.org
* ns05.fedoraproject.org
* hosted-lists01.fedoraproject.org
* internetx01.fedoraproject.org
* app01.dev.fedoraproject.org
* darkserver01.dev.fedoraproject.org
* fakefas01.fedoraproject.org
* proxy06.fedoraproject.org
* osuosl01.fedoraproject.org
* proxy07.fedoraproject.org
* bodhost01.fedoraproject.org
* proxy03.fedoraproject.org
* smtp-mm02.fedoraproject.org
* tummy01.fedoraproject.org
* app06.fedoraproject.org
* noc02.fedoraproject.org
* proxy05.fedoraproject.org
* smtp-mm01.fedoraproject.org
* telia01.fedoraproject.org
* app08.fedoraproject.org
* proxy08.fedoraproject.org
* coloamer01.fedoraproject.org
Other Group C hosts:
* ask01.stg.iad2.fedoraproject.org
* app02.stg.iad2.fedoraproject.org
* proxy01.stg.iad2.fedoraproject.org
* releng01.stg.iad2.fedoraproject.org
* value01.stg.iad2.fedoraproject.org
* virthost13.iad2.fedoraproject.org
* db-fas01.stg.iad2.fedoraproject.org
* pkgs01.stg.iad2.fedoraproject.org
* packages01.stg.iad2.fedoraproject.org
* virthost11.iad2.fedoraproject.org
* app01.stg.iad2.fedoraproject.org
* koji01.stg.iad2.fedoraproject.org
* db02.stg.iad2.fedoraproject.org
* fas01.stg.iad2.fedoraproject.org
* virthost10.iad2.fedoraproject.org
* autoqa01.qa.fedoraproject.org
* autoqa-stg01.qa.fedoraproject.org
* bastion-comm01.qa.fedoraproject.org
* batcave-comm01.qa.fedoraproject.org
* virthost-comm01.qa.fedoraproject.org
* compose-x86-01.iad2.fedoraproject.org
* compose-x86-02.iad2.fedoraproject.org
* download01.iad2.fedoraproject.org
* download02.iad2.fedoraproject.org
* download03.iad2.fedoraproject.org
* download04.iad2.fedoraproject.org
* download05.iad2.fedoraproject.org
* download-rdu01.vpn.fedoraproject.org
* download-rdu02.vpn.fedoraproject.org
* download-rdu03.vpn.fedoraproject.org
* fas03.iad2.fedoraproject.org
* secondary01.iad2.fedoraproject.org
* memcached04.iad2.fedoraproject.org
* virthost01.iad2.fedoraproject.org
* app02.iad2.fedoraproject.org
* value03.iad2.fedoraproject.org
* virthost07.iad2.fedoraproject.org
* app03.iad2.fedoraproject.org
* value04.iad2.fedoraproject.org
* ns03.iad2.fedoraproject.org
* darkserver01.iad2.fedoraproject.org
* virthost08.iad2.fedoraproject.org
* app04.iad2.fedoraproject.org
* packages02.iad2.fedoraproject.org
* virthost09.iad2.fedoraproject.org
* hosted03.fedoraproject.org
* serverbeach06.fedoraproject.org
* hosted04.fedoraproject.org
* serverbeach07.fedoraproject.org
* collab02.fedoraproject.org
* serverbeach08.fedoraproject.org
* dhcp01.iad2.fedoraproject.org
* relepel01.iad2.fedoraproject.org
* sign-bridge02.iad2.fedoraproject.org
* koji03.iad2.fedoraproject.org
* bvirthost05.iad2.fedoraproject.org
* (disable each builder in turn, update and reenable).
* ppc11.iad2.fedoraproject.org
* ppc12.iad2.fedoraproject.org
* backup03
Additionally, there are two more playbooks to check things:

* check-for-nonvirt-updates.yml
* check-for-updates.yml

See those playbooks for more information, but basically they allow
you to see how many updates are pending on all the virthosts/bare
metal machines and/or all machines. This is good to run at the end
of an outage to confirm that everything was updated.
== Doing the upgrade
@@ -348,55 +195,21 @@ If possible, system upgrades should be done in advance of the reboot
make sure that the Infrastructure RHEL repo is updated as necessary to
pull in the new packages (xref:infra-repo.adoc[Infrastructure Yum Repo SOP])
On batcave01, as root run:

....
func-yum [--host=hostname] update
....

NOTE: --host can be specified multiple times and takes wildcards.

Ping people as necessary if you are unsure about any packages.
Additionally you can see which machines still need to be rebooted with:
....
sudo func-command --timeout=10 --oneline /usr/local/bin/needs-reboot.py | grep yes
....
You can also see which machines would need a reboot if updates were all
applied with:
....
sudo func-command --timeout=10 --oneline /usr/local/bin/needs-reboot.py after-updates | grep yes
....
== Doing the reboot
In the order determined above, reboots will usually be grouped by the
virtualization hosts that the servers are on. You can see the guests per
virt host on batcave01 in `/var/log/virthost-lists.out`.

To reboot sets of boxes based on which virthost they are on, we've written
a special script to facilitate this:
....
func-vhost-reboot virthost-fqdn
....
ex:
....
sudo func-vhost-reboot virthost13.iad2.fedoraproject.org
....
Before the outage, ansible can be used to just apply all updates to hosts, or
to apply all updates to staging hosts before those are done. Something like:

ansible hostlist -m shell -a 'yum clean all; yum update -y; rkhunter --propupd'
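
For example, limiting the same command to staging might look like this (the
'staging' group name is an assumption; substitute whatever inventory pattern
matches the staging hosts):

....
sudo ansible staging -m shell -a 'yum clean all; yum update -y; rkhunter --propupd'
....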
== Aftermath
[arabic]
. Make sure that everything is running fine
. Check nagios for alerts and clear them all
. Re-enable nagios notifications after they are cleared
. Make sure to perform any manual post-boot setup (such as entering
passphrases for encrypted volumes)
. Consider running check-for-updates or check-for-nonvirt-updates to confirm
that all hosts are updated
. Close the fedorastatus outage
. Close the outage ticket
=== Non virthost reboots
@@ -405,8 +218,5 @@ If you need to reboot specific hosts and make sure they recover -
consider using:
....
sudo func-host-reboot hostname hostname1 hostname2 ...
sudo ansible hostname -m reboot
....
If you want to reboot the hosts one at a time, waiting for each to come
back before rebooting the next, pass a `-o` to `func-host-reboot`.
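
A comparable ansible ad-hoc sketch for serial reboots (host names are
placeholders; -f 1 limits ansible to one host at a time, and the reboot
module waits for each host to return before moving on):

....
sudo ansible hostname1,hostname2 -m reboot -f 1
....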