= Mass Upgrade Infrastructure SOP

Every once in a while, we need to apply mass upgrades to our servers for
security and other fixes.
== Contents
* <<_contact_information>>
* <<_preparation>>
* <<_staging>>
* <<_non_outage_causing_hosts>>
* <<_special_considerations>>
** <<_post_reboot_action>>
** <<_bastion01_and_bastion02_and_openvpn_server>>
** <<_batcave01>>
** <<_noc01_dhcp_server>>
** <<_special_package_management_directives>>
* <<_update_leader>>
* <<_updates_and_reboots_via_playbook>>
* <<_doing_the_upgrade>>
* <<_aftermath>>
== Contact Information
Owner:::
Fedora Infrastructure Team
Contact:::
#fedora-admin, sysadmin-main, infrastructure@lists.fedoraproject.org,
#fedora-noc
Location:::
All over the world.
Servers:::
all
Purpose:::
Apply kernel/other upgrades to all of our servers
== Preparation
Mass updates are usually applied every few months, or sooner if there are
critical bug fixes that require it. Mass updates are done outside of freeze
windows to avoid causing any problems for Fedora releases.
The following items are all done before the actual mass update:
* Plan an outage window or windows outside of a freeze.
* File an outage ticket in the fedora-infrastructure tracker, using the outage
template. This should describe the exact time/date and what is included.
* Get the outage ticket reviewed by someone else to confirm there are no
mistakes in it.
* Send an outage announcement to the infrastructure and devel-announce lists (for
outages that affect contributors only) or to infrastructure, devel-announce
and announce (for outages that affect all users).
* Add a 'planned' outage to fedorastatus. This will show the planned outage
there for higher visibility.
* Set up a hackmd or other shared document that lists all the virthosts and
bare metal hosts that need rebooting, organized per day. This is used
to track which admin is handling which server(s).

Typically updates/reboots are done on all staging hosts on Monday,
then all non-outage-causing hosts on Tuesday, and finally the outage
reboots are on Wednesday.
== Staging
____
Any updates that can be tested in staging or a pre-production
environment should be tested there first. This includes new kernels and
updates to core database applications/libraries, web applications,
libraries, etc. This is typically done a few days before the actual outage.
Too far in advance and things may have changed again, so it's important
to do this just before the production updates.
____
== Non outage causing hosts
Some hosts can be safely updated/rebooted without an outage because
they sit behind a load balancer with multiple machines, are not
visible to end users, or for other reasons. These updates are typically
done on Tuesday of the outage week so they are finished before the outage
on Wednesday. These hosts include the proxies and a number of virthosts
that have VMs meeting these criteria.
== Special Considerations
While this may not be a complete list, here are some special things that
must be taken into account before rebooting certain systems:
=== Post reboot action
The following machines require post-boot actions (mostly entering
passphrases). Make sure admins who have the passphrases are on hand for
the reboot:
____
* backup01 (ssh agent passphrase for backup ssh key)
* sign-vault01 (NSS passphrase for sigul service and luks passphrase)
* sign-bridge01 (run 'sigul_bridge -dvv' after it comes back up, no passphrase needed)
* autosign01 (NSS passphrase for robosignatory service and luks passphrase)
* buildvm-s390x-15/16/16 (need the sshfs mount of the koji volume redone)
* batcave01 (ssh agent passphrase for ansible ssh key)
* notifs-backend01 (after boot, raise the rabbitmq consumer timeout and restart the fmn services):
rabbitmqctl eval 'application:set_env(rabbit, consumer_timeout, 36000000).'
systemctl restart fmn-backend@1; for i in `seq 1 24`; do echo $i; systemctl restart fmn-worker@$i | cat; done
____
=== Bastion01 and Bastion02 and openvpn server
If a reboot of bastion01 is done during an outage, nothing needs to be changed
here. However, if bastion01 will be down for an extended period of time
openvpn can be switched to bastion02 by stopping openvpn-server@openvpn
on bastion01 and starting it on bastion02.
on bastion01: 'systemctl stop openvpn-server@openvpn'
on bastion02: 'systemctl start openvpn-server@openvpn'
The process can be reversed after bastion01 is back.
Clients try 01 first, then fall back to 02 if it's down. It's important
to make sure all the clients are using one machine or the
other, because if they are split between the two, routing
may be confused.
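
A minimal sketch of the failover and failback (hostnames are shortened here;
the status check is just a sanity check, not required):

....
# fail over: move the openvpn server to bastion02
ssh bastion01 'sudo systemctl stop openvpn-server@openvpn'
ssh bastion02 'sudo systemctl start openvpn-server@openvpn'
ssh bastion02 'sudo systemctl status openvpn-server@openvpn'

# fail back once bastion01 has returned
ssh bastion02 'sudo systemctl stop openvpn-server@openvpn'
ssh bastion01 'sudo systemctl start openvpn-server@openvpn'
....
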
=== batcave01
batcave01 is our ansible control host. It's where you run the playbooks
mentioned in this SOP. However, it too needs updating
and rebooting, and you cannot use the vhost_reboot playbook for it,
since it would be rebooting its own virthost. For this host you should
go to the virthost and 'virsh shutdown' all the other VMs, then
'virsh shutdown' batcave01, then reboot the virthost manually, as in the
sketch below.
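
A minimal sketch of that sequence, run as root on batcave01's virthost
(guest names are whatever 'virsh list' shows; 'batcave01' here stands in
for its full guest name):

....
# shut down every guest except batcave01 first
for vm in $(virsh list --name | grep -v '^batcave01'); do
    virsh shutdown "$vm"
done

# then batcave01 itself
virsh shutdown batcave01

# wait until 'virsh list' shows no running guests, then reboot the virthost
virsh list
reboot
....
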
=== noc01 / dhcp server
noc01 is our dhcp server. Unfortunately, when rebooting the vmhost that
contains the noc01 vm, that vmhost has no dhcp server to
answer it when it boots and tries to configure its network to talk to
the tang server. To work around this you can run a simple dhcpd
on batcave01: start it there, let the vmhost with noc01 come
up, and then stop it. Ideally we would make another dhcp host
to avoid this issue at some point.

batcave01: 'systemctl start dhcpd'

Remember to stop it after the host comes back up.
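
A minimal sketch of the workaround, run on batcave01:

....
# on batcave01, only for the duration of the noc01 vmhost reboot
sudo systemctl start dhcpd

# ... reboot the vmhost that hosts noc01 and wait for it to come back ...

sudo systemctl stop dhcpd
....
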
=== Special package management directives
Sometimes we need to exclude something from being updated.
This can be done with the package_excludes variable. Set
that and the playbooks doing updates will exclude the listed items.
This variable is set in ansible/host_vars or ansible/group_vars
for the host or group.
== Update Leader
Each update should have a Leader appointed. This person will be in
charge of doing any read-write operations, and delegating to others to
do tasks. If you aren't specifically asked by the Leader to reboot or
change something, please don't. The Leader will assign out machine
groups to reboot, or ask specific people to look at machines that didn't
come back up from reboot or aren't working right after reboot. It's
important to avoid multiple people operating on a single machine in a
read-write manner and interfering with changes.
Usually for a mass update/reboot there will be a hackmd or similar
document that tracks which machines have already been rebooted
and who is working on which one. Please check with the Leader
for a link to this document.
== Updates and Reboots via playbook
There are several playbooks related to this task:

* vhost_update.yml applies updates to a vmhost and all its guests
* vhost_reboot.yml shuts down the VMs and reboots a vmhost
* vhost_update_reboot.yml does both of the above

For hosts handled outside the outage window you probably want to use these
to make sure updates are applied before reboots. Once updates have been
applied globally before the outage, you will want to just use the reboot
playbook.
Additionally there are two more playbooks to check things:

* check-for-nonvirt-updates.yml
* check-for-updates.yml

See those playbooks for more information, but basically they allow
you to see how many updates are pending on all the virthosts/bare
metal machines and/or all machines. This is good to run at the end
of an outage to confirm that everything was updated.
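
A rough sketch of running these from batcave01, assuming the usual
rbac-playbook wrapper (the hostname is only an example, and the way the
target host is passed differs per playbook, so check the playbook header
first):

....
# on batcave01; the hostname below is only an example
sudo rbac-playbook vhost_update_reboot.yml -l virthost01.iad2.fedoraproject.org

# after the outage, see how many updates are still pending
sudo rbac-playbook check-for-updates.yml
....
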
== Doing the upgrade
If possible, system upgrades should be done in advance of the reboot
(with relevant testing of new packages on staging). To do the upgrades,
make sure that the Infrastructure RHEL repo is updated as necessary to
pull in the new packages (xref:infra-repo.adoc[Infrastructure Yum Repo SOP]).

Before the outage, ansible can be used to apply all updates to hosts, or to
apply all updates to staging hosts before the production ones. Something like:

ansible hostlist -m shell -a 'yum clean all; yum update -y; rkhunter --propupd'
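
For example, a minimal sketch of updating just the staging hosts first
(this assumes a 'staging' group exists in the ansible inventory):

....
# staging hosts first (assumes a 'staging' inventory group)
sudo ansible staging -m shell -a 'yum clean all; yum update -y; rkhunter --propupd'
....
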
== Aftermath
[arabic]
. Make sure that everything is running fine.
. Check nagios for alerts and clear them all.
. Re-enable nagios notifications after they are cleared.
. Make sure to perform any manual post-boot setup (such as entering
passphrases for encrypted volumes).
. Consider running check-for-updates or check-for-nonvirt-updates to confirm
that all hosts are updated.
. Close the fedorastatus outage.
. Close the outage ticket.
=== Non virthost reboots
If you need to reboot specific hosts and make sure they recover,
consider using:
....
sudo ansible -m reboot hostname
....