= Mass Upgrade Infrastructure SOP

Every once in a while, we need to apply mass upgrades to our servers for
various security and other upgrades.

== Contents

* <<_contact_information>>
* <<_preparation>>
* <<_staging>>
* <<_non_outage_causing_hosts>>
* <<_special_considerations>>
** <<_post_reboot_action>>
** <<_bastion01_and_bastion02_and_openvpn_server>>
** <<_batcave01>>
** <<_noc01_dhcp_server>>
** <<_special_package_management_directives>>
* <<_update_leader>>
* <<_updates_and_reboots_via_playbook>>
* <<_doing_the_upgrade>>
* <<_aftermath>>
** <<_non_virthost_reboots>>

== Contact Information

Owner:::
Fedora Infrastructure Team
Contact:::
#fedora-admin, sysadmin-main, infrastructure@lists.fedoraproject.org,
#fedora-noc
Location:::
All over the world.
Servers:::
all
Purpose:::
Apply kernel/other upgrades to all of our servers

== Preparation

Mass updates are usually applied every few months, or sooner if some
critical bugs are fixed. Mass updates are done outside of freeze windows to avoid
causing any problems for Fedora releases.

The following items are all done before the actual mass update:

* Plan an outage window or windows outside of a freeze.
* File an outage ticket in the fedora-infrastructure tracker, using the outage
template. This should describe the exact time/date and what is included.
* Get the outage ticket reviewed by someone else to confirm there are no mistakes
in it.
* Send the outage announcement to the infrastructure and devel-announce lists (for
outages that affect contributors only) or to infrastructure, devel-announce
and announce (for outages that affect all users).
* Add a 'planned' outage to fedorastatus. This will show the planned outage
there for higher visibility.
* Set up a hackmd or other shared document that lists all the virthosts and
bare metal hosts that need rebooting and organize it per day. This is used
to track which admin is handling which server(s).

Typically updates/reboots are done on all staging hosts on a Monday,
then all non-outage-causing hosts on Tuesday, and finally the hosts that
require an outage on Wednesday.

== Staging

____
Any updates that can be tested in staging or a pre-production
environment should be tested there first. This includes new kernels, updates
to core database applications/libraries, web applications, libraries,
etc. This is typically done a few days before the actual outage.
Too far in advance and things may have changed again, so it's important
to do this just before the production updates.
____

== Non outage causing hosts

Some hosts can be safely updated/rebooted without an outage because
they have multiple machines behind a load balancer, are not
visible to end users, or for other reasons. These updates are typically
done on Tuesday of the outage week so they are done before the outage
on Wednesday. These hosts include proxies and a number of virthosts
whose VMs meet these criteria.

== Special Considerations

While this may not be a complete list, here are some special things that
must be taken into account before rebooting certain systems:

=== Post reboot action

The following machines require post-boot actions (mostly entering
passphrases). Make sure admins that have the passphrases are on hand for
the reboot:

____
* backup01 (ssh agent passphrase for backup ssh key)
* sign-vault01 (NSS passphrase for sigul service and luks passphrase)
* sign-bridge01 (run 'sigul_bridge -dvv' after it comes back up, no passphrase needed)
* autosign01 (NSS passphrase for robosignatory service and luks passphrase)
* buildvm-s390x-15/16/16 (need sshfs mount of koji volume redone)
* batcave01 (ssh agent passphrase for ansible ssh key)
* notifs-backend01 (run the following after reboot):
+
....
rabbitmqctl eval 'application:set_env(rabbit, consumer_timeout, 36000000).'
systemctl restart fmn-backend@1; for i in `seq 1 24`; do echo $i; systemctl restart fmn-worker@$i | cat; done
....
____

=== Bastion01 and Bastion02 and openvpn server

If a reboot of bastion01 is done during an outage, nothing needs to be changed
here. However, if bastion01 will be down for an extended period of time,
openvpn can be switched to bastion02 by stopping openvpn-server@openvpn
on bastion01 and starting it on bastion02:

....
on bastion01: systemctl stop openvpn-server@openvpn
on bastion02: systemctl start openvpn-server@openvpn
....

The process can be reversed after the other machine is back.
Clients try 01 first, then 02 if it's down. It's important
to make sure all the clients are using one machine or the
other, because if they are split, routing between machines
may be confused.

=== batcave01

batcave01 is our ansible control host. It's where you run the playbooks
mentioned in this SOP. However, it too needs updating
and rebooting, and you cannot use the vhost_reboot playbook for it,
since it would be rebooting its own virthost. For this host you should
go to the virthost and 'virsh shutdown' all the other VMs, then
'virsh shutdown' batcave01, then reboot the virthost manually.
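
A minimal sketch of that sequence, run on batcave01's virthost (the guest
names here are examples; check 'virsh list' for the real ones):

....
# on the virthost that hosts batcave01
virsh list --name                          # see which guests are running
for vm in $(virsh list --name | grep -v batcave01); do
    virsh shutdown "$vm"                   # cleanly stop every other guest
done
virsh shutdown batcave01                   # stop batcave01 itself last
# once 'virsh list' shows no running guests:
reboot
....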

=== noc01 / dhcp server

noc01 is our dhcp server. Unfortunately, when rebooting the vmhost that
contains the noc01 vm, that vmhost has no dhcp server to
answer it when it boots and tries to configure its network to talk to
the tang server. To work around this you can run a simple dhcpd
on batcave01. Start it there, let the vmhost with noc01 come
up, and then stop it. Ideally we would make another dhcp host
to avoid this issue at some point.

batcave01: 'systemctl start dhcpd'

Remember to stop it after the host comes back up.
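
For example (this assumes a dhcpd configuration is already in place on
batcave01):

....
# on batcave01, just before rebooting the vmhost that hosts noc01
sudo systemctl start dhcpd
# ... reboot the vmhost and wait for it (and noc01) to come back up ...
sudo systemctl stop dhcpd
....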

=== Special package management directives

Sometimes we need to exclude something from being updated.
This can be done with the package_excludes variable. Set
that and the playbooks doing updates will exclude the listed items.

This variable is set in ansible/host_vars or ansible/group_vars
for the host or group.
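
A hypothetical example (the host file name and package globs are made up,
and the exact format is an assumption -- check how the update playbooks
consume package_excludes before copying this):

....
# ansible/host_vars/db01.example.fedoraproject.org
package_excludes: ['kernel*', 'mariadb*']
....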

== Update Leader

Each update should have a Leader appointed. This person will be in
charge of doing any read-write operations and delegating tasks to others.
If you aren't specifically asked by the Leader to reboot or
change something, please don't. The Leader will assign machine
groups to reboot, or ask specific people to look at machines that didn't
come back up from reboot or aren't working right after reboot. It's
important to avoid multiple people operating on a single machine in a
read-write manner and interfering with each other's changes.

Usually for a mass update/reboot there will be a hackmd or similar
document that tracks which machines have already been rebooted
and who is working on which one. Please check with the Leader
for a link to this document.

== Updates and Reboots via playbook

There are several playbooks related to this task:

* vhost_update.yml applies updates to a vmhost and all its guests
* vhost_reboot.yml shuts down the VMs and reboots a vmhost
* vhost_update_reboot.yml does both of the above

For hosts done outside the outage window you probably want to use these to make sure
updates are applied before reboots. Once updates are applied globally
before the outage you will want to just use the reboot playbook.
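
For example, a run limited to a single vmhost might look like the following
(run from batcave01; the playbook path, hostname, and the use of --limit are
examples, so check the playbook for how it actually selects its target):

....
sudo ansible-playbook playbooks/vhost_update_reboot.yml -l virthost01.iad2.fedoraproject.org
....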

Additionally there are two more playbooks to check things:

* check-for-nonvirt-updates.yml
* check-for-updates.yml

See those playbooks for more information, but basically they allow
you to see how many updates are pending on all the virthosts/bare
metal machines and/or all machines. This is good to run at the end
of outages to confirm that everything was updated.
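
For example (run from batcave01; the playbook path is an assumption, adjust
it to wherever the check playbooks live in the ansible repo):

....
sudo ansible-playbook playbooks/check-for-updates.yml
....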

== Doing the upgrade

If possible, system upgrades should be done in advance of the reboot
(with relevant testing of new packages on staging). To do the upgrades,
make sure that the Infrastructure RHEL repo is updated as necessary to
pull in the new packages (xref:infra-repo.adoc[Infrastructure Yum Repo SOP]).

Before the outage, ansible can be used to apply all updates to hosts, or to
apply all updates to staging hosts before those are done. Something like:

....
ansible -m shell -a 'yum clean all; yum update -y; rkhunter --propupd' hostlist
....

== Aftermath

[arabic]
. Make sure that everything's running fine.
. Check nagios for alerts and clear them all.
. Re-enable nagios notifications after they are cleared.
. Make sure to perform any manual post-boot setup (such as entering
passphrases for encrypted volumes).
. Consider running check-for-updates or check-for-nonvirt-updates to confirm
that all hosts are updated.
. Close the fedorastatus outage.
. Close the outage ticket.

=== Non virthost reboots

If you need to reboot specific hosts and make sure they recover,
consider using:

....
sudo ansible -m reboot hostname
....