massupgrade: lots of fixes and improvements

This SOP had a bunch of old stuff in it.
This syncs it back up with reality mostly.

Proofreading/formatting welcome.

Questions also welcome. :)

Signed-off-by: Kevin Fenzi <kevin@scrye.com>
Kevin Fenzi 2023-03-27 12:43:45 -07:00 committed by zlopez
parent a7dd90e1c3
commit f03fde0f5a


@@ -38,39 +38,29 @@ Purpose:::
== Preparation
[arabic]
. Determine which host group you are going to be doing updates/reboots
on.
+
Group "A"::
servers that end users will see or note being down and anything that
depends on them.
Group "B"::
servers that contributors will see or note being down and anything
that depends on them.
Group "C"::
servers that infrastructure will notice are down, or that are redundant
enough that some can be rebooted while others take the load.
. Appoint an 'Update Leader' for the updates.
. Follow the xref:outage.adoc[Outage Infrastructure SOP] and send advance notification
to the appropriate lists. Try to schedule the update at a time when many
admins are around to help/watch for problems and when impact for the
group affected is less. Do NOT do multiple groups on the same day if
possible.
. Plan an order for rebooting the machines considering two factors:
+
____
* Location of systems on the kvm or xen hosts. [You will normally reboot
all systems on a host together]
* Impact of systems going down on other services, operations and users.
Thus since the database servers and nfs servers are the backbone of many
other systems, they and systems that are on the same xen boxes would be
rebooted before other boxes.
____
. To aid in organizing a mass upgrade/reboot with many people helping,
it may help to create a checklist of machines in a gobby document.
. Schedule downtime in nagios.
. Make doubly sure that various app owners are aware of the reboots.

Mass updates are usually applied every few months, or sooner if there are
critical bug fixes. Mass updates are done outside of freeze windows to avoid
causing any problems for Fedora releases.

The following items are all done before the actual mass update:

* Plan an outage window or windows outside of a freeze.
* File an outage ticket in the fedora-infrastructure tracker, using the outage
template. This should describe the exact time/date and what is included.
* Get the outage ticket reviewed by someone else to confirm there are no
mistakes in it.
* Send the outage announcement to the infrastructure and devel-announce lists
(for outages that affect only contributors) or to infrastructure,
devel-announce and announce (for outages that affect all users).
* Add a 'planned' outage to fedorastatus. This will show the planned outage
there for higher visibility.
* Set up a hackmd or other shared document that lists all the virthosts and
bare metal hosts that need rebooting, organized per day. This is used
to track which admin is handling which server(s).

Typically updates/reboots are done on all staging hosts on a Monday,
then all non-outage-causing hosts on Tuesday, and then finally the
outages on Wednesday.
== Staging
@@ -78,43 +68,25 @@ ____
Any updates that can be tested in staging or a pre-production
environment should be tested there first, including new kernels, updates
to core database applications / libraries, web applications, libraries,
etc. This is typically done a few days before the actual outage.
Too far in advance and things may have changed again, so it's important
to do this just before the production updates.
____
== Non outage causing hosts
Some hosts can be safely updated/rebooted without an outage because
they have multiple machines behind a load balancer, are not
visible to end users, or for other reasons. These updates are typically
done on the Tuesday of the outage week, so they are finished before the
outage on Wednesday. These hosts include the proxies and a number of
virthosts whose VMs meet these criteria.
== Special Considerations
While this may not be a complete list, here are some special things that
must be taken into account before rebooting certain systems:
=== Disable builders
Before the following machines are rebooted, all koji builders should be
disabled and all running jobs allowed to complete:
____
* db04
* nfs01
* kojipkgs02
____
Builders can be removed from koji, updated and re-added. Use:
....
koji disable-host NAME
koji enable-host NAME
....
[NOTE]
====
You must be a koji admin to run these commands.
====
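
As an illustration, the disable/update/re-enable cycle for a single builder
could look like the following sketch (the builder name is a placeholder; run
the koji commands as a koji admin and wait for the builder's running tasks to
finish before updating):

....
koji disable-host buildvm-x86-01.iad2.fedoraproject.org
# wait for any tasks still running on the builder to complete, then:
ssh buildvm-x86-01.iad2.fedoraproject.org 'yum -y update && reboot'
# once the builder is back up and healthy:
koji enable-host buildvm-x86-01.iad2.fedoraproject.org
....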
Additionally, rel-eng and builder boxes may need a special version
of rpm. Make sure to check with rel-eng on any rpm upgrades for them.
=== Post reboot action
The following machines require post-boot actions (mostly entering
@@ -122,50 +94,64 @@ passphrases). Make sure admins that have the passphrases are on hand for
the reboot:
____
* backup-2 (LUKS passphrase on boot)
* serverbeach* (requires fixing firewall rules, see below)
* backup01 (ssh agent passphrase for the backup ssh key)
* sign-vault01 (NSS passphrase for the sigul service and LUKS passphrase)
* sign-bridge01 (run 'sigul_bridge -dvv' after it comes back up, no passphrase needed)
* autosign01 (NSS passphrase for the robosignatory service and LUKS passphrase)
* buildvm-s390x-15/16/16 (need the sshfs mount of the koji volume redone)
* batcave01 (ssh agent passphrase for the ansible ssh key)
* notifs-backend01 (raise the rabbitmq consumer timeout and restart the
fmn backend and workers; see the commands after this list)
____
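
The notifs-backend01 steps referenced above, collected into one block for
convenience (run as root on notifs-backend01 once it is back up):

....
rabbitmqctl eval 'application:set_env(rabbit, consumer_timeout, 36000000).'
systemctl restart fmn-backend@1
for i in `seq 1 24`; do echo $i; systemctl restart fmn-worker@$i | cat; done
....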
Each serverbeach host needs 3 or 4 iptables rules added anytime it's
rebooted or libvirt is upgraded:
....
iptables -I FORWARD -o virbr0 -j ACCEPT
iptables -I FORWARD -i virbr0 -j ACCEPT
iptables -t nat -I POSTROUTING -s 192.168.122.3/32 -j SNAT --to-source 66.135.62.187
....
[NOTE]
====
The source is the internal guest ips, the to-source is the external ips
that map to that guest ip. If there are multiple guests, each one needs
the above SNAT rule inserted.
====
=== Schedule autoqa01 reboot
There is currently an autoqa01.c host on cnode01. Check with QA folks
before rebooting this guest/host.
=== Bastion01 and Bastion02 and openvpn server
We need one of the bastion machines to be up to provide openvpn for all
machines. Before rebooting bastion02, modify:
`manifests/nodes/bastion0*.iad2.fedoraproject.org.pp` files to start
openvpn server on bastion01, wait for all clients to re-connect, reboot
bastion02 and then revert back to it as openvpn hub.
If a reboot of bastion01 is done during an outage, nothing needs to be changed
here. However, if bastion01 will be down for an extended period of time,
openvpn can be switched to bastion02 by stopping openvpn-server@openvpn
on bastion01 and starting it on bastion02:

on bastion01: 'systemctl stop openvpn-server@openvpn'
on bastion02: 'systemctl start openvpn-server@openvpn'

The process can be reversed after the other host is back.
Clients try 01 first, then 02 if it's down. It's important
to make sure all the clients are using one machine or the
other, because if they are split, routing between machines
may be confused.

=== Special yum directives

Sometimes we will wish to exclude or otherwise modify the yum.conf on a
machine. For this purpose, all machines have an include, making them
read
http://infrastructure.fedoraproject.org/infra/hosts/FQHN/yum.conf.include
(TODO Fix link)
from the infrastructure repo. If you need to make such changes, add them
to the infrastructure repo before doing updates.
=== batcave01
batcave01 is our ansible control host. It's where you run playbooks
that have been mentioned in this SOP. However, it too needs updating
and rebooting, and you cannot use the vhost_reboot playbook for it,
since it would be rebooting its own virthost. For this host you should
go to the virthost and 'virsh shutdown' all the other VMs, then
'virsh shutdown' batcave01, then reboot the virthost manually.
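
A rough sketch of that sequence, run as root on batcave01's virthost (guest
names and the grep filter are illustrative; check 'virsh list' first and shut
guests down in whatever order makes sense):

....
# shut down every running guest except batcave01 first
for vm in $(virsh list --name | grep -v batcave01); do virsh shutdown "$vm"; done
virsh list            # wait until only batcave01 is still running
virsh shutdown batcave01
# once batcave01 shows as shut off:
reboot
....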
=== noc01 / dhcp server
noc01 is our dhcp server. Unfortunately, when rebooting the vmhost that
contains the noc01 vm, that vmhost has no dhcp server to
answer it when it boots and tries to configure networking to talk to
the tang server. To work around this you can run a simple dhcpd
on batcave01: start it there, let the vmhost with noc01 come
up, and then stop it. Ideally we would make another dhcp host
to avoid this issue at some point.

On batcave01: 'systemctl start dhcpd'
(remember to stop it after the vmhost comes back up)
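
As a short sketch of that workaround (run on batcave01; the important part is
remembering the final stop):

....
systemctl start dhcpd
# reboot the vmhost that hosts noc01 and wait for it (and noc01) to come back
systemctl stop dhcpd
....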
=== Special package management directives
Sometimes we need to exclude something from being updated.
This can be done with the package_excludes variable. Set
it and the playbooks doing updates will exclude the listed items.
This variable is set in ansible/host_vars or ansible/group_vars
for the host or group.
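
For illustration, a host_vars entry might look like the following sketch (the
host name and package patterns are hypothetical, and the exact value format
should be checked against the update playbooks that consume the variable):

....
# ansible/host_vars/db01.iad2.fedoraproject.org
package_excludes: 'kernel* mariadb*'
....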
== Update Leader
@@ -178,168 +164,29 @@ come back up from reboot or aren't working right after reboot. It's
important to avoid multiple people operating on a single machine in a
read-write manner and interfering with changes.
Usually for a mass update/reboot there will be a hackmd or similar
document that tracks which machines have already been rebooted
and who is working on which one. Please check with the leader
for a link to this document.

== Updates and Reboots via playbook

There are several playbooks related to this task:

* vhost_update.yml applies updates to a vmhost and all of its guests
* vhost_reboot.yml shuts down the VMs and reboots a vmhost
* vhost_update_reboot.yml does both of the above

For hosts outside of an outage you probably want to use these to make sure
updates are applied before reboots. Once updates are applied globally
before the outage you will want to just use the reboot playbook.
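
As an illustration, a single-virthost run from batcave01 might look like this
(the rbac-playbook wrapper, playbook path and the name of the target variable
are assumptions; check the playbook headers before running):

....
sudo rbac-playbook vhost_update_reboot.yml -e target=virthost05.iad2.fedoraproject.org
....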
== Group A reboots

Group A machines are end user critical ones. Outages here should be
planned at least a week in advance and announced to the announce list.

List of machines currently in A group (note: this is going to be
automated).

These hosts are grouped based on the virt host they reside on:
* torrent02.fedoraproject.org
* ibiblio02.fedoraproject.org
* people03.fedoraproject.org
* ibiblio03.fedoraproject.org
* collab01.fedoraproject.org
* serverbeach09.fedoraproject.org
* db05.iad2.fedoraproject.org
* virthost03.iad2.fedoraproject.org
* db01.iad2.fedoraproject.org
* virthost04.iad2.fedoraproject.org
* db-fas01.iad2.fedoraproject.org
* proxy01.iad2.fedoraproject.org
* virthost05.iad2.fedoraproject.org
* ask01.iad2.fedoraproject.org
* virthost06.iad2.fedoraproject.org
These are the rest:
* bapp02.iad2.fedoraproject.org
* bastion02.iad2.fedoraproject.org
* app05.fedoraproject.org
* backup02.fedoraproject.org
* bastion01.iad2.fedoraproject.org
* fas01.iad2.fedoraproject.org
* fas02.iad2.fedoraproject.org
* log02.iad2.fedoraproject.org
* memcached03.iad2.fedoraproject.org
* noc01.iad2.fedoraproject.org
* ns02.fedoraproject.org
* ns04.iad2.fedoraproject.org
* proxy04.fedoraproject.org
* smtp-mm03.fedoraproject.org
* batcave02.iad2.fedoraproject.org
* mm3test.fedoraproject.org
* packages02.iad2.fedoraproject.org
=== Group B reboots
This Group contains machines that contributors use. Announcements of
outages here should be at least a week in advance and sent to the
devel-announce list.
These hosts are grouped based on the virt host they reside on:
* db04.iad2.fedoraproject.org
* bvirthost01.iad2.fedoraproject.org
* nfs01.iad2.fedoraproject.org
* bvirthost02.iad2.fedoraproject.org
* pkgs01.iad2.fedoraproject.org
* bvirthost03.iad2.fedoraproject.org
* kojipkgs02.iad2.fedoraproject.org
* bvirthost04.iad2.fedoraproject.org
These are the rest:
* koji04.iad2.fedoraproject.org
* releng03.iad2.fedoraproject.org
* releng04.iad2.fedoraproject.org
=== Group C reboots
Group C are machines that infrastructure uses, or can be rebooted in
such a way as to continue to provide services to others via multiple
machines. Outages here should be announced on the infrastructure list.
Group C hosts that have proxy servers on them:
* proxy02.fedoraproject.org
* ns05.fedoraproject.org
* hosted-lists01.fedoraproject.org
* internetx01.fedoraproject.org
* app01.dev.fedoraproject.org
* darkserver01.dev.fedoraproject.org
* fakefas01.fedoraproject.org
* proxy06.fedoraproject.org
* osuosl01.fedoraproject.org
* proxy07.fedoraproject.org
* bodhost01.fedoraproject.org
* proxy03.fedoraproject.org
* smtp-mm02.fedoraproject.org
* tummy01.fedoraproject.org
* app06.fedoraproject.org
* noc02.fedoraproject.org
* proxy05.fedoraproject.org
* smtp-mm01.fedoraproject.org
* telia01.fedoraproject.org
* app08.fedoraproject.org
* proxy08.fedoraproject.org
* coloamer01.fedoraproject.org
Other Group C hosts:
* ask01.stg.iad2.fedoraproject.org
* app02.stg.iad2.fedoraproject.org
* proxy01.stg.iad2.fedoraproject.org
* releng01.stg.iad2.fedoraproject.org
* value01.stg.iad2.fedoraproject.org
* virthost13.iad2.fedoraproject.org
* db-fas01.stg.iad2.fedoraproject.org
* pkgs01.stg.iad2.fedoraproject.org
* packages01.stg.iad2.fedoraproject.org
* virthost11.iad2.fedoraproject.org
* app01.stg.iad2.fedoraproject.org
* koji01.stg.iad2.fedoraproject.org
* db02.stg.iad2.fedoraproject.org
* fas01.stg.iad2.fedoraproject.org
* virthost10.iad2.fedoraproject.org
* autoqa01.qa.fedoraproject.org
* autoqa-stg01.qa.fedoraproject.org
* bastion-comm01.qa.fedoraproject.org
* batcave-comm01.qa.fedoraproject.org
* virthost-comm01.qa.fedoraproject.org
* compose-x86-01.iad2.fedoraproject.org
* compose-x86-02.iad2.fedoraproject.org
* download01.iad2.fedoraproject.org
* download02.iad2.fedoraproject.org
* download03.iad2.fedoraproject.org
* download04.iad2.fedoraproject.org
* download05.iad2.fedoraproject.org
* download-rdu01.vpn.fedoraproject.org
* download-rdu02.vpn.fedoraproject.org
* download-rdu03.vpn.fedoraproject.org
* fas03.iad2.fedoraproject.org
* secondary01.iad2.fedoraproject.org
* memcached04.iad2.fedoraproject.org
* virthost01.iad2.fedoraproject.org
* app02.iad2.fedoraproject.org
* value03.iad2.fedoraproject.org
* virthost07.iad2.fedoraproject.org
* app03.iad2.fedoraproject.org
* value04.iad2.fedoraproject.org
* ns03.iad2.fedoraproject.org
* darkserver01.iad2.fedoraproject.org
* virthost08.iad2.fedoraproject.org
* app04.iad2.fedoraproject.org
* packages02.iad2.fedoraproject.org
* virthost09.iad2.fedoraproject.org
* hosted03.fedoraproject.org
* serverbeach06.fedoraproject.org
* hosted04.fedoraproject.org
* serverbeach07.fedoraproject.org
* collab02.fedoraproject.org
* serverbeach08.fedoraproject.org
* dhcp01.iad2.fedoraproject.org
* relepel01.iad2.fedoraproject.org
* sign-bridge02.iad2.fedoraproject.org
* koji03.iad2.fedoraproject.org
* bvirthost05.iad2.fedoraproject.org
* (disable each builder in turn, update and reenable).
* ppc11.iad2.fedoraproject.org
* ppc12.iad2.fedoraproject.org
* backup03
Additionally, there are two more playbooks to check things:

* check-for-nonvirt-updates.yml
* check-for-updates.yml

See those playbooks for more information, but basically they allow
you to see how many updates are pending on all the virthosts/bare
metal machines and/or all machines. This is good to run at the end
of an outage to confirm that everything was updated.
== Doing the upgrade
@@ -348,55 +195,21 @@ If possible, system upgrades should be done in advance of the reboot
make sure that the Infrastructure RHEL repo is updated as necessary to
pull in the new packages (xref:infra-repo.adoc[Infrastructure Yum Repo SOP])
On batcave01, as root run:

....
func-yum [--host=hostname] update
....

NOTE: --host can be specified multiple times and takes wildcards.

Ping people as necessary if you are unsure about any packages.
Additionally you can see which machines still need to be rebooted with:
....
sudo func-command --timeout=10 --oneline /usr/local/bin/needs-reboot.py | grep yes
....
You can also see which machines would need a reboot if updates were all
applied with:
....
sudo func-command --timeout=10 --oneline /usr/local/bin/needs-reboot.py after-updates | grep yes
....
== Doing the reboot
In the order determined above, reboots will usually be grouped by the
virtualization hosts that the servers are on. You can see the guests per
virt host on batcave01 in `/var/log/virthost-lists.out`.

To reboot sets of boxes based on which virthost they are on, we've written
a special script to facilitate this:
....
func-vhost-reboot virthost-fqdn
....
ex:
....
sudo func-vhost-reboot virthost13.iad2.fedoraproject.org
....
Before the outage, ansible can be used to just apply all updates to hosts, or
to apply all updates to staging hosts before those are done. Something like:

ansible hostlist -m shell -a 'yum clean all; yum update -y; rkhunter --propupd'
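
For example, limiting the same command to staging might look like this (the
'staging' group name is an assumption; substitute whatever inventory pattern
matches the staging hosts):

....
sudo ansible staging -m shell -a 'yum clean all; yum update -y; rkhunter --propupd'
....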
== Aftermath
[arabic]
. Make sure that everything is running fine
. Check nagios for alerts and clear them all
. Re-enable nagios notifications after they are cleared
. Make sure to perform any manual post-boot setup (such as entering
passphrases for encrypted volumes)
. Consider running check-for-updates or check-for-nonvirt-updates to confirm
that all hosts are updated
. Close the fedorastatus outage
. Close the outage ticket
=== Non virthost reboots
@@ -405,8 +218,5 @@ If you need to reboot specific hosts and make sure they recover -
consider using:
....
sudo func-host-reboot hostname hostname1 hostname2 ...
sudo ansible hostname -m reboot
....
If you want to reboot the hosts one at a time, waiting for each to come
back before rebooting the next, pass a `-o` to `func-host-reboot`.
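
A comparable ansible ad-hoc sketch for serial reboots (host names are
placeholders; -f 1 limits ansible to one host at a time, and the reboot
module waits for each host to return before moving on):

....
sudo ansible hostname1,hostname2 -m reboot -f 1
....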