massupgrade: lots of fixes and improvements
This SOP had a bunch of old stuff in it. This syncs it back up with reality, mostly. Proofreading/formatting welcome. Questions also welcome. :)

Signed-off-by: Kevin Fenzi <kevin@scrye.com>
parent a7dd90e1c3
commit f03fde0f5a
1 changed file with 115 additions and 305 deletions
@@ -38,39 +38,29 @@ Purpose:::

== Preparation

[arabic]
. Determine which host group you are going to be doing updates/reboots on.
+
Group "A"::
servers that end users will see or note being down, and anything that depends on them.
Group "B"::
servers that contributors will see or note being down, and anything that depends on them.
Group "C"::
servers that infrastructure will notice are down, or that are redundant enough to reboot some while others take the load.
. Appoint an 'Update Leader' for the updates.
. Follow the xref:outage.adoc[Outage Infrastructure SOP] and send advance notification to the appropriate lists. Try to schedule the update at a time when many admins are around to help/watch for problems and when the impact on the affected group is low. Do NOT do multiple groups on the same day if possible.
. Plan an order for rebooting the machines considering two factors:
+
____
* Location of systems on the kvm or xen hosts. [You will normally reboot all systems on a host together.]
* Impact of systems going down on other services, operations and users. Since the database servers and nfs servers are the backbone of many other systems, they and the systems on the same xen boxes would be rebooted before other boxes.
____
. To aid in organizing a mass upgrade/reboot with many people helping, it may help to create a checklist of machines in a gobby document.
. Schedule downtime in nagios.
. Make doubly sure that the various app owners are aware of the reboots.

Mass updates are usually applied every few months, or sooner if there are critical bug fixes. Mass updates are done outside of freeze windows to avoid causing any problems for Fedora releases.

The following items are all done before the actual mass update:

* Plan an outage window or windows outside of a freeze.
* File an outage ticket in the fedora-infrastructure tracker, using the outage template. This should describe the exact time/date and what is included.
* Get the outage ticket reviewed by someone else to confirm there are no mistakes in it.
* Send the outage announcement to the infrastructure and devel-announce lists (for outages that affect contributors only) or to infrastructure, devel-announce and announce (for outages that affect all users).
* Add a 'planned' outage to fedorastatus. This will show the planned outage there for higher visibility.
* Set up a hackmd or other shared document that lists all the virthosts and bare metal hosts that need rebooting, organized per day. This is used to track which admin is handling which server(s).

Typically updates/reboots are done on all staging hosts on a Monday, then all non-outage-causing hosts on Tuesday, and finally the outage-causing reboots on Wednesday.
== Staging

@@ -78,43 +68,25 @@ ____

Any updates that can be tested in staging or a pre-production environment should be tested there first, including new kernels, updates to core database applications / libraries, web applications, etc. This is typically done a few days before the actual outage. Too far in advance and things may have changed again, so it's important to do this just before the production updates.
____

== Non outage causing hosts

Some hosts can be safely updated/rebooted without an outage because they have multiple machines behind a load balancer, are not visible to end users, or for other reasons. These updates are typically done on the Tuesday of the outage week so they are done before the outage on Wednesday. These hosts include the proxies and a number of virthosts whose VMs meet these criteria.

== Special Considerations

While this may not be a complete list, here are some special things that must be taken into account before rebooting certain systems:

=== Disable builders

Before the following machines are rebooted, all koji builders should be disabled and all running jobs allowed to complete:

____
* db04
* nfs01
* kojipkgs02
____

Builders can be removed from koji, updated and re-added. Use:

....
koji disable-host NAME

and

koji enable-host NAME
....

[NOTE]
====
You must be a koji admin.
====

Additionally, rel-eng and builder boxes may need a special version of rpm. Make sure to check with rel-eng on any rpm upgrades for them.
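
The group C list below notes that builders are handled by disabling each in turn, updating and re-enabling. A minimal sketch of that cycle, assuming a koji admin login and root ssh access to the builders; the builder names and the fixed sleep are placeholders:

....
#!/bin/bash
# Cycle through builders one at a time: disable in koji, update, reboot,
# wait, then re-enable. The builder names here are placeholders.
builders="buildvm-01.iad2.fedoraproject.org buildvm-02.iad2.fedoraproject.org"
for b in $builders; do
    koji disable-host "$b"
    # Wait for running tasks on the host to finish before rebooting;
    # check 'koji list-tasks' (or the koji web UI) manually here.
    ssh "root@$b" 'yum -y update && reboot'
    sleep 300    # rough wait for the machine to come back
    ssh "root@$b" uptime && koji enable-host "$b"
done
....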

=== Post reboot action

The following machines require post-boot actions (mostly entering
@@ -122,50 +94,64 @@ passphrases). Make sure admins that have the passphrases are on hand for
the reboot:

____
* backup-2 (LUKS passphrase on boot)
* sign-vault01 (NSS passphrase for the sigul service)
* sign-bridge01 (NSS passphrase for the sigul bridge service)
* serverbeach* (requires fixing firewall rules; see below)
* backup01 (ssh agent passphrase for the backup ssh key)
* sign-vault01 (NSS passphrase for the sigul service and luks passphrase)
* sign-bridge01 (run 'sigul_bridge -dvv' after it comes back up; no passphrase needed)
* autosign01 (NSS passphrase for the robosignatory service and luks passphrase)
* buildvm-s390x-15/16/16 (need the sshfs mount of the koji volume redone)
* batcave01 (ssh agent passphrase for the ansible ssh key)
* notifs-backend01 (run `rabbitmqctl eval 'application:set_env(rabbit, consumer_timeout, 36000000).'` and then `systemctl restart fmn-backend@1; for i in $(seq 1 24); do echo $i; systemctl restart fmn-worker@$i | cat; done`)
____

Each serverbeach host needs 3 or 4 iptables rules added anytime it's rebooted or libvirt is upgraded:

....
iptables -I FORWARD -o virbr0 -j ACCEPT
iptables -I FORWARD -i virbr0 -j ACCEPT
iptables -t nat -I POSTROUTING -s 192.168.122.3/32 -j SNAT --to-source 66.135.62.187
....

[NOTE]
====
The source is the internal guest IP; the to-source is the external IP that maps to that guest IP. If there are multiple guests, each one needs the above SNAT rule inserted.
====
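
Where a serverbeach host carries several guests, a loop like the following (with made-up guest/external IP pairs) inserts the per-guest SNAT rules; the FORWARD rules are only needed once:

....
#!/bin/bash
# Internal guest IP -> external IP mappings; these pairs are examples.
declare -A snat=( [192.168.122.3]=66.135.62.187
                  [192.168.122.4]=66.135.62.188 )
iptables -I FORWARD -o virbr0 -j ACCEPT
iptables -I FORWARD -i virbr0 -j ACCEPT
for guest_ip in "${!snat[@]}"; do
    iptables -t nat -I POSTROUTING -s "${guest_ip}/32" -j SNAT --to-source "${snat[$guest_ip]}"
done
....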
=== Schedule autoqa01 reboot

There is currently an autoqa01.c host on cnode01. Check with QA folks before rebooting this guest/host.
=== Bastion01 and Bastion02 and openvpn server

We need one of the bastion machines to be up to provide openvpn for all machines. Before rebooting bastion02, modify the `manifests/nodes/bastion0*.iad2.fedoraproject.org.pp` files to start the openvpn server on bastion01, wait for all clients to re-connect, reboot bastion02 and then revert back to it as the openvpn hub.

If a reboot of bastion01 is done during an outage, nothing needs to be changed here. However, if bastion01 will be down for an extended period of time, openvpn can be switched to bastion02 by stopping the service on bastion01 and starting it on bastion02:

on bastion01: 'systemctl stop openvpn-server@openvpn'
on bastion02: 'systemctl start openvpn-server@openvpn'

The process can be reversed after the other is back. Clients try 01 first, then 02 if it's down. It's important to make sure all the clients are using one machine or the other, because if they are split, routing between machines may be confused.

=== Special yum directives

Sometimes we will wish to exclude or otherwise modify the yum.conf on a machine. For this purpose, all machines have an include, making them read http://infrastructure.fedoraproject.org/infra/hosts/FQHN/yum.conf.include (TODO Fix link) from the infrastructure repo. If you need to make such changes, add them to the infrastructure repo before doing updates.
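
As an illustration only (the real include files live in the infrastructure repo), such an include might carry a yum exclude line:

....
# hypothetical content of infra/hosts/FQHN/yum.conf.include
exclude=kernel* mariadb*
....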
=== batcave01

batcave01 is our ansible control host. It's where you run the playbooks that have been mentioned in this SOP. However, it too needs updating and rebooting, and you cannot use the vhost_reboot playbook for it, since it would be rebooting its own virthost. For this host you should go to the virthost, 'virsh shutdown' all the other VMs, then 'virsh shutdown' batcave01, and then reboot the virthost manually.
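
A minimal sketch of that sequence, run as root on batcave01's virthost; it assumes the guest really is named 'batcave01' in libvirt and uses a crude polling wait:

....
#!/bin/bash
# Shut down all guests except batcave01, wait, then batcave01, then reboot.
for vm in $(virsh list --name | grep -v '^batcave01$' | grep -v '^$'); do
    virsh shutdown "$vm"
done
# Wait until batcave01 is the only guest still running.
while [ "$(virsh list --name | grep -v '^batcave01$' | grep -cv '^$')" -gt 0 ]; do
    sleep 10
done
virsh shutdown batcave01
sleep 60    # allow a clean shutdown before rebooting the virthost
reboot
....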

=== noc01 / dhcp server

noc01 is our dhcp server. Unfortunately, when rebooting the vmhost that contains the noc01 vm, that vmhost has no dhcp server to answer it when it boots and tries to configure its network to talk to the tang server. To work around this you can run a simple dhcpd on batcave01: start it there, let the vmhost with noc01 come up, and then stop it. Ideally we would make another dhcp host to avoid this issue at some point.

batcave01: 'systemctl start dhcpd'

Remember to stop it after the host comes back up.
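
In practice the workaround is just bracketing the vmhost reboot with a start and a stop; a sketch:

....
# on batcave01, before rebooting the vmhost that carries noc01
systemctl start dhcpd
# ... reboot the vmhost and wait for it to come up and get its network ...
systemctl stop dhcpd
....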
=== Special package management directives

Sometimes we need to exclude something from being updated. This can be done with the package_excludes variable: set it, and the playbooks doing updates will exclude the listed items.

This variable is set in ansible/host_vars or ansible/group_vars for the host or group.
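
For illustration, a host_vars entry might look like the following; the host, the package globs, and even the exact value format are assumptions, so match existing uses of the variable in the repo:

....
# ansible/host_vars/db01.iad2.fedoraproject.org (illustrative host)
package_excludes: 'kernel*, mariadb*'
....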

== Update Leader

@@ -178,168 +164,29 @@ come back up from reboot or aren't working right after reboot. It's
important to avoid multiple people operating on a single machine in a read-write manner and interfering with changes.
== Group A reboots

Group A machines are end user critical ones. Outages here should be planned at least a week in advance and announced to the announce list.

Usually for a mass update/reboot there will be a hackmd or similar document that tracks which machines have already been rebooted and who is working on which one. Please check with the update leader for a link to this document.

List of machines currently in the A group (note: this is going to be automated).

These hosts are grouped based on the virt host they reside on (the playbooks used to update and reboot them are described under "Updates and Reboots via playbook" below):

* torrent02.fedoraproject.org
* ibiblio02.fedoraproject.org
* people03.fedoraproject.org
* ibiblio03.fedoraproject.org
* collab01.fedoraproject.org
* serverbeach09.fedoraproject.org
* db05.iad2.fedoraproject.org
* virthost03.iad2.fedoraproject.org
* db01.iad2.fedoraproject.org
* virthost04.iad2.fedoraproject.org
* db-fas01.iad2.fedoraproject.org
* proxy01.iad2.fedoraproject.org
* virthost05.iad2.fedoraproject.org
* ask01.iad2.fedoraproject.org
* virthost06.iad2.fedoraproject.org

These are the rest:

* bapp02.iad2.fedoraproject.org
* bastion02.iad2.fedoraproject.org
* app05.fedoraproject.org
* backup02.fedoraproject.org
* bastion01.iad2.fedoraproject.org
* fas01.iad2.fedoraproject.org
* fas02.iad2.fedoraproject.org
* log02.iad2.fedoraproject.org
* memcached03.iad2.fedoraproject.org
* noc01.iad2.fedoraproject.org
* ns02.fedoraproject.org
* ns04.iad2.fedoraproject.org
* proxy04.fedoraproject.org
* smtp-mm03.fedoraproject.org
* batcave02.iad2.fedoraproject.org
* mm3test.fedoraproject.org
* packages02.iad2.fedoraproject.org
=== Group B reboots

This group contains machines that contributors use. Announcements of outages here should be made at least a week in advance and sent to the devel-announce list.

These hosts are grouped based on the virt host they reside on:

* db04.iad2.fedoraproject.org
* bvirthost01.iad2.fedoraproject.org
* nfs01.iad2.fedoraproject.org
* bvirthost02.iad2.fedoraproject.org
* pkgs01.iad2.fedoraproject.org
* bvirthost03.iad2.fedoraproject.org
* kojipkgs02.iad2.fedoraproject.org
* bvirthost04.iad2.fedoraproject.org

These are the rest:

* koji04.iad2.fedoraproject.org
* releng03.iad2.fedoraproject.org
* releng04.iad2.fedoraproject.org
=== Group C reboots

Group C are machines that infrastructure uses, or that can be rebooted in such a way as to continue to provide services to others via multiple machines. Outages here should be announced on the infrastructure list.

Group C hosts that have proxy servers on them:

* proxy02.fedoraproject.org
* ns05.fedoraproject.org
* hosted-lists01.fedoraproject.org
* internetx01.fedoraproject.org
* app01.dev.fedoraproject.org
* darkserver01.dev.fedoraproject.org
* fakefas01.fedoraproject.org
* proxy06.fedoraproject.org
* osuosl01.fedoraproject.org
* proxy07.fedoraproject.org
* bodhost01.fedoraproject.org
* proxy03.fedoraproject.org
* smtp-mm02.fedoraproject.org
* tummy01.fedoraproject.org
* app06.fedoraproject.org
* noc02.fedoraproject.org
* proxy05.fedoraproject.org
* smtp-mm01.fedoraproject.org
* telia01.fedoraproject.org
* app08.fedoraproject.org
* proxy08.fedoraproject.org
* coloamer01.fedoraproject.org

Other Group C hosts:

* ask01.stg.iad2.fedoraproject.org
* app02.stg.iad2.fedoraproject.org
* proxy01.stg.iad2.fedoraproject.org
* releng01.stg.iad2.fedoraproject.org
* value01.stg.iad2.fedoraproject.org
* virthost13.iad2.fedoraproject.org
* db-fas01.stg.iad2.fedoraproject.org
* pkgs01.stg.iad2.fedoraproject.org
* packages01.stg.iad2.fedoraproject.org
* virthost11.iad2.fedoraproject.org
* app01.stg.iad2.fedoraproject.org
* koji01.stg.iad2.fedoraproject.org
* db02.stg.iad2.fedoraproject.org
* fas01.stg.iad2.fedoraproject.org
* virthost10.iad2.fedoraproject.org
* autoqa01.qa.fedoraproject.org
* autoqa-stg01.qa.fedoraproject.org
* bastion-comm01.qa.fedoraproject.org
* batcave-comm01.qa.fedoraproject.org
* virthost-comm01.qa.fedoraproject.org
* compose-x86-01.iad2.fedoraproject.org
* compose-x86-02.iad2.fedoraproject.org
* download01.iad2.fedoraproject.org
* download02.iad2.fedoraproject.org
* download03.iad2.fedoraproject.org
* download04.iad2.fedoraproject.org
* download05.iad2.fedoraproject.org
* download-rdu01.vpn.fedoraproject.org
* download-rdu02.vpn.fedoraproject.org
* download-rdu03.vpn.fedoraproject.org
* fas03.iad2.fedoraproject.org
* secondary01.iad2.fedoraproject.org
* memcached04.iad2.fedoraproject.org
* virthost01.iad2.fedoraproject.org
* app02.iad2.fedoraproject.org
* value03.iad2.fedoraproject.org
* virthost07.iad2.fedoraproject.org
* app03.iad2.fedoraproject.org
* value04.iad2.fedoraproject.org
* ns03.iad2.fedoraproject.org
* darkserver01.iad2.fedoraproject.org
* virthost08.iad2.fedoraproject.org
* app04.iad2.fedoraproject.org
* packages02.iad2.fedoraproject.org
* virthost09.iad2.fedoraproject.org
* hosted03.fedoraproject.org
* serverbeach06.fedoraproject.org
* hosted04.fedoraproject.org
* serverbeach07.fedoraproject.org
* collab02.fedoraproject.org
* serverbeach08.fedoraproject.org
* dhcp01.iad2.fedoraproject.org
* relepel01.iad2.fedoraproject.org
* sign-bridge02.iad2.fedoraproject.org
* koji03.iad2.fedoraproject.org
* bvirthost05.iad2.fedoraproject.org (disable each builder in turn, update and re-enable)
* ppc11.iad2.fedoraproject.org
* ppc12.iad2.fedoraproject.org
* backup03

== Updates and Reboots via playbook

There are several playbooks related to this task:

* vhost_update.yml applies updates to a vmhost and all of its guests.
* vhost_reboot.yml shuts down the VMs and reboots a vmhost.
* vhost_update_reboot.yml does both of the above.

For hosts handled outside an outage you probably want to use these to make sure updates are applied before reboots. Once updates have been applied globally before the outage, you will want to just use the reboot playbook; a sketch of a typical invocation is shown below.
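
A sketch of invoking these from batcave01; other SOPs run playbooks through the rbac-playbook wrapper, but the exact flags for passing the target virthost are an assumption here, so check the playbook source first:

....
# run from batcave01; flags are illustrative, check the playbook for
# how it expects the target virthost to be passed
sudo rbac-playbook vhost_update.yml -l virthost05.iad2.fedoraproject.org
sudo rbac-playbook vhost_reboot.yml -l virthost05.iad2.fedoraproject.org
....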

Additionally there are two more playbooks to check things:

* check-for-nonvirt-updates.yml
* check-for-updates.yml

See those playbooks for more information, but basically they allow you to see how many updates are pending on all the virthosts/bare metal machines and/or all machines. This is good to run at the end of an outage to confirm that everything was updated.
== Doing the upgrade

@@ -348,55 +195,21 @@ If possible, system upgrades should be done in advance of the reboot
make sure that the Infrastructure RHEL repo is updated as necessary to pull in the new packages (xref:infra-repo.adoc[Infrastructure Yum Repo SOP]).

On batcave01, as root run:

....
func-yum [--host=hostname] update
....

NOTE: --host can be specified multiple times and takes wildcards.

Ping people as necessary if you are unsure about any packages.

Additionally you can see which machines still need to be rebooted with:

....
sudo func-command --timeout=10 --oneline /usr/local/bin/needs-reboot.py | grep yes
....

You can also see which machines would need a reboot if all updates were applied with:

....
sudo func-command --timeout=10 --oneline /usr/local/bin/needs-reboot.py after-updates | grep yes
....
== Doing the reboot

In the order determined above, reboots will usually be grouped by the virtualization hosts that the servers are on. You can see the guests per virt host on batcave01 in `/var/log/virthost-lists.out`.

To reboot sets of boxes based on which virthost they are on, we've written a special script which facilitates it:

....
func-vhost-reboot virthost-fqdn
....

ex:

....
sudo func-vhost-reboot virthost13.iad2.fedoraproject.org
....

Before the outage, ansible can be used to just apply all updates to hosts, or to apply all updates to the staging hosts before those are done. Something like:

....
ansible hostlist -m shell -a 'yum clean all; yum update -y; rkhunter --propupd'
....
== Aftermath

[arabic]
. Make sure that everything's running fine.
. Check nagios for alerts and clear them all.
. Re-enable nagios notifications after they are cleared.
. Make sure to perform any manual post-boot setup (such as entering passphrases for encrypted volumes).
. Consider running check-for-updates or check-for-nonvirt-updates to confirm that all hosts are updated.
. Close the fedorastatus outage.
. Close the outage ticket.
=== Non virthost reboots

@@ -405,8 +218,5 @@ If you need to reboot specific hosts and make sure they recover -
consider using:

....
sudo func-host-reboot hostname hostname1 hostname2 ...
sudo ansible -m reboot hostname
....

If you want to reboot the hosts one at a time, waiting for each to come back before rebooting the next, pass `-o` to `func-host-reboot`.
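
With the ansible variant, similar one-at-a-time behavior can be had by limiting parallelism, since the reboot module waits for each host to return before the task is considered done; a sketch:

....
# -f 1 limits ansible to one host at a time; the reboot module waits
# for each host to come back before moving on to the next
sudo ansible -f 1 -m reboot hostname1:hostname2
....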