418 lines
12 KiB
Text
418 lines
12 KiB
Text
= Mass Upgrade Infrastructure SOP
|
|
|
|
Every once in a while, we need to apply mass upgrades to our servers for
|
|
various security and other upgrades.
|
|
|
|
== Contents
|
|
|
|
[arabic]
|
|
. Contact Information
|
|
. Preparation
|
|
. Staging
|
|
. Special Considerations
|
|
+
|
|
____
|
|
* Disable builders
|
|
* Post reboot action
|
|
* Schedule autoqa01 reboot
|
|
* Bastion01 and Bastion02 and openvpn server
|
|
* Special yum directives
|
|
____
|
|
. Update Leader
|
|
. Group A reboots
|
|
. Group B reboots
|
|
. Group C reboots
|
|
. Doing the upgrade
|
|
. Doing the reboot
|
|
. Aftermath
|
|
|
|
== Contact Information
|
|
|
|
Owner:::
|
|
Fedora Infrastructure Team
|
|
Contact:::
|
|
#fedora-admin, sysadmin-main, infrastructure@lists.fedoraproject.org,
|
|
#fedora-noc
|
|
Location:::
|
|
All over the world.
|
|
Servers:::
|
|
all
|
|
Purpose:::
|
|
Apply kernel/other upgrades to all of our servers
|
|
|
|
== Preparation
|
|
|
|
[arabic]
|
|
. Determine which host group you are going to be doing updates/reboots
|
|
on.
|
|
+
|
|
Group "A"::
|
|
servers that end users will see or note being down and anything that
|
|
depends on them.
|
|
Group "B"::
|
|
servers that contributors will see or note being down and anything
|
|
that depends on them.
|
|
Group "C"::
|
|
servers that infrastructure will notice are down, or are redundent
|
|
enough to reboot some with others taking the load.
|
|
. Appoint an 'Update Leader' for the updates.
|
|
. Follow the [61]Outage Infrastructure SOP and send advance notification
|
|
to the appropriate lists. Try to schedule the update at a time when many
|
|
admins are around to help/watch for problems and when impact for the
|
|
group affected is less. Do NOT do multiple groups on the same day if
|
|
possible.
|
|
. Plan an order for rebooting the machines considering two factors:
|
|
+
|
|
____
|
|
* Location of systems on the kvm or xen hosts. [You will normally reboot
|
|
all systems on a host together]
|
|
* Impact of systems going down on other services, operations and users.
|
|
Thus since the database servers and nfs servers are the backbone of many
|
|
other systems, they and systems that are on the same xen boxes would be
|
|
rebooted before other boxes.
|
|
____
|
|
. To aid in organizing a mass upgrade/reboot with many people helping,
|
|
it may help to create a checklist of machines in a gobby document.
|
|
. Schedule downtime in nagios.
|
|
. Make doubly sure that various app owners are aware of the reboots
|
|
|
|
== Staging
|
|
|
|
____
|
|
Any updates that can be tested in staging or a pre-production
|
|
environment should be tested there first. Including new kernels, updates
|
|
to core database applications / libraries. Web applications, libraries,
|
|
etc.
|
|
____
|
|
|
|
== Special Considerations
|
|
|
|
While this may not be a complete list, here are some special things that
|
|
must be taken into account before rebooting certain systems:
|
|
|
|
=== Disable builders
|
|
|
|
Before the following machines are rebooted, all koji builders should be
|
|
disabled and all running jobs allowed to complete:
|
|
|
|
____
|
|
* db04
|
|
* nfs01
|
|
* kojipkgs02
|
|
____
|
|
|
|
Builders can be removed from koji, updated and re-added. Use:
|
|
|
|
....
|
|
koji disable-host NAME
|
|
|
|
and
|
|
|
|
koji enable-host NAME
|
|
....
|
|
|
|
[NOTE]
|
|
.Note
|
|
====
|
|
you must be a koji admin
|
|
====
|
|
Additionally, rel-eng and builder boxes may need a special version
|
|
of rpm. Make sure to check with rel-eng on any rpm upgrades for them.
|
|
|
|
=== Post reboot action
|
|
|
|
The following machines require post-boot actions (mostly entering
|
|
passphrases). Make sure admins that have the passphrases are on hand for
|
|
the reboot:
|
|
|
|
____
|
|
* backup-2 (LUKS passphrase on boot)
|
|
* sign-vault01 (NSS passphrase for sigul service)
|
|
* sign-bridge01 (NSS passphrase for sigul bridge service)
|
|
* serverbeach* (requires fixing firewall rules):
|
|
____
|
|
|
|
Each serverbeach host needs 3 or 4 iptables rules added anytime it's
|
|
rebooted or libvirt is upgraded:
|
|
|
|
....
|
|
iptables -I FORWARD -o virbr0 -j ACCEPT
|
|
iptables -I FORWARD -i virbr0 -j ACCEPT
|
|
iptables -t nat -I POSTROUTING -s 192.168.122.3/32 -j SNAT --to-source 66.135.62.187
|
|
....
|
|
|
|
[NOTE]
|
|
.Note
|
|
====
|
|
The source is the internal guest ips, the to-source is the external ips
|
|
that map to that guest ip. If there are multiple guests, each one needs
|
|
the above SNAT rule inserted.
|
|
====
|
|
=== Schedule autoqa01 reboot
|
|
|
|
There is currently an autoqa01.c host on cnode01. Check with QA folks
|
|
before rebooting this guest/host.
|
|
|
|
=== Bastion01 and Bastion02 and openvpn server
|
|
|
|
We need one of the bastion machines to be up to provide openvpn for all
|
|
machines. Before rebooting bastion02, modify:
|
|
`manifests/nodes/bastion0*.phx2.fedoraproject.org.pp` files to start
|
|
openvpn server on bastion01, wait for all clients to re-connect, reboot
|
|
bastion02 and then revert back to it as openvpn hub.
|
|
|
|
=== Special yum directives
|
|
|
|
Sometimes we will wish to exclude or otherwise modify the yum.conf on a
|
|
machine. For this purpose, all machines have an include, making them
|
|
read
|
|
[62]http://infrastructure.fedoraproject.org/infra/hosts/FQHN/yum.conf.include
|
|
from the infrastructure repo. If you need to make such changes, add them
|
|
to the infrastructure repo before doing updates.
|
|
|
|
== Update Leader
|
|
|
|
Each update should have a Leader appointed. This person will be in
|
|
charge of doing any read-write operations, and delegating to others to
|
|
do tasks. If you aren't specficially asked by the Leader to reboot or
|
|
change something, please don't. The Leader will assign out machine
|
|
groups to reboot, or ask specific people to look at machines that didn't
|
|
come back up from reboot or aren't working right after reboot. It's
|
|
important to avoid multiple people operating on a single machine in a
|
|
read-write manner and interfering with changes.
|
|
|
|
== Group A reboots
|
|
|
|
Group A machines are end user critical ones. Outages here should be
|
|
planned at least a week in advance and announced to the announce list.
|
|
|
|
List of machines currently in A group (note: this is going to be
|
|
automated)
|
|
|
|
These hosts are grouped based on the virt host they reside on:
|
|
|
|
* torrent02.fedoraproject.org
|
|
* ibiblio02.fedoraproject.org
|
|
* people03.fedoraproject.org
|
|
* ibiblio03.fedoraproject.org
|
|
* collab01.fedoraproject.org
|
|
* serverbeach09.fedoraproject.org
|
|
* db05.phx2.fedoraproject.org
|
|
* virthost03.phx2.fedoraproject.org
|
|
* db01.phx2.fedoraproject.org
|
|
* virthost04.phx2.fedoraproject.org
|
|
* db-fas01.phx2.fedoraproject.org
|
|
* proxy01.phx2.fedoraproject.org
|
|
* virthost05.phx2.fedoraproject.org
|
|
* ask01.phx2.fedoraproject.org
|
|
* virthost06.phx2.fedoraproject.org
|
|
|
|
These are the rest:
|
|
|
|
* bapp02.phx2.fedoraproject.org
|
|
* bastion02.phx2.fedoraproject.org
|
|
* app05.fedoraproject.org
|
|
* backup02.fedoraproject.org
|
|
* bastion01.phx2.fedoraproject.org
|
|
* fas01.phx2.fedoraproject.org
|
|
* fas02.phx2.fedoraproject.org
|
|
* log02.phx2.fedoraproject.org
|
|
* memcached03.phx2.fedoraproject.org
|
|
* noc01.phx2.fedoraproject.org
|
|
* ns02.fedoraproject.org
|
|
* ns04.phx2.fedoraproject.org
|
|
* proxy04.fedoraproject.org
|
|
* smtp-mm03.fedoraproject.org
|
|
* batcave02.phx2.fedoraproject.org
|
|
* mm3test.fedoraproject.org
|
|
* packages02.phx2.fedoraproject.org
|
|
|
|
=== Group B reboots
|
|
|
|
This Group contains machines that contributors use. Announcements of
|
|
outages here should be at least a week in advance and sent to the
|
|
devel-announce list.
|
|
|
|
These hosts are grouped based on the virt host they reside on:
|
|
|
|
* db04.phx2.fedoraproject.org
|
|
* bvirthost01.phx2.fedoraproject.org
|
|
* nfs01.phx2.fedoraproject.org
|
|
* bvirthost02.phx2.fedoraproject.org
|
|
* pkgs01.phx2.fedoraproject.org
|
|
* bvirthost03.phx2.fedoraproject.org
|
|
* kojipkgs02.phx2.fedoraproject.org
|
|
* bvirthost04.phx2.fedoraproject.org
|
|
|
|
These are the rest:
|
|
|
|
* koji04.phx2.fedoraproject.org
|
|
* releng03.phx2.fedoraproject.org
|
|
* releng04.phx2.fedoraproject.org
|
|
|
|
=== Group C reboots
|
|
|
|
Group C are machines that infrastructure uses, or can be rebooted in
|
|
such a way as to continue to provide services to others via multiple
|
|
machines. Outages here should be announced on the infrastructure list.
|
|
|
|
Group C hosts that have proxy servers on them:
|
|
|
|
* proxy02.fedoraproject.org
|
|
* ns05.fedoraproject.org
|
|
* hosted-lists01.fedoraproject.org
|
|
* internetx01.fedoraproject.org
|
|
* app01.dev.fedoraproject.org
|
|
* darkserver01.dev.fedoraproject.org
|
|
* fakefas01.fedoraproject.org
|
|
* proxy06.fedoraproject.org
|
|
* osuosl01.fedoraproject.org
|
|
* proxy07.fedoraproject.org
|
|
* bodhost01.fedoraproject.org
|
|
* proxy03.fedoraproject.org
|
|
* smtp-mm02.fedoraproject.org
|
|
* tummy01.fedoraproject.org
|
|
* app06.fedoraproject.org
|
|
* noc02.fedoraproject.org
|
|
* proxy05.fedoraproject.org
|
|
* smtp-mm01.fedoraproject.org
|
|
* telia01.fedoraproject.org
|
|
* app08.fedoraproject.org
|
|
* proxy08.fedoraproject.org
|
|
* coloamer01.fedoraproject.org
|
|
+
|
|
____
|
|
Other Group C hosts:
|
|
____
|
|
* ask01.stg.phx2.fedoraproject.org
|
|
* app02.stg.phx2.fedoraproject.org
|
|
* proxy01.stg.phx2.fedoraproject.org
|
|
* releng01.stg.phx2.fedoraproject.org
|
|
* value01.stg.phx2.fedoraproject.org
|
|
* virthost13.phx2.fedoraproject.org
|
|
* db-fas01.stg.phx2.fedoraproject.org
|
|
* pkgs01.stg.phx2.fedoraproject.org
|
|
* packages01.stg.phx2.fedoraproject.org
|
|
* virthost11.phx2.fedoraproject.org
|
|
* app01.stg.phx2.fedoraproject.org
|
|
* koji01.stg.phx2.fedoraproject.org
|
|
* db02.stg.phx2.fedoraproject.org
|
|
* fas01.stg.phx2.fedoraproject.org
|
|
* virthost10.phx2.fedoraproject.org
|
|
* autoqa01.qa.fedoraproject.org
|
|
* autoqa-stg01.qa.fedoraproject.org
|
|
* bastion-comm01.qa.fedoraproject.org
|
|
* batcave-comm01.qa.fedoraproject.org
|
|
* virthost-comm01.qa.fedoraproject.org
|
|
* compose-x86-01.phx2.fedoraproject.org
|
|
* compose-x86-02.phx2.fedoraproject.org
|
|
* download01.phx2.fedoraproject.org
|
|
* download02.phx2.fedoraproject.org
|
|
* download03.phx2.fedoraproject.org
|
|
* download04.phx2.fedoraproject.org
|
|
* download05.phx2.fedoraproject.org
|
|
* download-rdu01.vpn.fedoraproject.org
|
|
* download-rdu02.vpn.fedoraproject.org
|
|
* download-rdu03.vpn.fedoraproject.org
|
|
* fas03.phx2.fedoraproject.org
|
|
* secondary01.phx2.fedoraproject.org
|
|
* memcached04.phx2.fedoraproject.org
|
|
* virthost01.phx2.fedoraproject.org
|
|
* app02.phx2.fedoraproject.org
|
|
* value03.phx2.fedoraproject.org
|
|
* virthost07.phx2.fedoraproject.org
|
|
* app03.phx2.fedoraproject.org
|
|
* value04.phx2.fedoraproject.org
|
|
* ns03.phx2.fedoraproject.org
|
|
* darkserver01.phx2.fedoraproject.org
|
|
* virthost08.phx2.fedoraproject.org
|
|
* app04.phx2.fedoraproject.org
|
|
* packages02.phx2.fedoraproject.org
|
|
* virthost09.phx2.fedoraproject.org
|
|
* hosted03.fedoraproject.org
|
|
* serverbeach06.fedoraproject.org
|
|
* hosted04.fedoraproject.org
|
|
* serverbeach07.fedoraproject.org
|
|
* collab02.fedoraproject.org
|
|
* serverbeach08.fedoraproject.org
|
|
* dhcp01.phx2.fedoraproject.org
|
|
* relepel01.phx2.fedoraproject.org
|
|
* sign-bridge02.phx2.fedoraproject.org
|
|
* koji03.phx2.fedoraproject.org
|
|
* bvirthost05.phx2.fedoraproject.org
|
|
* (disable each builder in turn, update and reenable).
|
|
* ppc11.phx2.fedoraproject.org
|
|
* ppc12.phx2.fedoraproject.org
|
|
* backup03
|
|
|
|
== Doing the upgrade
|
|
|
|
If possible, system upgrades should be done in advance of the reboot
|
|
(with relevant testing of new packages on staging). To do the upgrades,
|
|
make sure that the Infrastructure RHEL repo is updated as necessary to
|
|
pull in the new packages ([63]Infrastructure Yum Repo SOP)
|
|
|
|
On batcave01, as root run:
|
|
|
|
....
|
|
func-yum [--host=hostname] update
|
|
....
|
|
|
|
..note: --host can be specified multiple times and takes wildcards.
|
|
|
|
pinging people as necessary if you are unsure about any packages.
|
|
|
|
Additionally you can see which machines still need rebooted with:
|
|
|
|
....
|
|
sudo func-command --timeout=10 --oneline /usr/local/bin/needs-reboot.py | grep yes
|
|
....
|
|
|
|
You can also see which machines would need a reboot if updates were all
|
|
applied with:
|
|
|
|
....
|
|
sudo func-command --timeout=10 --oneline /usr/local/bin/needs-reboot.py after-updates | grep yes
|
|
....
|
|
|
|
== Doing the reboot
|
|
|
|
In the order determined above, reboots will usually be grouped by the
|
|
virtualization hosts that the servers are on. You can see the guests per
|
|
virt host on batcave01 in /var/log/virthost-lists.out
|
|
|
|
To reboot sets of boxes based on which virthost they are we've written a
|
|
special script which facilitates it:
|
|
|
|
....
|
|
func-vhost-reboot virthost-fqdn
|
|
....
|
|
|
|
ex:
|
|
|
|
....
|
|
sudo func-vhost-reboot virthost13.phx2.fedoraproject.org
|
|
....
|
|
|
|
== Aftermath
|
|
|
|
[arabic]
|
|
. Make sure that everything's running fine
|
|
. Reenable nagios notification as needed
|
|
. {blank}
|
|
+
|
|
Make sure to perform any manual post-boot setup (such as entering::
|
|
passphrases for encrypted volumes)
|
|
. Close outage ticket.
|
|
|
|
=== Non virthost reboots:
|
|
|
|
If you need to reboot specific hosts and make sure they recover -
|
|
consider using:
|
|
|
|
....
|
|
sudo func-host-reboot hostname hostname1 hostname2 ...
|
|
....
|
|
|
|
If you want to reboot the hosts one at a time waiting for each to come
|
|
back before rebooting the next pass a -o to func-host-reboot.
|