From fc5f853ea05b770672fc6854663c4fef301a519e Mon Sep 17 00:00:00 2001
From: Adam Williamson
Date: Tue, 25 Jul 2023 16:22:05 -0700
Subject: [PATCH] Update and extend the openQA sysadmin guide for 2023

I remembered this thing exists, so I updated it! Some stuff is just
freshening, plus I added some juicy new information too. Print it out
and read it on the toilet, folks.

Signed-off-by: Adam Williamson
---
 modules/sysadmin_guide/pages/openqa.adoc | 271 ++++++++++++++++-------
 1 file changed, 190 insertions(+), 81 deletions(-)

diff --git a/modules/sysadmin_guide/pages/openqa.adoc b/modules/sysadmin_guide/pages/openqa.adoc
index 07b60ee..4cf5a98 100644
--- a/modules/sysadmin_guide/pages/openqa.adoc
+++ b/modules/sysadmin_guide/pages/openqa.adoc
@@ -6,7 +6,7 @@ tests on critical path updates.

OpenQA production instance: https://openqa.fedoraproject.org

-OpenQA staging instance: https://openqa.stg.fedoraproject.org
+OpenQA staging (lab) instance: https://openqa.stg.fedoraproject.org

Wiki page on Fedora openQA deployment:
https://fedoraproject.org/wiki/OpenQA
@@ -21,7 +21,7 @@ Owner::
Contact::
  #fedora-qa, #fedora-admin, qa-devel mailing list
People::
-  Adam Williamson (adamwill / adamw), Petr Schindler (pschindl)
+  Adam Williamson (adamwill / adamw), Lukas Ruzicka (lruzicka)
Machines::
  See ansible inventory groups with 'openqa' in name
Purpose::
@@ -33,7 +33,7 @@ Each openQA instance consists of a server (these are virtual machines)
and one or more worker hosts (these are bare metal systems). The server
schedules tests ("jobs", in openQA parlance) and stores results and
associated data. The worker hosts run "jobs" and send the results back
-to the server. The server also runs some fedmsg consumers to handle
+to the server. The server also runs some message consumers to handle
automatic scheduling of jobs and reporting of results to external
systems (ResultsDB and Wikitcms).
@@ -63,15 +63,14 @@ the server, the server can be redeployed from scratch without loss of
any data (at least, this is the intent).

Also in our deployment, an openQA plugin (which we wrote, but which is
-part of the upstream codebase) is enabled which emits fedmsgs on various
-events. This works by calling fedmsg-logger, so the appropriate fedmsg
-configuration must be in place for this to emit events correctly.
+part of the upstream codebase) is enabled which publishes messages on
+various events.

-The server systems run a fedmsg consumer for the purpose of
+The server systems run a message consumer for the purpose of
automatically scheduling jobs in response to the appearance of new
-composes and critical path updates, and one for the purpose of reporting
-the results of completed jobs to ResultsDB and Wikitcms. These use the
-`fedmsg-hub` system.
+composes and critical path updates, and one each for the purpose of
+reporting the results of completed jobs to ResultsDB and Wikitcms.
+These use the `fm-consumer@` pattern from `fedora-messaging`.

== Worker hosts
@@ -98,7 +97,9 @@ networking (openvswitch) to interact with each other.
All the configuration for this should be handled by the ansible
scripts, but it's useful to be aware that there is complex
software-defined networking stuff going on on these hosts which could
potentially be the
-source of problems.
+source of problems (backed by openvswitch). There is some more detail
+on this in the wiki page and upstream docs; refer to the ansible plays
+for the details of how it's actually configured.

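+As a quick health check on a worker host, something like the following
+is usually enough (a minimal sketch; the exact unit names on a given
+host can be confirmed with `systemctl list-units`):
+
+....
+# any failed units are the first thing to look for
+systemctl --failed
+# state of all the worker "slots" on this host
+systemctl list-units 'openqa-worker@*'
+# confirm the shared factory/tests directory really is NFS-mounted
+findmnt /var/lib/openqa/share
+....
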
== Deployment and regular operation
@@ -129,15 +130,14 @@ The optimal approach to rebooting an entire openQA deployment is as
follows:

[arabic]
-. Wait until no jobs are running
-. Stop all `openqa-*` services on the server, so no more will be queued
-. Stop all `openqa-worker@` services on the worker hosts
. Reboot the server
. Check for failed services (`systemctl --failed`) and restart any that
failed
. Once the server is fully functional, reboot the worker hosts
. Check for failed services and restart any that failed, particularly
-the NFS mount service
+the NFS mount service, on each worker host
+. Check in the web UI for failed jobs and restart them, especially
+tests of updates

Rebooting the workers *after* the server is important due to the NFS
share.
@@ -149,9 +149,11 @@ getting confused about running jobs due to the websockets connections
being restarted.

If only a worker host needs restarting, there is no need to restart the
-server too, but it is best to wait until no jobs are running on that
-host, and stop all `open-worker@` services on the host before rebooting
-it.
+server too. Ideally, wait until no jobs are running on that host, and
+stop all `openqa-worker@` services on the host before rebooting it; but
+in a pinch, if you reboot with running jobs, they *should* be
+automatically rescheduled. Still, you should manually check in the web
+UI for failed jobs and restart them.

There are two ways to check if jobs are running and if so where. You can
go to the web UI for the server and click 'All Tests'. If any jobs are
@@ -163,11 +165,107 @@ and click on 'Workers', which will show the status of all known workers
for that server, and select 'Working' in the state filter box. This will
show all workers currently working on a job.

-Note that if something which would usually be tested (new compose, new
-critpath update...) appears during the reboot window, it likely will
-_not_ be scheduled for testing, as this is done by a fedmsg consumer
-running on the server. You will need to schedule it for testing manually
-in this case (see below).
+== Troubleshooting
+
+=== New tests not being scheduled
+
+Check that `fm-consumer@fedora_openqa_scheduler.service` is enabled,
+running, and not crashing. If that doesn't do the trick, the scheduler
+may be broken or the expected messages may not be getting published.
+
+=== Results not being reported to resultsdb and/or the wiki
+
+Check that `fm-consumer@fedora_openqa_resultsdb_reporter.service` and
+`fm-consumer@fedora_openqa_wiki_reporter.service` are enabled,
+running, and not crashing.
+
+=== Services that write to the wiki keep crashing
+
+If `fm-consumer@fedora_openqa_wiki_reporter.service` (and other
+services that write to the wiki, like the `relval` and `relvalami`
+consumers) are constantly failing/crashing, the API token may have
+been overwritten somehow. Re-run the relevant plays (on batcave01):
+
+....
+sudo rbac-playbook groups/openqa.yml -t openqa_dispatcher
+....
+
+If this does not sort it out, you may need help from a wiki admin
+to work out what's going on.
+
+=== Many tests failing on the same worker host, in unusual ways
+
+Sometimes, worker hosts can just "go bad", through memory exhaustion,
+for instance. This usually manifests as unusual test failures (for
+instance, failures very early in a test that aren't caused by invalid
+test files, tests that time out when they usually would not, or tests
+that seem to just die suddenly with a cryptic error message). If you
+encounter this, just reboot the affected worker host. This is more
+common on staging than production, as we intentionally run the older,
+weaker worker hosts on the staging instance. If things are particularly
+bad you may not be able to ssh into the host, and will need to reboot
+it from the sideband controller; if you're not sure how to do this,
+contact someone from sysadmin-main for assistance.
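+
+If you can still get into the host, a minimal sketch for taking it out
+of service before the reboot looks like this (the number of worker
+"slots" varies by host, so adjust the range):
+
+....
+# stop all worker slots so no new jobs get picked up
+for i in {1..10}; do systemctl stop openqa-worker@$i.service; done
+systemctl reboot
+....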
+
+=== Tests failing early, complaining about missing assets
+
+If many tests are failing early with errors suggesting they can't
+find required files, check for failed services on the worker hosts.
+Sometimes the NFS mount service fails and needs restarting.
+
+=== Disk space issues: server local root
+
+If a server is running out of space on its local root partition, the
+cause is almost certainly asset storage. Almost all the space on the
+server root partition is used by test assets (ISO and hard disk image
+files).
+
+openQA has a system for limiting the amount of space used by asset
+storage, which we configure via ansible variables. Check the values of
+the `openqa_assetsize*` variables in the openQA server group variables
+in ansible. If the settings for the server add up to the amount of
+space used, or more, those settings may need to be reduced. If there
+seems to be more space used than the settings would allow for, there
+may be an issue preventing the openQA task that actually enforces the
+limits from running: check the "Minion Dashboard" (from the top-right
+menu) in the openQA web UI and look for stuck or failed `limit_assets`
+tasks (or just check whether any have completed recently; the task is
+scheduled after each completed job so it should run frequently). There
+is also an "Assets" link in the menu which gives you a web UI view of
+the limits on each job group, the current size, and the assets present,
+though note that the list of present assets and the current size are
+updated by the `limit_assets` task, so they will be inaccurate if that
+is not being run successfully. You must be an openQA operator to access
+the "Assets" view, and an administrator to access the "Minion Dashboard".
+
+In a pinch, if there is no space and tests are failing, you can wipe
+older, larger asset files in `/var/lib/openqa/share/factory/iso` and
+`/var/lib/openqa/share/factory/hdd` to get things moving again while
+you debug the issue. This is better than letting new tests fail.
+
+=== Disk space issues: testresults and images NFS share
+
+As mentioned above, the server mounts two NFS shares from the infra
+storage server, at `/var/lib/openqa/images` and
+`/var/lib/openqa/testresults` (they are both actually backed by a
+single volume). These are where the screenshots, video and logs of
+the executed tests are stored. If they fill up, tests will start to
+fail.
+
+openQA has a garbage collection mechanism which deletes (most) files
+from (most) jobs when they are six months old, which ought to keep
+usage of these shares in a steady state. However, if we enhance test
+coverage so openQA is running more tests in any given six month
+period than earlier ones, space usage will increase correspondingly.
+It can also increase in response to odd triggers like a bug which
+causes a lot of messages to be logged to a serial console, or a test
+being configured to upload a very large file as a log.
+
+More importantly, there is a snapshot mechanism configured on this
+volume for the production instance, so space usage will always
+gradually increase there. When the volume gets too full, we must
+delete some older snapshots to free up space. This must be done by
+an infra storage admin. The volume's name is `fedora_openqa`.
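+
+To see how close to full the shares are, and roughly where the space
+is going, something like this (run on the server; purely illustrative)
+is usually enough:
+
+....
+df -h /var/lib/openqa/testresults /var/lib/openqa/images
+# biggest subdirectories, to see what is actually eating the space
+du -sh /var/lib/openqa/testresults/* 2>/dev/null | sort -h | tail -n 20
+....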

== Scheduling jobs manually
@@ -184,6 +282,13 @@ correctly when restarting, but doesn't always manage to do it right;
when it goes wrong, the best thing to do is usually to re-run all jobs
for that medium.

+Restarting a job should cause its status indicator (the little colored
+blob) to go blue. If nothing changes, the restart likely failed. An
+error message should explain why, but it always appears at the top
+of the page, so you may need to scroll up to see it. If restarting
+a test fails because an asset (an ISO file or hard disk image) is
+missing, you will need to re-schedule the tests (see below).
+
To run or re-run the full set of tests for a compose or update, you can
use the `fedora-openqa` CLI. To run or re-run tests for a compose, use:
@@ -192,9 +297,11 @@ fedora-openqa compose -f (COMPOSE LOCATION)
....

where `(COMPOSE LOCATION)` is the full URL of the `/compose`
-subdirectory of the compose. This will only work for Pungi-produced
-composes with the expected productmd-format metadata, and a couple of
-other quite special cases.
+subdirectory of the compose. If you have an existing test to use as a
+reference, go to its Settings tab; the URL appears there as the
+`LOCATION` setting. This will only work for Pungi-produced composes
+with the expected productmd-format metadata, and a couple of other
+quite special cases.

The `-f` argument means 'force', and is necessary to re-run tests:
usually, the scheduler will refuse to re-schedule tests that have
@@ -203,42 +310,25 @@ already run, and `-f` overrides this.

To run or re-run tests for an update, use:

....
-fedora-openqa update -f (UPDATEID) (RELEASE)
+fedora-openqa update -f (UPDATEID)
....

where `(UPDATEID)` is the update's ID - something like
-`FEDORA-2018-blahblah` - and `(RELEASE)` is the release for which the
-update is intended (27, 28, etc).
+`FEDORA-2018-blahblah`.

-To run or re-run only the tests for a specific medium (usually a single
-image file), you must use the lower-level web API client, with a more
-complex syntax. The command looks something like this:
+To run or re-run only the tests for a specific "flavor", you can pass
+the `--flavor` (update) or `--flavors` (compose) argument - for an
+update it must be a single flavor, for a compose it may be a single
+flavor or a comma-separated list. The names of the flavors are shown
+in the web UI results overview for the compose or update, e.g.
+"Server-boot-iso". For update tests, omit the leading "updates-" in
+the flavor name (so, to re-schedule the "updates-workstation" tests
+for an update, you would pass `--flavor workstation`).

-....
-/usr/share/openqa/script/client isos post \
-ISO=Fedora-Server-dvd-x86_64-Rawhide-20180108.n.0.iso DISTRI=fedora VERSION=Rawhide \
-FLAVOR=Server-dvd-iso ARCH=x86_64 BUILD=Fedora-Rawhide-20180108.n.0 CURRREL=27 PREVREL=26 \
-RAWREL=28 IMAGETYPE=dvd SUBVARIANT=Server \
-LOCATION=http://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20180108.n.0/compose
-....
-
-The `ISO` value is the filename of the image to test (it may not
-actually be an ISO), the `DISTRI` value is always 'fedora', the
-`VERSION` value should be the release number or 'Rawhide', the `FLAVOR`
-value depends on the image being tested (you can check the value from an
-existing test for the same or a similar ISO), the `ARCH` value is the
-arch of the image being tested, the `BUILD` value is the compose ID,
-`CURREL` should be the release number of the current Fedora release at
-the time the test is run, `PREVREL` should be one lower than `CURREL`,
-`RAWREL` should be the release number associated with Rawhide at the
-time the test is run, `IMAGETYPE` depends on the image being tested
-(again, check a similar test for the correct value), `LOCATION` is the
-URL to the /compose subdirectory of the compose location, and
-`SUBVARIANT` again depends on the image being tested. Please ask for
-help if this seems too daunting. To re-run the 'universal' tests on a
-given image, set the `FLAVOR` value to 'universal', then set all other
-values as appropriate to the chosen image. The 'universal' tests are
-only likely to work at all correctly with DVD or netinst images.
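+
+For example, flavor-limited runs look something like this (the argument
+order here is illustrative; check `fedora-openqa --help` if in doubt):
+
+....
+# re-run only the Server-boot-iso tests for a compose
+fedora-openqa compose -f --flavors Server-boot-iso (COMPOSE LOCATION)
+# re-run only the workstation flavor of the update tests
+fedora-openqa update -f --flavor workstation (UPDATEID)
+....
+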
+Less commonly, you can schedule tests for scratch builds using
+`fedora-openqa task` and side tags using `fedora-openqa tag`. This
+should usually only be done on the staging instance. See the help
+of `fedora-openqa` for more details.

openQA provides a special script for cloning an existing job but
optionally changing one or more variable values, which can be useful in
@@ -253,20 +343,37 @@ For interdependent jobs, you may or may not want to use the
`--skip-deps` argument to avoid re-running the cloned job's parent
job(s), depending on circumstances.

+In very odd circumstances you may need to schedule jobs via an API
+request using the low-level CLI client provided by upstream,
+`openqa-client`; see http://open.qa/docs/#_triggering_tests for details
+on this. You may need to refer to the `schedule.py` file in the
+`fedora_openqa` source to figure out exactly what settings to pass to
+the scheduler when doing this. It's extremely unusual to have to do
+this, though, so probably don't worry about it.
+
== Manual updates

In general updates to any of the components of the deployments should
be handled via ansible: push the changes out in the appropriate way (git
-repo update, package update, etc.) and then run the ansible plays.
+repo update, package update, etc.) and then run the ansible plays. There
+is an `openqa_scratch` variable which can be set to a list of Koji
+task IDs for scratch builds; these will be downloaded and configured as
+a side repository. This can be used to deploy a newer build of openQA
+and/or os-autoinst before it has reached updates-testing if desired
+(usually we would do this only on the staging instance). Also, the
+`openqa_repo` variable can be set to "updates-testing" to install or
+update openQA components with updates-testing enabled, to get a new
+version before it has waited a week to reach stable.
+
However, sometimes we do want to update or test a change to something
manually for some reason. Here are some notes on those cases.

For updating openQA and/or os-autoinst packages: ideally, ensure no jobs
are running. Then, update all installed subpackages on the server. The
server services should be automatically restarted as part of the package
-update. Then, update all installed subpackages on the worker hosts, and
-restart all worker services. A 'for' loop can help with that, for
-instance:
+update. Then, update all installed subpackages on the worker hosts.
+Usually this should cause the worker services to be restarted, but if
+not, a 'for' loop can help with that, for instance:

....
for i in {1..10}; do systemctl restart openqa-worker@$i.service; done
....
@@ -279,11 +386,10 @@ For updating the openQA tests:
....
cd /var/lib/openqa/share/tests/fedora
git pull (or git checkout (branch) or whatever)
-./templates --clean
-./templates-updates --update
+./fifloader.py -c -l templates.fif.json templates-updates.fif.json
....

-The templates steps are only necessary if there are any changes to the
+The fifloader step is only necessary if there are any changes to the
templates files.

For updating the scheduler code:
@@ -292,18 +398,19 @@ For updating the scheduler code:
....
cd /root/fedora_openqa
git pull (or whatever changes)
python setup.py install
-systemctl restart fedmsg-hub
+systemctl restart fm-consumer@fedora_openqa_scheduler.service
+systemctl restart fm-consumer@fedora_openqa_resultsdb_reporter.service
+systemctl restart fm-consumer@fedora_openqa_wiki_reporter.service
....

Updating other components of the scheduling process follow the same
pattern: update the code or package, then remember to restart
-fedmsg-hub, or the fedmsg consumers won't use the new code. It's
-relatively common for the openQA instances to need fedfind updates in
-advance of them being pushed to stable, for example when a new compose
-type is invented and fedfind doesn't understand it, openQA can end up
-trying to schedule tests for it, or the scheduler consumer can crash;
-when this happens we have to fix and update fedfind on the openQA
-instances ASAP.
+the message consumers. It's possible for the openQA instances to need
+fedfind updates in advance of them being pushed to stable: for example,
+when a new compose type is invented and fedfind doesn't understand it,
+openQA can end up trying to schedule tests for it, or the scheduler
+consumer can crash; when this happens we have to fix and update
+fedfind on the openQA instances ASAP.

== Logging
@@ -338,26 +445,28 @@ images are part of the tool itself). This process isn't 100% reliable;
`virt-install` can sometimes fail, either just quasi-randomly or every
time, in which case the cause of the failure needs to be figured out
and fixed so the affected image can be
-(re-)built.
+(re-)built. This kind of failure is quite "invisible", as when
+regeneration of an image fails, we just keep the old version; this
+might be the problem if update tests start failing because the initial
+update to bring the system fully up to date times out, for instance.

-The i686 and x86_64 images for each instance are built on the server, as
-its native arch is x86_64. The images for other arches are built on one
-worker host for each arch (nominated by inclusion in an ansible
-inventory group that exists for this purpose); those hosts have write
-access to the NFS share for this purpose.
+The images for each arch are built on one worker host of that arch
+(nominated by inclusion in an ansible inventory group that exists for
+this purpose); those hosts have write access to the NFS share for this
+purpose.
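+
+Because a failed rebuild just leaves the old image in place, it can be
+worth checking the image timestamps now and again; a rough sketch (the
+exact directory layout may differ slightly):
+
+....
+# list the most recently modified disk images under the factory share;
+# very old timestamps suggest regeneration has been quietly failing
+find /var/lib/openqa/share/factory/hdd -type f -printf '%TY-%Tm-%Td %p\n' | sort -r | head
+....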

== Compose check reports (check-compose)

An additional ansible role runs on each openQA server, called
`check-compose`. This role installs a tool (also called `check-compose`)
-and an associated fedmsg consumer. The consumer kicks in when all openQA
+and an associated message consumer. The consumer kicks in when all openQA
tests for any compose finish, and uses the `check-compose` tool to send
out an email report summarizing the results of the tests (well, the
production server sends out emails, the staging server just logs the
contents of the report). This role isn't really a part of openQA proper,
but is run on the openQA servers as it seems like as good a place as any
-to do it. As with all other fedmsg consumers, if making manual changes
-or updates to the components, remember to restart `fedmsg-hub` service
+to do it. As with all other message consumers, if making manual changes
+or updates to the components, remember to restart the consumer service
afterwards.

== Autocloud ResultsDB forwarder (autocloudreporter)