Update and extend the openQA sysadmin guide for 2023
I remembered this thing exists, so I updated it! Some stuff is just freshening, plus I added some juicy new information too. Print it out and read it on the toilet, folks.

Signed-off-by: Adam Williamson <awilliam@redhat.com>
parent 2bb2853ec5
commit fc5f853ea0
1 changed file with 190 additions and 81 deletions

tests on critical path updates.

OpenQA production instance: https://openqa.fedoraproject.org

OpenQA staging (lab) instance: https://openqa.stg.fedoraproject.org

Wiki page on Fedora openQA deployment: https://fedoraproject.org/wiki/OpenQA

Owner::
Contact::
#fedora-qa, #fedora-admin, qa-devel mailing list
People::
Adam Williamson (adamwill / adamw), Lukas Ruzicka (lruzicka)
Machines::
See ansible inventory groups with 'openqa' in name
Purpose::

Each openQA instance consists of a server (these are virtual machines)
and one or more worker hosts (these are bare metal systems). The server
schedules tests ("jobs", in openQA parlance) and stores results and
associated data. The worker hosts run "jobs" and send the results back
to the server. The server also runs some message consumers to handle
automatic scheduling of jobs and reporting of results to external
systems (ResultsDB and Wikitcms).

the server, the server can be redeployed from scratch without loss of
any data (at least, this is the intent).

Also in our deployment, an openQA plugin (which we wrote, but which is
part of the upstream codebase) is enabled which publishes messages on
various events.

The server systems run a message consumer for the purpose of
automatically scheduling jobs in response to the appearance of new
composes and critical path updates, and one each for the purpose of
reporting the results of completed jobs to ResultsDB and Wikitcms.
These use the `fm-consumer@` pattern from `fedora-messaging`.
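
As a quick orientation aid, you can see these consumers on the server
with a systemd glob (a sketch; the exact unit names are listed in the
Troubleshooting section below):

....
systemctl list-units 'fm-consumer@*'
....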

== Worker hosts

networking (openvswitch) to interact with each other. All the
configuration for this should be handled by the ansible scripts, but
it's useful to be aware that there is complex software-defined
networking stuff going on on these hosts which could potentially be the
source of problems (backed by openvswitch). There is some more detail
on this in the wiki page and upstream docs; refer to the ansible plays
for the details of how it's actually configured.

== Deployment and regular operation

The optimal approach to rebooting an entire openQA deployment is as
follows:

[arabic]
. Wait until no jobs are running
. Stop all `openqa-*` services on the server, so no more will be queued
. Stop all `openqa-worker@` services on the worker hosts
. Reboot the server
. Check for failed services (`systemctl --failed`) and restart any that
failed
. Once the server is fully functional, reboot the worker hosts
. Check for failed services and restart any that failed, particularly
the NFS mount service, on each worker host
. Check in the web UI for failed jobs and restart them, especially
tests of updates

Rebooting the workers *after* the server is important due to the NFS
share.
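
For steps 2 and 3, a minimal sketch (it assumes ten worker instances
per host, as in the restart example later in this document; `systemctl`
accepts glob patterns for loaded units):

....
# on the server
systemctl stop 'openqa-*'
# on each worker host
for i in {1..10}; do systemctl stop openqa-worker@$i.service; done
....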

getting confused about running jobs due to the websockets connections
being restarted.

If only a worker host needs restarting, there is no need to restart the
server too. Ideally, wait until no jobs are running on that host, and
stop all `openqa-worker@` services on the host before rebooting it; but
in a pinch, if you reboot with running jobs, they *should* be
automatically rescheduled. Still, you should manually check in the web
UI for failed jobs and restart them.

There are two ways to check if jobs are running and if so where. You can
go to the web UI for the server and click 'All Tests'. If any jobs are
running, they will be shown there. Alternatively, go to the web UI
and click on 'Workers', which will show the status of all known workers
for that server, and select 'Working' in the state filter box. This will
show all workers currently working on a job.

Note that if something which would usually be tested (new compose, new
critpath update...) appears during the reboot window, it likely will
_not_ be scheduled for testing, as this is done by a message consumer
running on the server. You will need to schedule it for testing manually
in this case (see below).

== Troubleshooting

=== New tests not being scheduled

Check that `fm-consumer@fedora_openqa_scheduler.service` is enabled,
running, and not crashing. If that doesn't do the trick, the scheduler
may be broken or the expected messages may not be being published.
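
For a first check, standard systemd tooling is enough (a sketch, run on
the server):

....
systemctl status fm-consumer@fedora_openqa_scheduler.service
journalctl -u fm-consumer@fedora_openqa_scheduler.service --since "1 hour ago"
....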

=== Results not being reported to resultsdb and/or the wiki

Check that `fm-consumer@fedora_openqa_resultsdb_reporter.service` and
`fm-consumer@fedora_openqa_wiki_reporter.service` are enabled,
running, and not crashing.
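
Again, a sketch with plain systemd commands:

....
systemctl status fm-consumer@fedora_openqa_resultsdb_reporter.service \
    fm-consumer@fedora_openqa_wiki_reporter.service
....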

=== Services that write to the wiki keep crashing

If `fm-consumer@fedora_openqa_wiki_reporter.service` (and other
services that write to the wiki, like the `relval` and `relvalami`
consumers) are constantly failing/crashing, the API token may have
been overwritten somehow. Re-run the relevant plays (on batcave01):

....
sudo rbac-playbook groups/openqa.yml -t openqa_dispatcher
....

If this does not sort it out, you may need help from a wiki admin
to work out what's going on.

=== Many tests failing on the same worker host, in unusual ways

Sometimes, worker hosts can just "go bad", through memory exhaustion,
for instance. This usually manifests as unusual test failures (for
instance, failures very early in a test that aren't caused by invalid
test files, tests that time out when they usually would not, or tests
that seem to just die suddenly with a cryptic error message). If you
encounter this, just reboot the affected worker host. This is more
common on staging than production, as we intentionally run the older,
weaker worker hosts on the staging instance. If things are particularly
bad you may not be able to ssh into the host, and will need to reboot
it from the sideband controller; if you're not sure how to do this,
contact someone from sysadmin-main for assistance.
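
If the host is still reachable, a graceful reboot looks something like
this (a sketch, assuming ten worker instances on the host):

....
for i in {1..10}; do systemctl stop openqa-worker@$i.service; done
systemctl reboot
....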

=== Tests failing early, complaining about missing assets

If many tests are failing early with errors suggesting they can't
find required files, check for failed services on the worker hosts.
Sometimes the NFS mount service fails and needs restarting.
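
For instance (a sketch: the unit name is an assumption derived from the
share being mounted at `/var/lib/openqa/share`, per systemd mount unit
naming):

....
systemctl --failed
# assumed mount unit name, derived from the mount point
systemctl restart var-lib-openqa-share.mount
....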

=== Disk space issues: server local root

If a server is running out of space on its local root partition, the
cause is almost certainly asset storage. Almost all the space on the
server root partition is used by test assets (ISO and hard disk image
files).

openQA has a system for limiting the amount of space used by asset
storage, which we configure via ansible variables. Check the values of
the `openqa_assetsize*` variables in the openQA server group variables
in ansible. If the settings for the server sum to the amount of space
used, or more than it, those settings may need to be reduced. If there
seems to be more space used than the settings would allow for, there
may be an issue preventing the openQA task that actually enforces the
limits from running: check the "Minion Dashboard" (from the top-right
menu) in the openQA web UI and look for stuck or failed `limit_assets`
tasks (or just check whether any have completed recently; the task is
scheduled after each completed job, so it should run frequently).

There is also an "Assets" link in the menu which gives you a web UI
view of the limits on each job group and the current size and present
assets, though note the list of present assets and the current size is
updated by the `limit_assets` task, so it will be inaccurate if that is
not being run successfully. You must be an openQA operator to access
the "Assets" view, and an administrator to access the "Minion
Dashboard".

In a pinch, if there is no space and tests are failing, you can wipe
older, larger asset files in `/var/lib/openqa/share/factory/iso` and
`/var/lib/openqa/share/factory/hdd` to get things moving again while
you debug the issue. This is better than letting new tests fail.

=== Disk space issues: testresults and images NFS share

As mentioned above, the server mounts two NFS shares from the infra
storage server, at `/var/lib/openqa/images` and
`/var/lib/openqa/testresults` (they are both actually backed by a
single volume). These are where the screenshots, video and logs of
the executed tests are stored. If they fill up, tests will start to
fail.
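
A quick check (run on the server):

....
df -h /var/lib/openqa/images /var/lib/openqa/testresults
....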

openQA has a garbage collection mechanism which deletes (most) files
from (most) jobs when they are six months old, which ought to keep
usage of these shares in a steady state. However, if we enhance test
coverage so openQA is running more tests in any given six month
period than earlier ones, space usage will increase correspondingly.
It can also increase in response to odd triggers like a bug which
causes a lot of messages to be logged to a serial console, or a test
being configured to upload a very large file as a log.

More importantly, there is a snapshot mechanism configured on this
volume for the production instance, so space usage will always
gradually increase there. When the volume gets too full, we must
delete some older snapshots to free up space. This must be done by
an infra storage admin. The volume's name is `fedora_openqa`.

== Scheduling jobs manually

correctly when restarting, but doesn't always manage to do it right;
when it goes wrong, the best thing to do is usually to re-run all jobs
for that medium.

Restarting a job should cause its status indicator (the little colored
blob) to go blue. If nothing changes, the restart likely failed. An
error message should explain why, but it always appears at the top
of the page, so you may need to scroll up to see it. If restarting
a test fails because an asset (an ISO file or hard disk image) is
missing, you will need to re-schedule the tests (see below).

To run or re-run the full set of tests for a compose or update, you can
use the `fedora-openqa` CLI. To run or re-run tests for a compose, use:

....
fedora-openqa compose -f (COMPOSE LOCATION)
....

where `(COMPOSE LOCATION)` is the full URL of the `/compose`
subdirectory of the compose. If you have an existing test to use as a
reference, go to the Settings tab, and the URL will be set as the
`LOCATION` setting. This will only work for Pungi-produced composes
with the expected productmd-format metadata, and a couple of other
quite special cases.

The `-f` argument means 'force', and is necessary to re-run tests:
usually, the scheduler will refuse to re-schedule tests that have

already run, and `-f` overrides this.

To run or re-run tests for an update, use:

....
fedora-openqa update -f (UPDATEID)
....

where `(UPDATEID)` is the update's ID - something like
`FEDORA-2018-blahblah`.

To run or re-run only the tests for a specific "flavor", you can pass
the `--flavor` (update) or `--flavors` (compose) argument - for an
update it must be a single flavor, for a compose it may be a single
flavor or a comma-separated list. The names of the flavors are shown
in the web UI results overview for the compose or update, e.g.
"Server-boot-iso". For update tests, omit the leading "updates-" in
the flavor name (so, to re-schedule the "updates-workstation" tests
for an update, you would pass `--flavor workstation`).
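
For instance, something like the following (a sketch; check
`fedora-openqa compose --help` and `fedora-openqa update --help` for
the exact syntax, and note the update ID here is made up):

....
fedora-openqa compose -f --flavors Server-boot-iso (COMPOSE LOCATION)
fedora-openqa update -f --flavor workstation FEDORA-2023-abcdef0123
....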

Less commonly, you can schedule tests for scratch builds using
`fedora-openqa task` and side tags using `fedora-openqa tag`. This
should usually only be done on the staging instance. See the help
of `fedora-openqa` for more details.

openQA provides a special script for cloning an existing job but
optionally changing one or more variable values, which can be useful in
a variety of situations.

For interdependent jobs, you may or may not want to use the
`--skip-deps` argument to avoid re-running the cloned job's parent
job(s), depending on circumstances.
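
The script is `openqa-clone-job`; a hypothetical invocation (the job
URL and variable override here are made up):

....
openqa-clone-job --skip-deps \
    https://openqa.fedoraproject.org/tests/123456 RAWREL=40
....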

In very odd circumstances you may need to schedule jobs via an API
request using the low-level CLI client provided by upstream,
`openqa-client`; see http://open.qa/docs/#_triggering_tests for details
on this. You may need to refer to the `schedule.py` file in the
`fedora_openqa` source to figure out exactly what settings to pass to
the scheduler when doing this. It's extremely unusual to have to do
this, though, so probably don't worry about it.

== Manual updates

In general updates to any of the components of the deployments should be
handled via ansible: push the changes out in the appropriate way (git
repo update, package update, etc.) and then run the ansible plays. There
is an `openqa_scratch` variable which can be set to a list of Koji
task IDs for scratch builds; these will be downloaded and configured as
a side repository. This can be used to deploy a newer build of openQA
and/or os-autoinst before it has reached updates-testing if desired
(usually we would do this only on the staging instance). Also, the
`openqa_repo` variable can be set to "updates-testing" to install or
update openQA components with updates-testing enabled, to get a new
version before it has waited a week to reach stable.
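
One way to use these (a sketch; it assumes `rbac-playbook` passes extra
variables through to `ansible-playbook`, and the Koji task ID is made
up):

....
sudo rbac-playbook groups/openqa.yml -e '{"openqa_scratch": ["112233445"]}'
....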

However, sometimes we do want to update or test a change to something
manually for some reason. Here are some notes on those cases.

For updating openQA and/or os-autoinst packages: ideally, ensure no jobs
are running. Then, update all installed subpackages on the server. The
server services should be automatically restarted as part of the package
update. Then, update all installed subpackages on the worker hosts.
Usually this should cause the worker services to be restarted, but if
not, a 'for' loop can help with that, for instance:

....
for i in {1..10}; do systemctl restart openqa-worker@$i.service; done
....

For updating the openQA tests:

....
cd /var/lib/openqa/share/tests/fedora
git pull (or git checkout (branch) or whatever)
./fifloader.py -c -l templates.fif.json templates-updates.fif.json
....

The fifloader step is only necessary if there are any changes to the
templates files.

For updating the scheduler code:

....
cd /root/fedora_openqa
git pull (or whatever changes)
python setup.py install
systemctl restart fm-consumer@fedora_openqa_scheduler.service
systemctl restart fm-consumer@fedora_openqa_resultsdb_reporter.service
systemctl restart fm-consumer@fedora_openqa_wiki_reporter.service
....

Updating other components of the scheduling process follows the same
pattern: update the code or package, then remember to restart
the message consumers. It's possible for the openQA instances to need
fedfind updates in advance of them being pushed to stable; for example,
when a new compose type is invented and fedfind doesn't understand it,
openQA can end up trying to schedule tests for it, or the scheduler
consumer can crash. When this happens we have to fix and update
fedfind on the openQA instances ASAP.

== Logging

images are part of the tool itself).

This process isn't 100% reliable; `virt-install` can sometimes fail,
either just quasi-randomly or every time, in which case the cause of the
failure needs to be figured out and fixed so the affected image can be
(re-)built. This kind of failure is quite "invisible", as when
regeneration of an image fails, we just keep the old version; this
might be the problem if update tests start failing because the initial
update to bring the system fully up to date times out, for instance.

The images for each arch are built on one worker host of that arch
(nominated by inclusion in an ansible inventory group that exists for
this purpose); those hosts have write access to the NFS share for this
purpose.

== Compose check reports (check-compose)

An additional ansible role runs on each openQA server, called
`check-compose`. This role installs a tool (also called `check-compose`)
and an associated message consumer. The consumer kicks in when all openQA
tests for any compose finish, and uses the `check-compose` tool to send
out an email report summarizing the results of the tests (well, the
production server sends out emails, the staging server just logs the
contents of the report). This role isn't really a part of openQA proper,
but is run on the openQA servers as it seems like as good a place as any
to do it. As with all other message consumers, if making manual changes
or updates to the components, remember to restart the consumer service
afterwards.

== Autocloud ResultsDB forwarder (autocloudreporter)