= OpenQA Infrastructure SOP

OpenQA is an automated test system used to run validation tests on
nightly and candidate Fedora composes, and also to run a subset of these
tests on critical path updates.

OpenQA production instance: https://openqa.fedoraproject.org

OpenQA staging (lab) instance: https://openqa.stg.fedoraproject.org

Wiki page on Fedora openQA deployment: https://fedoraproject.org/wiki/OpenQA

Upstream project page: http://open.qa/

Upstream repositories: https://github.com/os-autoinst

== Contact Information

Owner::
Fedora QA devel
Contact::
#fedora-qa, #fedora-admin, qa-devel mailing list
People::
Adam Williamson (adamwill / adamw), Lukas Ruzicka (lruzicka)
Machines::
See ansible inventory groups with 'openqa' in name
Purpose::
Run automated tests on VMs via screen recognition and VNC input

== Architecture

Each openQA instance consists of a server (these are virtual machines)
and one or more worker hosts (these are bare metal systems). The server
schedules tests ("jobs", in openQA parlance) and stores results and
associated data. The worker hosts run jobs and send the results back
to the server. The server also runs some message consumers to handle
automatic scheduling of jobs and reporting of results to external
systems (ResultsDB and Wikitcms).

== Server

The server runs a web UI for viewing scheduled, running and completed
tests and their data, with an admin interface where many aspects of the
system can be configured (though we do not use the web UI for several
aspects of configuration). There are several separate services that run
on each server, and communicate with each other mainly via dbus. Each
server requires its own PostgreSQL database. The web UI and websockets
server are made externally available via reverse proxying through an
Apache server.

It hosts an NFS share that contains the tests, the 'needles'
(screenshots with metadata as JSON files that are used for screen
matching), and test 'assets' like ISO files and disk images. The path is
`/var/lib/openqa/share/factory`.

In our deployment, the PostgreSQL database for each instance is hosted
by the QA database server. Also, some paths on the server are themselves
mounted as NFS shares from the infra storage server. This is so that
these are not lost if the server is re-deployed, and can easily be
backed up. These locations contain the data from each executed job. As
both the database and these key data files are not actually stored on
the server, the server can be redeployed from scratch without loss of
any data (at least, this is the intent).

Also in our deployment, an openQA plugin (which we wrote, but which is
part of the upstream codebase) is enabled which publishes messages on
various events.

The server systems run a message consumer for the purpose of
automatically scheduling jobs in response to the appearance of new
composes and critical path updates, and one each for the purpose of
reporting the results of completed jobs to ResultsDB and Wikitcms.
These use the `fm-consumer@` pattern from `fedora-messaging`.

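The consumers run as `fm-consumer@` instantiated services, so their
state can be checked with standard systemd tooling; a minimal sketch,
using the service names referenced later in this SOP:

....
systemctl list-units 'fm-consumer@*'
systemctl status fm-consumer@fedora_openqa_scheduler.service
....
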
== Worker hosts
|
|
|
|
The worker hosts run several individual worker 'instances' (via
|
|
systemd's 'instantiated service' mechanism), each of which registers
|
|
with the server and accepts jobs from it, uploading the results of the
|
|
job and some associated data to the server on completion. The worker
|
|
instances and server communicate both via a conventional web API
|
|
provided by the server and via websockets. When a worker runs a job, it
|
|
starts a qemu virtual machine (directly - libvirt is not used) and
|
|
interacts with it via VNC and the serial console, following a set of
|
|
steps dictating what it should do and what response it should expect in
|
|
terms of screen contents or serial console output. The server 'pushes'
|
|
jobs to the worker instances over a websocket connection.
|
|
|
|
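Because each instance is an `openqa-worker@N.service` unit, you can see
what a worker host is running with ordinary systemd commands; a minimal
sketch:

....
systemctl list-units 'openqa-worker@*'
systemctl status openqa-worker@1.service
....
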
Each worker host must mount the `/var/lib/openqa/share/factory` NFS
share provided by the server. If this share is not mounted, any jobs run
will fail immediately due to expected asset and test files not being
found.

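To confirm a worker host actually has the share mounted, something like
this works (a minimal sketch):

....
df -h /var/lib/openqa/share/factory
# the Filesystem column should show an NFS export from the server,
# not a local device
....
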
Some worker hosts for each instance are denominated 'tap workers',
meaning they run some advanced jobs which use software-defined
networking (openvswitch) to interact with each other. All the
configuration for this should be handled by the ansible scripts, but
it's useful to be aware that there is complex software-defined
networking going on on these hosts which could potentially be the
source of problems. There is some more detail on this in the wiki page
and upstream docs; refer to the ansible plays for the details of how
it's actually configured.

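If you suspect the software-defined networking on a tap worker, the
openvswitch state can be inspected directly; a minimal sketch (the
bridge and port names are whatever the ansible plays configured, so
treat the output as a starting point, not a reference):

....
ovs-vsctl show
....
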
== Deployment and regular operation

Deployment and normal update of the openQA systems should run entirely
through Ansible. Just running the appropriate ansible plays for the
systems should complete the entire deployment / update process, though
it is best to check after running them that there are no failed services
on any of the systems (restart any that failed), and that the web UI is
properly accessible.

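For reference, the plays are run from batcave01 in the usual way; a
minimal sketch (tags can be added to limit the run, as in the dispatcher
example later in this SOP):

....
sudo rbac-playbook groups/openqa.yml
# the worker hosts are covered by their own group play; check the
# ansible repo for its exact name
....
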
Regular operation of the openQA deployments is entirely automated. Jobs
should be scheduled and run automatically when new composes and critical
path updates appear, and results should be reported to ResultsDB and
Wikitcms (when appropriate). Dynamically generated assets should be
regenerated regularly, including across release boundaries (see the
section on createhdds below): no manual intervention should be required
when a new Fedora release appears. If any of this does not happen,
something is wrong, and manual inspection is needed.

Our usual practice is to upgrade the openQA systems to new Fedora
releases promptly as they appear, using `dnf system-upgrade`. This is
done manually. We usually upgrade the staging instance first and watch
for problems for a week or two before upgrading production.

== Rebooting / restarting

The optimal approach to rebooting an entire openQA deployment is as
follows:

[arabic]
. Reboot the server
. Check for failed services (`systemctl --failed`) and restart any that
failed (see the example after this list)
. Once the server is fully functional, reboot the worker hosts
. Check for failed services and restart any that failed, particularly
the NFS mount service, on each worker host
. Check in the web UI for failed jobs and restart them, especially
tests of updates

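Steps 2 and 4 amount to something like this on each host (a minimal
sketch; the unit names to restart come from the `systemctl --failed`
output):

....
systemctl --failed
systemctl restart (FAILED UNIT)
....
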
Rebooting the workers *after* the server is important due to the NFS
share.

If only the server needs restarting, the entire procedure above should
ideally be followed in any case, to ensure there are no issues with the
NFS mount breaking due to the server reboot, or the server and worker
getting confused about running jobs due to the websockets connections
being restarted.

If only a worker host needs restarting, there is no need to restart the
server too. Ideally, wait until no jobs are running on that host, and
stop all `openqa-worker@` services on the host before rebooting it; but
in a pinch, if you reboot with running jobs, they *should* be
automatically rescheduled. Still, you should manually check in the web
UI for failed jobs and restart them.

There are two ways to check whether jobs are running and, if so, where.
You can go to the web UI for the server and click 'All Tests'. If any
jobs are running, you can open each one individually (click the link in
the 'Test' column) and look at the 'Assigned worker', which will tell
you which host the job is running on. Or, if you have admin access, you
can go to the admin menu (top right of the web UI, once you are logged
in), click on 'Workers' to see the status of all known workers for that
server, and select 'Working' in the state filter box to show all workers
currently working on a job.

== Troubleshooting

=== New tests not being scheduled

Check that `fm-consumer@fedora_openqa_scheduler.service` is enabled,
running, and not crashing. If that doesn't do the trick, the scheduler
may be broken or the expected messages may not be being published.

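A quick way to check, and to see why the consumer might be crashing, is
via systemd and the journal (a minimal sketch):

....
systemctl status fm-consumer@fedora_openqa_scheduler.service
journalctl -u fm-consumer@fedora_openqa_scheduler.service --since "1 hour ago"
....
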
=== Results not being reported to resultsdb and/or the wiki

Check that `fm-consumer@fedora_openqa_resultsdb_reporter.service` and
`fm-consumer@fedora_openqa_wiki_reporter.service` are enabled,
running, and not crashing.

=== Services that write to the wiki keep crashing

If `fm-consumer@fedora_openqa_wiki_reporter.service` (and other
services that write to the wiki, like the `relval` and `relvalami`
consumers) are constantly failing/crashing, the API token may have
been overwritten somehow. Re-run the relevant plays (on batcave01):

....
sudo rbac-playbook groups/openqa.yml -t openqa_dispatcher
....

If this does not sort it out, you may need help from a wiki admin
to work out what's going on.

=== Many tests failing on the same worker host, in unusual ways

Sometimes, worker hosts can just "go bad", through memory exhaustion,
for instance. This usually manifests as unusual test failures (for
instance, failures very early in a test that aren't caused by invalid
test files, tests that time out when they usually would not, or tests
that seem to just die suddenly with a cryptic error message). If you
encounter this, just reboot the affected worker host. This is more
common on staging than production, as we intentionally run the older,
weaker worker hosts on the staging instance. If things are particularly
bad you may not be able to ssh into the host, and will need to reboot
it from the sideband controller; if you're not sure how to do this,
contact someone from sysadmin-main for assistance.

=== Tests failing early, complaining about missing assets

If many tests are failing early with errors suggesting they can't
find required files, check for failed services on the worker hosts.
Sometimes the NFS mount service fails and needs restarting.

=== Disk space issues: server local root

If a server is running out of space on its local root partition, the
cause is almost certainly asset storage. Almost all the space on the
server root partition is used by test assets (ISO and hard disk image
files).

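To confirm where the space is going, check the root filesystem and the
asset directories (a minimal sketch):

....
df -h /
du -sh /var/lib/openqa/share/factory/iso /var/lib/openqa/share/factory/hdd
....
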
openQA has a system for limiting the amount of space used by asset
storage, which we configure via ansible variables. Check the values of
the `openqa_assetsize*` variables in the openQA server group variables
in ansible. If the settings for the server sum to the amount of space
used, or more, those settings may need to be reduced.

If there seems to be more space used than the settings would allow for,
there may be an issue preventing the openQA task that actually enforces
the limits from running: check the "Minion Dashboard" (from the
top-right menu) in the openQA web UI and look for stuck or failed
`limit_assets` tasks (or just check whether any have completed recently;
the task is scheduled after each completed job, so it should run
frequently). There is also an "Assets" link in the menu which gives you
a web UI view of the limits on each job group and the current size and
present assets, though note that the list of present assets and the
current size are updated by the `limit_assets` task, so they will be
inaccurate if that task is not running successfully. You must be an
openQA operator to access the "Assets" view, and an administrator to
access the "Minion Dashboard".

In a pinch, if there is no space and tests are failing, you can wipe
older, larger asset files in `/var/lib/openqa/share/factory/iso` and
`/var/lib/openqa/share/factory/hdd` to get things moving again while
you debug the issue. This is better than letting new tests fail.

=== Disk space issues: testresults and images NFS share

As mentioned above, the server mounts two NFS shares from the infra
storage server, at `/var/lib/openqa/images` and
`/var/lib/openqa/testresults` (they are both actually backed by a
single volume). These are where the screenshots, video and logs of
the executed tests are stored. If they fill up, tests will start to
fail.

openQA has a garbage collection mechanism which deletes (most) files
from (most) jobs when they are six months old, which ought to keep
usage of these shares in a steady state. However, if we enhance test
coverage so openQA is running more tests in any given six-month
period than in earlier ones, space usage will increase correspondingly.
It can also increase in response to odd triggers like a bug which
causes a lot of messages to be logged to a serial console, or a test
being configured to upload a very large file as a log.

More importantly, there is a snapshot mechanism configured on this
volume for the production instance, so space usage will always
gradually increase there. When the volume gets too full, we must
delete some older snapshots to free up space. This must be done by
an infra storage admin. The volume's name is `fedora_openqa`.

== Scheduling jobs manually

While it is not normally necessary, you may sometimes need to run or
re-run jobs manually.

The simplest cases can be handled by an admin from the web UI: for a
logged-in admin, all scheduled and running tests can be cancelled (from
various views), and all completed tests can be restarted. 'Restarting' a
job effectively clones it and schedules the clone to be run: it creates
a new job with a new job ID, and the previous job still exists. openQA
attempts to handle complex cases of inter-dependent jobs correctly when
restarting, but doesn't always manage to do it right; when it goes
wrong, the best thing to do is usually to re-run all jobs for that
medium.

Restarting a job should cause its status indicator (the little colored
blob) to go blue. If nothing changes, the restart likely failed. An
error message should explain why, but it always appears at the top
of the page, so you may need to scroll up to see it. If restarting
a test fails because an asset (an ISO file or hard disk image) is
missing, you will need to re-schedule the tests (see below).

To run or re-run the full set of tests for a compose or update, you can
use the `fedora-openqa` CLI. To run or re-run tests for a compose, use:

....
fedora-openqa compose -f (COMPOSE LOCATION)
....

where `(COMPOSE LOCATION)` is the full URL of the `/compose`
subdirectory of the compose. If you have an existing test to use as a
reference, its Settings tab will show this URL as the `LOCATION`
setting. This will only work for Pungi-produced composes with the
expected productmd-format metadata, and a couple of other quite special
cases.

The `-f` argument means 'force', and is necessary to re-run tests:
usually, the scheduler will refuse to re-schedule tests that have
already run, and `-f` overrides this.

To run or re-run tests for an update, use:

....
fedora-openqa update -f (UPDATEID)
....

where `(UPDATEID)` is the update's ID - something like
`FEDORA-2018-blahblah`.

To run or re-run only the tests for a specific "flavor", you can pass
the `--flavor` (update) or `--flavors` (compose) argument - for an
update it must be a single flavor, for a compose it may be a single
flavor or a comma-separated list. The names of the flavors are shown
in the web UI results overview for the compose or update, e.g.
"Server-boot-iso". For update tests, omit the leading "updates-" in
the flavor name (so, to re-schedule the "updates-workstation" tests
for an update, you would pass `--flavor workstation`).

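Putting that together, re-running just the workstation flavor of tests
for an update would look something like this (a sketch; the update ID is
a placeholder as above):

....
fedora-openqa update -f --flavor workstation (UPDATEID)
....
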
Less commonly, you can schedule tests for scratch builds using
`fedora-openqa task` and side tags using `fedora-openqa tag`. This
should usually only be done on the staging instance. See the help
of `fedora-openqa` for more details.

openQA provides a special script for cloning an existing job but
optionally changing one or more variable values, which can be useful in
some situations. Using it looks like this:

....
/usr/share/openqa/script/clone_job.pl --skip-download --from localhost 123 RAWREL=28
....

to clone job 123 with the `RAWREL` variable set to '28', for instance.
For interdependent jobs, you may or may not want to use the
`--skip-deps` argument to avoid re-running the cloned job's parent
job(s), depending on circumstances.

In very odd circumstances you may need to schedule jobs via an API
request using the low-level CLI client provided by upstream,
`openqa-client`; see http://open.qa/docs/#_triggering_tests for details
on this. You may need to refer to the `schedule.py` file in the
`fedora_openqa` source to figure out exactly what settings to pass to
the scheduler when doing this. It's extremely unusual to have to do
this, though, so probably don't worry about it.

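For the record, such a request is an `isos post` with the settings
spelled out by hand, roughly along these lines (a sketch only; the
placeholders and the exact set of required settings are assumptions to
be checked against `schedule.py` and the upstream docs):

....
openqa-client isos post ISO=(ISO FILENAME) DISTRI=fedora VERSION=(RELEASE) \
    FLAVOR=(FLAVOR) ARCH=x86_64 BUILD=(COMPOSE ID)
....
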
== Manual updates

In general updates to any of the components of the deployments should be
handled via ansible: push the changes out in the appropriate way (git
repo update, package update, etc.) and then run the ansible plays. There
is an `openqa_scratch` variable which can be set to a list of Koji
task IDs for scratch builds; these will be downloaded and configured as
a side repository. This can be used to deploy a newer build of openQA
and/or os-autoinst before it has reached updates-testing if desired
(usually we would do this only on the staging instance). Also, the
`openqa_repo` variable can be set to "updates-testing" to install or
update openQA components with updates-testing enabled, to get a new
version before it has waited a week to reach stable.

However, sometimes we do want to update or test a change to something
manually for some reason. Here are some notes on those cases.

For updating openQA and/or os-autoinst packages: ideally, ensure no jobs
are running. Then, update all installed subpackages on the server. The
server services should be automatically restarted as part of the package
update. Then, update all installed subpackages on the worker hosts.
Usually this should cause the worker services to be restarted, but if
not, a 'for' loop can help with that, for instance:

....
for i in {1..10}; do systemctl restart openqa-worker@$i.service; done
....

on a host with ten worker instances.

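For the package update step itself, something like this on each host is
what is meant by 'update all installed subpackages' (a minimal sketch):

....
dnf upgrade 'openqa*' 'os-autoinst*'
....
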
For updating the openQA tests:

....
cd /var/lib/openqa/share/tests/fedora
git pull (or git checkout (branch) or whatever)
./fifloader.py -c -l templates.fif.json templates-updates.fif.json
....

The fifloader step is only necessary if there are any changes to the
templates files.

For updating the scheduler code:

....
cd /root/fedora_openqa
git pull (or whatever changes)
python setup.py install
systemctl restart fm-consumer@fedora_openqa_scheduler.service
systemctl restart fm-consumer@fedora_openqa_resultsdb_reporter.service
systemctl restart fm-consumer@fedora_openqa_wiki_reporter.service
....

Updating other components of the scheduling process follows the same
pattern: update the code or package, then remember to restart the
message consumers. The openQA instances sometimes need fedfind updates
before they have been pushed to stable: for example, when a new compose
type is invented and fedfind doesn't understand it, openQA can end up
trying to schedule tests for it, or the scheduler consumer can crash.
When this happens we have to fix and update fedfind on the openQA
instances ASAP.

== Logging

Just about all useful logging information for all aspects of openQA and
the scheduling and reporting tools is logged to the journal, except that
the Apache server logs may be of interest in debugging issues related to
accessing the web UI or websockets server. To get more detailed logging
from openQA components, change the logging level in
`/etc/openqa/openqa.ini` from 'info' to 'debug' and restart the relevant
services. Any run of the Ansible plays will reset this back to 'info'.

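The setting in question lives in the logging section of that file; the
relevant snippet looks something like this (a sketch of the expected
format):

....
[logging]
level = debug
....
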
Occasionally the test execution logs may be useful in figuring out why
all tests are failing very early, or some specific tests are failing due
to an asset going missing, etc. Each job's execution logs can be
accessed through the web UI, on the _Logs & Assets_ tab of the job page;
the files are `autoinst-log.txt` and `worker-log.txt`.

== Dynamic asset generation (createhdds)

Some of the hard disk image file 'assets' used by the openQA tests are
created by a tool called `createhdds`, which is checked out of a git
repo to `/root/createhdds` on the servers and also on some guests. This
tool uses `virt-install` and the Python bindings for `libguestfs` to
create various hard disk images the tests need to run. It is usually run
in two different ways. The ansible plays run it in a mode where it will
only create expected images that are entirely missing: this is mainly
meant to facilitate initial deployment. The plays also install a file to
`/etc/cron.daily` causing it to be run daily in a mode where it will
also recreate images that are 'too old' (the age-out conditions for
images are part of the tool itself).

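If an image needs rebuilding by hand, the tool can be run from its
checkout; this is only a sketch - the subcommand shown is an assumption,
so check the tool's `--help` for the exact invocation and image names:

....
cd /root/createhdds
./createhdds.py all
....
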
This process isn't 100% reliable; `virt-install` can sometimes fail,
either just quasi-randomly or every time, in which case the cause of the
failure needs to be figured out and fixed so the affected image can be
(re-)built. This kind of failure is quite "invisible": when regeneration
of an image fails, we just keep the old version. This might be the
problem if update tests start failing because the initial update to
bring the system fully up to date times out, for instance.

The images for each arch are built on one worker host of that arch
(nominated by inclusion in an ansible inventory group that exists for
this purpose); those hosts have write access to the NFS share to allow
this.

== Compose check reports (check-compose)

An additional ansible role runs on each openQA server, called
`check-compose`. This role installs a tool (also called `check-compose`)
and an associated message consumer. The consumer kicks in when all openQA
tests for any compose finish, and uses the `check-compose` tool to send
out an email report summarizing the results of the tests (well, the
production server sends out emails; the staging server just logs the
contents of the report). This role isn't really a part of openQA proper,
but is run on the openQA servers as it seems like as good a place as any
to do it. As with all other message consumers, if making manual changes
or updates to the components, remember to restart the consumer service
afterwards.

== Autocloud ResultsDB forwarder (autocloudreporter)

An ansible role called `autocloudreporter` also runs on the openQA
production server. This has nothing to do with openQA at all, but is run
there for convenience. This role deploys a fedmsg consumer that listens
for fedmsgs indicating that Autocloud (a separate automated test system
which tests cloud images) has completed a test run, then forwards those
results to ResultsDB.