= OpenQA Infrastructure SOP

OpenQA is an automated test system used to run validation tests on
nightly and candidate Fedora composes, and also to run a subset of these
tests on critical path updates.

OpenQA production instance: https://openqa.fedoraproject.org

OpenQA staging instance: https://openqa.stg.fedoraproject.org

Wiki page on Fedora openQA deployment: https://fedoraproject.org/wiki/OpenQA

Upstream project page: http://open.qa/

Upstream repositories: https://github.com/os-autoinst

== Contact Information

Owner::
Fedora QA devel
Contact::
#fedora-qa, #fedora-admin, qa-devel mailing list
People::
Adam Williamson (adamwill / adamw), Petr Schindler (pschindl)
Machines::
See ansible inventory groups with 'openqa' in name
Purpose::
Run automated tests on VMs via screen recognition and VNC input

== Architecture

Each openQA instance consists of a server (these are virtual machines)
and one or more worker hosts (these are bare metal systems). The server
schedules tests ("jobs", in openQA parlance) and stores results and
associated data. The worker hosts run "jobs" and send the results back
to the server. The server also runs some fedmsg consumers to handle
automatic scheduling of jobs and reporting of results to external
systems (ResultsDB and Wikitcms).

== Server

The server runs a web UI for viewing scheduled, running and completed
tests and their data, with an admin interface where many aspects of the
system can be configured (though we do not use the web UI for several
aspects of configuration). There are several separate services that run
on each server, and communicate with each other mainly via dbus. Each
server requires its own PostgreSQL database. The web UI and websockets
server are made externally available via reverse proxying through an
Apache server.

It hosts an NFS share that contains the tests, the 'needles'
(screenshots with metadata as JSON files that are used for screen
matching), and test 'assets' like ISO files and disk images. The path is
`/var/lib/openqa/share/factory`.

In our deployment, the PostgreSQL database for each instance is hosted
by the QA database server. Also, some paths on the server are themselves
mounted as NFS shares from the infra storage server. This is so that
these are not lost if the server is re-deployed, and can easily be
backed up. These locations contain the data from each executed job. As
both the database and these key data files are not actually stored on
the server, the server can be redeployed from scratch without loss of
any data (at least, this is the intent).

Also in our deployment, an openQA plugin (which we wrote, but which is
part of the upstream codebase) is enabled which emits fedmsgs on various
events. This works by calling fedmsg-logger, so the appropriate fedmsg
configuration must be in place for this to emit events correctly.

The server systems run a fedmsg consumer for the purpose of
automatically scheduling jobs in response to the appearance of new
composes and critical path updates, and one for the purpose of reporting
the results of completed jobs to ResultsDB and Wikitcms. These use the
`fedmsg-hub` system.
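
A quick, hedged way to check that these consumers are alive, using only
the `fedmsg-hub` unit already mentioned in this SOP:

....
# on the server: check the hub that hosts the scheduler and reporter consumers
systemctl status fedmsg-hub
# follow its output while a compose or update is being processed
journalctl -u fedmsg-hub -f
....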

== Worker hosts

The worker hosts run several individual worker 'instances' (via
systemd's 'instantiated service' mechanism), each of which registers
with the server and accepts jobs from it, uploading the results of the
job and some associated data to the server on completion. The worker
instances and server communicate both via a conventional web API
provided by the server and via websockets. When a worker runs a job, it
starts a qemu virtual machine (directly - libvirt is not used) and
interacts with it via VNC and the serial console, following a set of
steps dictating what it should do and what response it should expect in
terms of screen contents or serial console output. The server 'pushes'
jobs to the worker instances over a websocket connection.

Each worker host must mount the `/var/lib/openqa/share/factory` NFS
share provided by the server. If this share is not mounted, any jobs run
will fail immediately due to expected asset and test files not being
found.
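
If every job on a host starts failing immediately, checking the mount is
a cheap first step. A minimal sketch (assuming the mount point is
defined in fstab or a systemd mount unit on the host):

....
# verify the share is mounted on this worker host
findmnt /var/lib/openqa/share/factory
# if it is missing, try re-mounting it and then check for failed units
mount /var/lib/openqa/share/factory
systemctl --failed
....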

Some worker hosts for each instance are designated as 'tap workers',
meaning they run some advanced jobs which use software-defined
networking (openvswitch) to interact with each other. All the
configuration for this should be handled by the ansible scripts, but
it's useful to be aware that there is complex software-defined
networking going on on these hosts which could potentially be the
source of problems.

== Deployment and regular operation

Deployment and normal update of the openQA systems should run entirely
through Ansible. Just running the appropriate ansible plays for the
systems should complete the entire deployment / update process, though
it is best to check after running them that there are no failed services
on any of the systems (restart any that failed), and that the web UI is
properly accessible.
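
As a rough sketch of what that looks like from the Ansible control host
(the `rbac-playbook` wrapper and these playbook names are assumptions;
check the infrastructure ansible repository for the current names):

....
sudo rbac-playbook groups/openqa.yml
sudo rbac-playbook groups/openqa-workers.yml
# then, on each openQA system, confirm nothing is broken
systemctl --failed
....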

Regular operation of the openQA deployments is entirely automated. Jobs
should be scheduled and run automatically when new composes and critical
path updates appear, and results should be reported to ResultsDB and
Wikitcms (when appropriate). Dynamically generated assets should be
regenerated regularly, including across release boundaries (see the
section on createhdds below): no manual intervention should be required
when a new Fedora release appears. If any of this does not happen,
something is wrong, and manual inspection is needed.

Our usual practice is to upgrade the openQA systems to new Fedora
releases promptly as they appear, using `dnf system-upgrade`. This is
done manually. We usually upgrade the staging instance first and watch
for problems for a week or two before upgrading production.
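
The upgrade itself follows the standard `dnf system-upgrade` flow; a
sketch (the release number is only an example, and the reboot should be
coordinated as described in the next section):

....
dnf upgrade --refresh
dnf system-upgrade download --releasever=35
dnf system-upgrade reboot
....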

== Rebooting / restarting

The optimal approach to rebooting an entire openQA deployment is as
follows:

[arabic]
. Wait until no jobs are running
. Stop all `openqa-*` services on the server, so no more will be queued
. Stop all `openqa-worker@` services on the worker hosts
. Reboot the server
. Check for failed services (`systemctl --failed`) and restart any that
failed
. Once the server is fully functional, reboot the worker hosts
. Check for failed services and restart any that failed, particularly
the NFS mount service

Rebooting the workers *after* the server is important due to the NFS
share.
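
A hedged sketch of what steps 2 and 3 above look like in practice (the
unit glob and the instance count are illustrative; adjust to the host):

....
# on the server: stop the openQA services so nothing new is queued
systemctl stop 'openqa-*'
# on each worker host: stop all worker instances (here assuming ten)
systemctl stop openqa-worker@{1..10}.service
....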

If only the server needs restarting, the entire procedure above should
ideally be followed in any case, to ensure there are no issues with the
NFS mount breaking due to the server reboot, or the server and worker
getting confused about running jobs due to the websockets connections
being restarted.

If only a worker host needs restarting, there is no need to restart the
server too, but it is best to wait until no jobs are running on that
host, and stop all `openqa-worker@` services on the host before
rebooting it.

There are two ways to check whether jobs are running and, if so, where.
You can go to the web UI for the server and click 'All Tests'. If any
jobs are running, you can open each one individually (click the link in
the 'Test' column) and look at the 'Assigned worker', which will tell
you which host the job is running on. Or, if you have admin access, you
can go to the admin menu (top right of the web UI, once you are logged
in) and click on 'Workers', which shows the status of all known workers
for that server; select 'Working' in the state filter box to show only
the workers currently working on a job.
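
If you are logged in on the server itself, the same information can
probably be obtained from the web API using the low-level client shown
later in this section. This is a sketch only: it assumes the client's
request method defaults to GET and that the `state=running` filter is
accepted by the `jobs` route.

....
/usr/share/openqa/script/client --host localhost jobs state=running
....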

Note that if something which would usually be tested (new compose, new
critpath update...) appears during the reboot window, it likely will
_not_ be scheduled for testing, as this is done by a fedmsg consumer
running on the server. You will need to schedule it for testing manually
in this case (see below).

== Scheduling jobs manually

While it is not normally necessary, you may sometimes need to run or
re-run jobs manually.

The simplest cases can be handled by an admin from the web UI: for a
logged-in admin, all scheduled and running tests can be cancelled (from
various views), and all completed tests can be restarted. 'Restarting' a
job actually effectively clones it and schedules the clone to be run: it
creates a new job with a new job ID, and the previous job still exists.
openQA attempts to handle complex cases of inter-dependent jobs
correctly when restarting, but doesn't always manage to do it right;
when it goes wrong, the best thing to do is usually to re-run all jobs
for that medium.

To run or re-run the full set of tests for a compose or update, you can
use the `fedora-openqa` CLI. To run or re-run tests for a compose, use:

....
fedora-openqa compose -f (COMPOSE LOCATION)
....

where `(COMPOSE LOCATION)` is the full URL of the `/compose`
subdirectory of the compose. This will only work for Pungi-produced
composes with the expected productmd-format metadata, and a couple of
other quite special cases.

The `-f` argument means 'force', and is necessary to re-run tests:
usually, the scheduler will refuse to re-schedule tests that have
already run, and `-f` overrides this.
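
For example, to (re-)schedule all tests for the Rawhide compose used in
the `isos post` example further down (the compose ID is only
illustrative):

....
fedora-openqa compose -f http://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20180108.n.0/compose
....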

To run or re-run tests for an update, use:

....
fedora-openqa update -f (UPDATEID) (RELEASE)
....

where `(UPDATEID)` is the update's ID - something like
`FEDORA-2018-blahblah` - and `(RELEASE)` is the release for which the
update is intended (27, 28, etc.).

To run or re-run only the tests for a specific medium (usually a single
image file), you must use the lower-level web API client, with a more
complex syntax. The command looks something like this:

....
/usr/share/openqa/script/client isos post \
ISO=Fedora-Server-dvd-x86_64-Rawhide-20180108.n.0.iso DISTRI=fedora VERSION=Rawhide \
FLAVOR=Server-dvd-iso ARCH=x86_64 BUILD=Fedora-Rawhide-20180108.n.0 CURREL=27 PREVREL=26 \
RAWREL=28 IMAGETYPE=dvd SUBVARIANT=Server \
LOCATION=http://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20180108.n.0/compose
....

The values are as follows:

* `ISO` is the filename of the image to test (it may not actually be an
ISO).
* `DISTRI` is always 'fedora'.
* `VERSION` should be the release number or 'Rawhide'.
* `FLAVOR` depends on the image being tested (you can check the value
from an existing test for the same or a similar ISO).
* `ARCH` is the arch of the image being tested.
* `BUILD` is the compose ID.
* `CURREL` should be the release number of the current Fedora release at
the time the test is run.
* `PREVREL` should be one lower than `CURREL`.
* `RAWREL` should be the release number associated with Rawhide at the
time the test is run.
* `IMAGETYPE` depends on the image being tested (again, check a similar
test for the correct value).
* `LOCATION` is the URL to the `/compose` subdirectory of the compose
location.
* `SUBVARIANT` again depends on the image being tested.

Please ask for help if this seems too daunting. To re-run the
'universal' tests on a given image, set the `FLAVOR` value to
'universal', then set all other values as appropriate to the chosen
image. The 'universal' tests are only likely to work at all correctly
with DVD or netinst images.

openQA provides a special script for cloning an existing job but
optionally changing one or more variable values, which can be useful in
some situations. Using it looks like this:

....
/usr/share/openqa/script/clone_job.pl --skip-download --from localhost 123 RAWREL=28
....

to clone job 123 with the `RAWREL` variable set to '28', for instance.
For interdependent jobs, you may or may not want to use the
`--skip-deps` argument to avoid re-running the cloned job's parent
job(s), depending on circumstances.

== Manual updates

In general updates to any of the components of the deployments should be
handled via ansible: push the changes out in the appropriate way (git
repo update, package update, etc.) and then run the ansible plays.
However, sometimes we do want to update or test a change to something
manually for some reason. Here are some notes on those cases.

For updating openQA and/or os-autoinst packages: ideally, ensure no jobs
are running. Then, update all installed subpackages on the server. The
server services should be automatically restarted as part of the package
update. Then, update all installed subpackages on the worker hosts, and
restart all worker services. A 'for' loop can help with that, for
instance:

....
for i in {1..10}; do systemctl restart openqa-worker@$i.service; done
....

on a host with ten worker instances.
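
The package update itself is an ordinary dnf transaction, something like
the following on each host (the glob patterns are an assumption about
the subpackage naming; adjust as needed):

....
dnf update 'openqa*' 'os-autoinst*'
....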

For updating the openQA tests:

....
cd /var/lib/openqa/share/tests/fedora
git pull (or git checkout (branch) or whatever)
./templates --clean
./templates-updates --update
....

The templates steps are only necessary if there are any changes to the
templates files.

For updating the scheduler code:

....
cd /root/fedora_openqa
git pull (or whatever changes)
python setup.py install
systemctl restart fedmsg-hub
....

Updating other components of the scheduling process follows the same
pattern: update the code or package, then remember to restart
fedmsg-hub, or the fedmsg consumers won't use the new code. It's
relatively common for the openQA instances to need fedfind updates in
advance of them being pushed to stable: for example, when a new compose
type is invented and fedfind doesn't understand it, openQA can end up
trying to schedule tests for it, or the scheduler consumer can crash.
When this happens we have to fix and update fedfind on the openQA
instances ASAP.

== Logging

Just about all useful logging information for all aspects of openQA and
the scheduling and report tools is logged to the journal, except that
the Apache server logs may be of interest in debugging issues related to
accessing the web UI or websockets server. To get more detailed logging
from openQA components, change the logging level in
`/etc/openqa/openqa.ini` from 'info' to 'debug' and restart the relevant
services. Any run of the Ansible plays will reset this back to 'info'.
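
As a sketch, the relevant stanza in `/etc/openqa/openqa.ini` looks
roughly like this (section and key names follow the upstream openQA
documentation; verify against the file on the server before editing):

....
[logging]
level = debug
....

The journal for a specific component can then be followed with, for
example, `journalctl -u openqa-scheduler -f` on the server or
`journalctl -u openqa-worker@1 -f` on a worker host (substitute
whichever unit you are debugging).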

Occasionally the test execution logs may be useful in figuring out why
all tests are failing very early, or some specific tests are failing due
to an asset going missing, etc. Each job's execution logs can be
accessed through the web UI, on the _Logs & Assets_ tab of the job page;
the files are `autoinst-log.txt` and `worker-log.txt`.

== Dynamic asset generation (createhdds)

Some of the hard disk image file 'assets' used by the openQA tests are
created by a tool called `createhdds`, which is checked out of a git
repo to `/root/createhdds` on the servers and also on some guests. This
tool uses `virt-install` and the Python bindings for `libguestfs` to
create various hard disk images the tests need to run. It is usually run
in two different ways. The ansible plays run it in a mode where it will
only create expected images that are entirely missing: this is mainly
meant to facilitate initial deployment. The plays also install a file to
`/etc/cron.daily` causing it to be run daily in a mode where it will
also recreate images that are 'too old' (the age-out conditions for
images are part of the tool itself).
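
If an image needs to be rebuilt by hand, running the tool from its
checkout is usually enough. A sketch only (the subcommand shown is an
assumption about the tool's CLI, so check its help output first):

....
cd /root/createhdds
./createhdds.py --help
# e.g. rebuild whatever is missing or aged out (assumed subcommand)
./createhdds.py all
....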

This process isn't 100% reliable; `virt-install` can sometimes fail,
either just quasi-randomly or every time, in which case the cause of the
failure needs to be figured out and fixed so the affected image can be
(re-)built.

The i686 and x86_64 images for each instance are built on the server, as
its native arch is x86_64. The images for other arches are built on one
worker host for each arch (nominated by inclusion in an ansible
inventory group that exists for this purpose); those hosts have write
access to the NFS share for this purpose.

== Compose check reports (check-compose)

An additional ansible role runs on each openQA server, called
`check-compose`. This role installs a tool (also called `check-compose`)
and an associated fedmsg consumer. The consumer kicks in when all openQA
tests for any compose finish, and uses the `check-compose` tool to send
out an email report summarizing the results of the tests (well, the
production server sends out emails, the staging server just logs the
contents of the report). This role isn't really a part of openQA proper,
but is run on the openQA servers as it seems like as good a place as any
to do it. As with all other fedmsg consumers, if making manual changes
or updates to the components, remember to restart the `fedmsg-hub`
service afterwards.

== Autocloud ResultsDB forwarder (autocloudreporter)

An ansible role called `autocloudreporter` also runs on the openQA
production server. This has nothing to do with openQA at all, but is run
there for convenience. This role deploys a fedmsg consumer that listens
for fedmsgs indicating that Autocloud (a separate automated test system
which tests cloud images) has completed a test run, then forwards those
results to ResultsDB.