Update and extend the openQA sysadmin guide for 2023
I remembered this thing exists, so I updated it! Some stuff is just freshening, plus I added some juicy new information too. Print it out and read it on the toilet, folks.

Signed-off-by: Adam Williamson <awilliam@redhat.com>
parent 2bb2853ec5
commit fc5f853ea0
1 changed file with 190 additions and 81 deletions

@@ -6,7 +6,7 @@ tests on critical path updates.

OpenQA production instance: https://openqa.fedoraproject.org

OpenQA staging (lab) instance: https://openqa.stg.fedoraproject.org

Wiki page on Fedora openQA deployment: https://fedoraproject.org/wiki/OpenQA

@@ -21,7 +21,7 @@ Owner::

Contact::
#fedora-qa, #fedora-admin, qa-devel mailing list
People::
Adam Williamson (adamwill / adamw), Lukas Ruzicka (lruzicka)
Machines::
See ansible inventory groups with 'openqa' in name
Purpose::

@@ -33,7 +33,7 @@ Each openQA instance consists of a server (these are virtual machines)

and one or more worker hosts (these are bare metal systems). The server
schedules tests ("jobs", in openQA parlance) and stores results and
associated data. The worker hosts run "jobs" and send the results back
to the server. The server also runs some message consumers to handle
automatic scheduling of jobs and reporting of results to external
systems (ResultsDB and Wikitcms).

@@ -63,15 +63,14 @@ the server, the server can be redeployed from scratch without loss of

any data (at least, this is the intent).

Also in our deployment, an openQA plugin (which we wrote, but which is
part of the upstream codebase) is enabled which publishes messages on
various events.

The server systems run a message consumer for the purpose of
automatically scheduling jobs in response to the appearance of new
composes and critical path updates, and one each for the purpose of
reporting the results of completed jobs to ResultsDB and Wikitcms.
These use the `fm-consumer@` pattern from `fedora-messaging`.
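
As a quick orientation, the consumer units can be listed on the server with a systemd glob (the unit names matched are the ones referenced in the troubleshooting section below):

....
systemctl list-units 'fm-consumer@fedora_openqa*'
....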

== Worker hosts

@@ -98,7 +97,9 @@ networking (openvswitch) to interact with each other. All the

configuration for this should be handled by the ansible scripts, but
it's useful to be aware that there is complex software-defined
networking stuff going on on these hosts which could potentially be the
source of problems (backed by openvswitch). There is some more detail
on this in the wiki page and upstream docs; refer to the ansible plays
for the details of how it's actually configured.
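
If you suspect the software-defined networking is misbehaving, a quick way to inspect the current openvswitch state on a worker host (assuming the standard Open vSwitch tools are installed there) is:

....
ovs-vsctl show
....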

== Deployment and regular operation

@@ -129,15 +130,14 @@ The optimal approach to rebooting an entire openQA deployment is as

follows:

[arabic]
. Reboot the server
. Check for failed services (`systemctl --failed`) and restart any that
failed (see the example after this list)
. Once the server is fully functional, reboot the worker hosts
. Check for failed services and restart any that failed, particularly
the NFS mount service, on each worker host
. Check in the web UI for failed jobs and restart them, especially
tests of updates
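
A minimal sketch of the "check for failed services" steps (the mount unit name shown is only an example; the actual unit name depends on the mount point of the NFS share on that host):

....
systemctl --failed
# restart whatever shows up, e.g.:
systemctl restart var-lib-openqa-share.mount
....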

Rebooting the workers *after* the server is important due to the NFS
share.

@@ -149,9 +149,11 @@ getting confused about running jobs due to the websockets connections

being restarted.

If only a worker host needs restarting, there is no need to restart the
server too. Ideally, wait until no jobs are running on that host, and
stop all `openqa-worker@` services on the host before rebooting it; but
in a pinch, if you reboot with running jobs, they *should* be
automatically rescheduled. Still, you should manually check in the web
UI for failed jobs and restart them.
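
For example, following the same pattern used later in this guide for restarting worker services (the instance count varies by host):

....
for i in {1..10}; do systemctl stop openqa-worker@$i.service; done
reboot
....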

There are two ways to check if jobs are running and if so where. You can
go to the web UI for the server and click 'All Tests'. If any jobs are

@@ -163,11 +165,107 @@ and click on 'Workers', which will show the status of all known workers

for that server, and select 'Working' in the state filter box. This will
show all workers currently working on a job.

== Troubleshooting

=== New tests not being scheduled

Check that `fm-consumer@fedora_openqa_scheduler.service` is enabled,
running, and not crashing. If that doesn't do the trick, the scheduler
may be broken or the expected messages may not be being published.
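
For example, on the server:

....
systemctl status fm-consumer@fedora_openqa_scheduler.service
journalctl -u fm-consumer@fedora_openqa_scheduler.service --since "1 hour ago"
....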

=== Results not being reported to resultsdb and/or the wiki

Check that `fm-consumer@fedora_openqa_resultsdb_reporter.service` and
`fm-consumer@fedora_openqa_wiki_reporter.service` are enabled,
running, and not crashing.

=== Services that write to the wiki keep crashing

If `fm-consumer@fedora_openqa_wiki_reporter.service` (and other
services that write to the wiki, like the `relval` and `relvalami`
consumers) are constantly failing/crashing, the API token may have
been overwritten somehow. Re-run the relevant plays (on batcave01):

....
sudo rbac-playbook groups/openqa.yml -t openqa_dispatcher
....

If this does not sort it out, you may need help from a wiki admin
to work out what's going on.

=== Many tests failing on the same worker host, in unusual ways

Sometimes, worker hosts can just "go bad", through memory exhaustion,
for instance. This usually manifests as unusual test failures (for
instance, failures very early in a test that aren't caused by invalid
test files, tests that time out when they usually would not, or tests
that seem to just die suddenly with a cryptic error message). If you
encounter this, just reboot the affected worker host. This is more
common on staging than production, as we intentionally run the older,
weaker worker hosts on the staging instance. If things are particularly
bad you may not be able to ssh into the host, and will need to reboot
it from the sideband controller; if you're not sure how to do this,
contact someone from sysadmin-main for assistance.

=== Tests failing early, complaining about missing assets

If many tests are failing early with errors suggesting they can't
find required files, check for failed services on the worker hosts.
Sometimes the NFS mount service fails and needs restarting.

=== Disk space issues: server local root

If a server is running out of space on its local root partition, the
cause is almost certainly asset storage. Almost all the space on the
server root partition is used by test assets (ISO and hard disk image
files).

openQA has a system for limiting the amount of space used by asset
storage, which we configure via ansible variables. Check the values of
the `openqa_assetsize*` variables in the openQA server group variables
in ansible. If the settings for the server sum to the amount of space
used, or more than it, those settings may need to be reduced. If there
seems to be more space used than the settings would allow for, there
may be an issue preventing the openQA task that actually enforces the
limits from running: check the "Minion Dashboard" (from the top-right
menu) in the openQA web UI and look for stuck or failed `limit_assets`
tasks (or just check whether any have completed recently; the task is
scheduled after each completed job, so it should run frequently). There
is also an "Assets" link in the menu which gives you a web UI view of
the limits on each job group, the current size, and the present assets,
though note that the list of present assets and the current size are
updated by the `limit_assets` task, so they will be inaccurate if that
is not being run successfully. You must be an openQA operator to access
the "Assets" view, and an administrator to access the "Minion Dashboard".

In a pinch, if there is no space and tests are failing, you can wipe
older, larger asset files in `/var/lib/openqa/share/factory/iso` and
`/var/lib/openqa/share/factory/hdd` to get things moving again while
you debug the issue. This is better than letting new tests fail.
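
To see where the space is actually going before deleting anything, standard tools against the paths above are usually enough:

....
df -h /var/lib/openqa
du -sh /var/lib/openqa/share/factory/iso /var/lib/openqa/share/factory/hdd
# oldest assets first
ls -lhtr /var/lib/openqa/share/factory/hdd | head -n 20
....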

=== Disk space issues: testresults and images NFS share

As mentioned above, the server mounts two NFS shares from the infra
storage server, at `/var/lib/openqa/images` and
`/var/lib/openqa/testresults` (they are both actually backed by a
single volume). These are where the screenshots, video and logs of
the executed tests are stored. If they fill up, tests will start to
fail.

openQA has a garbage collection mechanism which deletes (most) files
from (most) jobs when they are six months old, which ought to keep
usage of these shares in a steady state. However, if we enhance test
coverage so openQA is running more tests in any given six month
period than earlier ones, space usage will increase correspondingly.
It can also increase in response to odd triggers like a bug which
causes a lot of messages to be logged to a serial console, or a test
being configured to upload a very large file as a log.

More importantly, there is a snapshot mechanism configured on this
volume for the production instance, so space usage will always
gradually increase there. When the volume gets too full, we must
delete some older snapshots to free up space. This must be done by
an infra storage admin. The volume's name is `fedora_openqa`.

== Scheduling jobs manually

@@ -184,6 +282,13 @@ correctly when restarting, but doesn't always manage to do it right;

when it goes wrong, the best thing to do is usually to re-run all jobs
for that medium.

Restarting a job should cause its status indicator (the little colored
blob) to go blue. If nothing changes, the restart likely failed. An
error message should explain why, but it always appears at the top
of the page, so you may need to scroll up to see it. If restarting
a test fails because an asset (an ISO file or hard disk image) is
missing, you will need to re-schedule the tests (see below).

To run or re-run the full set of tests for a compose or update, you can
use the `fedora-openqa` CLI. To run or re-run tests for a compose, use:

@@ -192,9 +297,11 @@ fedora-openqa compose -f (COMPOSE LOCATION)

....

where `(COMPOSE LOCATION)` is the full URL of the `/compose`
subdirectory of the compose. If you have an existing test to use as a
reference, go to the Settings tab, and the URL will be set as the
`LOCATION` setting. This will only work for Pungi-produced composes
with the expected productmd-format metadata, and a couple of other
quite special cases.
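
A filled-in example (the compose ID and date are purely illustrative; substitute a real compose URL of this form):

....
fedora-openqa compose -f https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20230101.n.0/compose
....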

The `-f` argument means 'force', and is necessary to re-run tests:
usually, the scheduler will refuse to re-schedule tests that have

@@ -203,42 +310,25 @@ already run, and `-f` overrides this.

To run or re-run tests for an update, use:

....
fedora-openqa update -f (UPDATEID)
....

where `(UPDATEID)` is the update's ID - something like
`FEDORA-2018-blahblah`.

To run or re-run only the tests for a specific "flavor", you can pass
the `--flavor` (update) or `--flavors` (compose) argument - for an
update it must be a single flavor, for a compose it may be a single
flavor or a comma-separated list. The names of the flavors are shown
in the web UI results overview for the compose or update, e.g.
"Server-boot-iso". For update tests, omit the leading "updates-" in
the flavor name (so, to re-schedule the "updates-workstation" tests
for an update, you would pass `--flavor workstation`).
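
Putting that together, re-running just the workstation flavor for an update might look like this (the update ID is illustrative; check `fedora-openqa update --help` for exact argument ordering):

....
fedora-openqa update -f --flavor workstation FEDORA-2023-abcdef0123
....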

Less commonly, you can schedule tests for scratch builds using
`fedora-openqa task` and side tags using `fedora-openqa tag`. This
should usually only be done on the staging instance. See the help
of `fedora-openqa` for more details.

openQA provides a special script for cloning an existing job but
optionally changing one or more variable values, which can be useful in

@@ -253,20 +343,37 @@ For interdependent jobs, you may or may not want to use the

`--skip-deps` argument to avoid re-running the cloned job's parent
job(s), depending on circumstances.
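
As a sketch (assuming the cloning script in question is the upstream `openqa-clone-job` tool; the job URL and variable here are made up for illustration):

....
openqa-clone-job --skip-deps https://openqa.fedoraproject.org/tests/123456 PACKAGE_SET=default
....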

In very odd circumstances you may need to schedule jobs via an API
request using the low-level CLI client provided by upstream,
`openqa-client`; see http://open.qa/docs/#_triggering_tests for details
on this. You may need to refer to the `schedule.py` file in the
`fedora_openqa` source to figure out exactly what settings to pass to
the scheduler when doing this. It's extremely unusual to have to do
this, though, so probably don't worry about it.

== Manual updates

In general updates to any of the components of the deployments should be
handled via ansible: push the changes out in the appropriate way (git
repo update, package update, etc.) and then run the ansible plays. There
is an `openqa_scratch` variable which can be set to a list of Koji
task IDs for scratch builds; these will be downloaded and configured as
a side repository. This can be used to deploy a newer build of openQA
and/or os-autoinst before it has reached updates-testing if desired
(usually we would do this only on the staging instance). Also, the
`openqa_repo` variable can be set to "updates-testing" to install or
update openQA components with updates-testing enabled, to get a new
version before it has waited a week to reach stable.
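
A hypothetical group_vars snippet using those variables (the task ID is an invented placeholder; check the existing openQA group variables in ansible for the exact expected structure):

....
openqa_repo: updates-testing
openqa_scratch:
  - 123456789
....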

However, sometimes we do want to update or test a change to something
manually for some reason. Here are some notes on those cases.

For updating openQA and/or os-autoinst packages: ideally, ensure no jobs
are running. Then, update all installed subpackages on the server. The
server services should be automatically restarted as part of the package
update. Then, update all installed subpackages on the worker hosts.
Usually this should cause the worker services to be restarted, but if
not, a 'for' loop can help with that, for instance:

....
for i in {1..10}; do systemctl restart openqa-worker@$i.service; done

@@ -279,11 +386,10 @@ For updating the openQA tests:

....
cd /var/lib/openqa/share/tests/fedora
git pull (or git checkout (branch) or whatever)
./fifloader.py -c -l templates.fif.json templates-updates.fif.json
....

The fifloader step is only necessary if there are any changes to the
templates files.

For updating the scheduler code:

@@ -292,18 +398,19 @@ For updating the scheduler code:

....
cd /root/fedora_openqa
git pull (or whatever changes)
python setup.py install
systemctl restart fm-consumer@fedora_openqa_scheduler.service
systemctl restart fm-consumer@fedora_openqa_resultsdb_reporter.service
systemctl restart fm-consumer@fedora_openqa_wiki_reporter.service
....

Updating other components of the scheduling process follows the same
pattern: update the code or package, then remember to restart
the message consumers. It's possible for the openQA instances to need
fedfind updates in advance of them being pushed to stable: for example,
when a new compose type is invented and fedfind doesn't understand it,
openQA can end up trying to schedule tests for it, or the scheduler
consumer can crash; when this happens we have to fix and update
fedfind on the openQA instances ASAP.

== Logging

@@ -338,26 +445,28 @@ images are part of the tool itself).

This process isn't 100% reliable; `virt-install` can sometimes fail,
either just quasi-randomly or every time, in which case the cause of the
failure needs to be figured out and fixed so the affected image can be
(re-)built. This kind of failure is quite "invisible", as when
regeneration of an image fails, we just keep the old version; this
might be the problem if update tests start failing because the initial
update to bring the system fully up to date times out, for instance.

The images for each arch are built on one worker host of that arch
(nominated by inclusion in an ansible inventory group that exists for
this purpose); those hosts have write access to the NFS share for this
purpose.

== Compose check reports (check-compose)

An additional ansible role runs on each openQA server, called
`check-compose`. This role installs a tool (also called `check-compose`)
and an associated message consumer. The consumer kicks in when all openQA
tests for any compose finish, and uses the `check-compose` tool to send
out an email report summarizing the results of the tests (well, the
production server sends out emails, the staging server just logs the
contents of the report). This role isn't really a part of openQA proper,
but is run on the openQA servers as it seems like as good a place as any
to do it. As with all other message consumers, if making manual changes
or updates to the components, remember to restart the consumer service
afterwards.

== Autocloud ResultsDB forwarder (autocloudreporter)