openQA and branching SOP changes related to openQA at branch time #360

Open
adamwill wants to merge 1 commit from openqa-update into master
2 changed files with 215 additions and 13 deletions


@ -4,12 +4,20 @@ include::_partials/attributes.adoc[]
== Description
At each alpha freeze we branch the pending release away from `devel/`
At a certain point in each cycle, we branch the pending release away from rawhide,
which allows rawhide (currently F{rawhide}) to move on while the pending release goes into
bugfix and polish mode.
You will find below the list of steps to follow to branch a new Fedora release.
== Co-ordinate with Quality and CI teams for gating
Branching is a disruptive event for the openQA and Fedora CI systems.
Before starting the branching process, contact the Quality and CI teams to alert them.
Ideally, someone from the Quality team should follow along with the branching process in real time.
The "Branching procedure" section of the xref:sysadmin_guide:openqa.adoc[openQA infrastructure SOP] must be carried out alongside this SOP.
If this is not done, there is a high chance that updates for the new branched release and Rawhide will fail gating due to unresolved test failures.
== Mass resigning
When we branch off of rawhide, the branched release packages are already signed by
@ -190,6 +198,24 @@ infra ansible repo. This change includes, updating `koji-sync-listener.py`,
Please check these files from the https://pagure.io/fedora-infra/ansible/c/549e5d3ace41c04fdbef9d81f359f16c2fe0c2fa?branch=main[commit] for your reference.
== Fedfind metadata
The https://fedorapeople.org/groups/qa/metadata/release.json[metadata file] used by https://pagure.io/fedora-qa/fedfind[fedfind] needs to be updated for the new branch.
The file is in the QA team's fedorapeople space. Log in to `fedorapeople.org` and edit it at `/srv/groups/qa/metadata/release.json`.
It should be changed right around the time the new release is added to Bodhi.
When you open it, the `branched` array should be empty.
Make the new Branched release number the only entry in this array.
So when branching Fedora 43, the line should change from:
```
"branched": [],
```
to:
```
"branched": [43],
```
This is the only change required.
If you do not have privileges to edit the file, ask in the Fedora Quality chat for someone to edit it.
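As a minimal sketch of the edit described above, the change could be scripted rather than made by hand. This assumes only that the file is valid JSON and contains one `branched` array somewhere in its structure; the rest of the file layout is not shown here, so the sketch searches for the key instead of assuming where it lives. The helper names `find_branched_parent` and `set_branched` are hypothetical and are not part of fedfind.

```python
import json


def find_branched_parent(node):
    """Return the first dict that has a "branched" key, searching depth-first."""
    if isinstance(node, dict):
        if "branched" in node:
            return node
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_branched_parent(child)
        if found is not None:
            return found
    return None


def set_branched(text, release):
    """Return updated JSON text with `release` as the only Branched entry."""
    data = json.loads(text)
    parent = find_branched_parent(data)
    if parent is None:
        raise ValueError('no "branched" array found')
    if parent["branched"]:
        # The SOP expects the array to be empty before branching, so stop
        # and let a human look rather than overwrite an existing entry.
        raise ValueError(f'"branched" is not empty: {parent["branched"]}')
    parent["branched"] = [release]
    return json.dumps(data, indent=4)


# Hypothetical, cut-down file content; the real file has keys not shown here.
sample = '{"branched": []}'
print(set_branched(sample, 43))
```

Refusing to touch a non-empty array is a deliberate choice in this sketch: it matches the expectation above that the array is empty when you open the file, and turns any surprise into a visible error.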
== Toddlers
=== Add new SLA to the toddlers App


@ -115,15 +115,200 @@ should be scheduled and run automatically when new composes and critical
path updates appear, and results should be reported to ResultsDB and
Wikitcms (when appropriate). Dynamically generated assets should be
regenerated regularly, including across release boundaries (see the
section on createhdds below): no manual intervention should be required
when a new Fedora release appears. If any of this does not happen,
something is wrong, and manual inspection is needed.
section on createhdds below). If any of this does not happen, something
is wrong, and manual inspection is needed. However, at branching (when
a new Fedora release branches from Rawhide), some manual intervention
is usually required to ensure the smoothest possible transition. See
<<Branching procedure>> below.
Our usual practice is to upgrade the openQA systems to new Fedora
releases promptly as they appear, using `dnf system-upgrade`. This is
done manually. We usually upgrade the staging instance first and watch
for problems for a week or two before upgrading production.
== Updating 'needles'
Needles are the 'magic screenshots' openQA uses for testing. A needle
consists of two files - `somefile.png` (the screenshot itself), and
`somefile.json` (the metadata). The names must match. Needles are
usually created using the openQA web UI. This will create the two
files in the `/var/lib/openqa/share/tests/fedora/needles` directory
on the server. In Fedora openQA we do not leave it like this. We keep
the needles in a https://pagure.io/fedora-qa/os-autoinst-distri-fedora[git repository]. After creating a
needle, copy the files out to a local checkout of that
repository, place them in the appropriate subdirectory (we organize our
needles into subdirectories), commit them, push the commit, then update
the checkout back on the server, and remove the "working copy" of the
needle from the top-level `needles` directory. Also remember to update
the checkout on the other instance. If the lab instance is on a
different branch and you need it to have the new needle, rebase that
branch on the updated `main` branch and force-push it back (but of
course, make sure your local checkout of the feature branch is fully
up to date before rebasing and force pushing).
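Because the `.png` and `.json` names must match, a quick check for unpaired files in the top-level `needles` directory can catch a half-copied or half-removed needle. The following is a sketch only; the function name `unpaired_needles` is hypothetical, and it looks only at the top level of the given directory, not at the subdirectories in the git checkout.

```python
from pathlib import Path


def unpaired_needles(directory):
    """Return sorted needle base names that lack either the .png or the .json."""
    root = Path(directory)
    pngs = {p.stem for p in root.glob("*.png")}
    jsons = {p.stem for p in root.glob("*.json")}
    # Symmetric difference: names present in one set but not in the other.
    return sorted(pngs ^ jsons)


if __name__ == "__main__":
    # Path taken from the text above; adjust if your deployment differs.
    for name in unpaired_needles("/var/lib/openqa/share/tests/fedora/needles"):
        print(f"needle '{name}' is missing its .png or .json file")
```

An empty result after you have cleaned up also confirms that no "working copy" files were left behind in the top-level directory.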
== Branching procedure
Branching is a disruptive time for openQA operation. Since openQA
associates a release number with Rawhide, two things change from its
perspective during branching: the release number associated with
Rawhide changes, and the release number formerly associated with
Rawhide is now 'taken over' by the new branched release. As the
branching process takes some time, and there is no "perfect" point
at which the transition can be done smoothly, it is normal that
update tests for both Rawhide and the new branched release will fail
for some hours around the branching process. The best we can do is to
mitigate this as far as possible.
openQA's behavior around branching will depend on when fedfind's
https://fedorapeople.org/groups/qa/metadata/release.json[release metadata] is updated. Until that is updated, openQA will
continue to believe that Rawhide "owns" the "old" release number: the
RAWREL variable will be set to that number, and tests of updates for
that release number will behave as if it is Rawhide. If updates with
the new release number are created before this metadata is updated,
the openQA scheduler will be confused by them and ignore them.
Once the metadata is updated, openQA will act as if branching has
happened: tests will be scheduled for updates with the "new" number,
and tests of updates for the "old" number will behave as expected for
a Branched release, not Rawhide.
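The behavior described above can be summarized as a small decision table. This sketch only restates the prose; it is not how openQA or fedfind implement it, the function name `classify_update` is hypothetical, and it assumes the new Rawhide number is the old one plus one (as in the Fedora 43 and 44 example used in this document).

```python
def classify_update(update_release, old_rawhide, metadata_updated):
    """Describe how an update is treated around branching, per the text above."""
    new_rawhide = old_rawhide + 1  # assumption: release numbers increase by one
    if update_release == old_rawhide:
        # Before the metadata edit, the old number still "belongs" to Rawhide.
        return "branched" if metadata_updated else "rawhide"
    if update_release == new_rawhide:
        # Before the metadata edit, the scheduler ignores the new number.
        return "rawhide" if metadata_updated else "ignored"
    return "other"


for updated in (False, True):
    for release in (43, 44):
        print(f"metadata_updated={updated} release={release}: "
              f"{classify_update(release, 43, updated)}")
```

The "ignored" case is the one that later needs manual triggering, which is why the timing of the metadata edit matters.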
The key tasks to make Branching go as smoothly as possible in openQA
are:
* Get the fedfind metadata updated as close as possible to 'the right'
time, which should be just before the first update for the new number
is created in Bodhi. This task is in the xref:release_guide:sop_mass_branching.adoc[Mass Branching SOP],
but releng may contact us to do the edit if they don't have permissions
* Build base disk images for the new release number as soon as the
metadata is updated and a post-branching Rawhide compose exists
* Rebuild base disk images for the old release number as soon as the
first post-branching Branched compose exists
* Trigger tests for any Rawhide updates for which they were missed
* Create version identification needles for the new release number
as soon as possible
* Disable the desktop_background test for the Branched release if a new
background image does not yet exist
* Aggressively monitor and retry failures
=== Rebuilding base disk images
Base disk images can only be rebuilt successfully once a post-branch
compose for the release has completed and synced to https://dl.fedoraproject.org/pub/fedora/linux/development/[dl.fedoraproject.org].
Stay in touch with the release engineering team and monitor the chat
channel to keep up with this process. Once you have verified that a
post-branch Rawhide compose has synced to https://dl.fedoraproject.org/pub/fedora/linux/development/rawhide/[the rawhide repository],
build new base images for Rawhide. Once you have verified that the
first Branched compose has synced to the numbered directory under
the development directory, rebuild base images for that release (most
will already exist, but will be pre-branch Rawhide images, which will
likely cause issues in the tests).
To (re)build images, log into the openQA worker hosts tasked with disk
image builds for each arch on each openQA instance. These are listed
in the `openqa_hdds_workers` group in the ansible inventory. Become
root, then go to the correct directory, and run the command to rebuild
all images for the release, where `NN` is the release number to build for:
```
cd /var/lib/openqa/share/factory/hdd/fixed
/root/createhdds/createhdds.py all -r NN -f
```
So when branching Fedora 43 from Rawhide, we would pass `-r 44` to
build the new Rawhide base disk images, and `-r 43` to rebuild the
43 base disk images with the new Branched compose.
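To make the two invocations explicit, here is a small sketch that prints the commands for a given branched release. It uses only the path, directory and flags shown above, and assumes the new Rawhide number is the branched number plus one. The function name `rebuild_commands` is hypothetical; the sketch prints commands rather than running them, since each one should only be run after the relevant compose has synced.

```python
import shlex

CREATEHDDS = "/root/createhdds/createhdds.py"
IMAGE_DIR = "/var/lib/openqa/share/factory/hdd/fixed"


def rebuild_commands(branched_release):
    """Return (reason, argv) pairs for the two image rebuilds at branching."""
    new_rawhide = branched_release + 1  # assumption: numbers increase by one
    plan = [
        ("new Rawhide images, after a post-branch Rawhide compose has synced",
         new_rawhide),
        ("Branched images, after the first Branched compose has synced",
         branched_release),
    ]
    return [(why, [CREATEHDDS, "all", "-r", str(rel), "-f"]) for why, rel in plan]


for why, argv in rebuild_commands(43):
    print(f"# {why}")
    print(f"cd {IMAGE_DIR} && {shlex.join(argv)}")
```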
=== Triggering missed Rawhide tests
If any critical path Rawhide updates are created under the new release
number before the fedfind metadata is updated, openQA will fail to
schedule tests for them. Once the metadata is updated and new Rawhide
base images are built, check in the Bodhi web UI for any Rawhide
updates that have failed gating. Look on the automated tests page for
each update. If they show tests as missing (rather than failed),
check the openQA web UI and see if you can find any tests for the
update. If not, you will need to trigger the tests for that update by
running:
```
fedora-openqa update -f (UPDATE ID)
```
from the openQA server.
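If several updates were missed, the same command can be looped over a list of update IDs. This is a sketch under assumptions: the function name `trigger_updates` is hypothetical, the update ID in the example is a made-up placeholder, and the default is a dry run that only prints the commands so you can review them first.

```python
import subprocess


def trigger_updates(update_ids, dry_run=True):
    """Build, and optionally run, one trigger command per update ID."""
    commands = [["fedora-openqa", "update", "-f", uid] for uid in update_ids]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first failure instead of carrying on.
            subprocess.run(cmd, check=True)
    return commands


# Placeholder ID for illustration only; use the real IDs from Bodhi.
trigger_updates(["FEDORA-0000-placeholder"])
```

Pass `dry_run=False` on the openQA server once the printed commands look right.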
=== Creating new version identification needles
The installer tests have a check that the installer shows the correct
release number. Needles for the new Rawhide release number will not
yet exist at the time of branching. The first time install tests run
for the new Rawhide release number and reach the point where this
check happens, they will fail looking for a needle with the tag
`version_NN_ident`, where `NN` is the new Rawhide release number.
On one of the openQA instances, use the web UI needle editor to create
a new needle with the correct match area and tags (reference the
existing needles for the previous release https://pagure.io/fedora-qa/os-autoinst-distri-fedora/blob/main/f/needles/anaconda/identification[here]). You will
need to create two needles, one for the GTK UI and one for the web UI,
so long as we test images with both installer UIs. Creating a needle
adds a .json and a .png file in the
`/var/lib/openqa/share/tests/fedora/needles` directory. These files
need to be copied out and checked into the os-autoinst-distri-fedora
git repo, in the `anaconda/identification` folder, then pushed back
to both openQA instances, after which the 'working copy' in the
top-level `needles` directory can be removed.
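To confirm that needles for the new number are present in a checkout, you can search the needle metadata for the expected tag. This sketch assumes the needle `.json` files carry their tags in a top-level `tags` list, which is the usual openQA needle layout but is not spelled out in this document. The function names `ident_tag` and `needles_with_tag` are hypothetical.

```python
import json
from pathlib import Path


def ident_tag(release):
    """Build the tag name the installer tests look for, as described above."""
    return f"version_{release}_ident"


def needles_with_tag(directory, tag):
    """Return file names of needle .json files under `directory` that carry `tag`."""
    matches = []
    for path in sorted(Path(directory).rglob("*.json")):
        try:
            needle = json.loads(path.read_text())
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or malformed files
        # Assumption: tags are stored in a top-level "tags" list.
        if isinstance(needle, dict) and tag in needle.get("tags", []):
            matches.append(path.name)
    return matches


if __name__ == "__main__":
    found = needles_with_tag(
        "/var/lib/openqa/share/tests/fedora/needles", ident_tag(44)
    )
    print(found or "no matching needles yet")
```

Since two needles are expected while both installer UIs are tested, a result with fewer than two entries suggests one is still missing.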
=== Disabling the desktop_background test
The desktop_background test will initially fail for all updates for
the newly-branched release. If the release already contains a new,
unique background (different from that of the current stable release)
you can create a new needle for it, following much the same process
as for the version identification needles. Otherwise, the test must be
disabled until the new background is ready.
Here is https://pagure.io/fedora-qa/os-autoinst-distri-fedora/c/f8810b67b4fc461d1e060c0c9449991c6b18b68d?branch=main[a sample commit] that disabled the test for
Fedora 42. You can just follow that example with a new commit, push it
out, and pull it to both openQA instances. Then re-run all failed
instances of the test.
=== Restarting failures
The update gating configuration is updated during the main releng
branch SOP, so update gating will be active for the newly-branched
release and for Rawhide under its new release number almost
immediately. It is therefore critical that we ensure all failed update
tests are re-run until they pass or the failure is deemed 'genuine'
(i.e. not due to the branching process, but a real bug in the update).
Throughout the branching process, constantly keep an eye on the web UI
summary page for the Fedora Updates and Fedora AArch64 Updates groups.
Also keep the web UI detail pages for one new-Rawhide and one
new-Branched update open, and retry failures whenever you think their
cause may have been addressed.
Once you are sure tests are generally working for both new-Rawhide and
new-Branched, systematically go through and restart all failed tests.
Some of the restarts may fail (due to normal flakiness, or the heavy
load of running so many tests at once) - keep an eye on these, and
keep up the restarts until they pass or the failure appears 'genuine'.
If tests are failing and the cause looks like something to do with the
branching process - classic symptoms are RPM signature errors or 404s
from the mirror system - contact the release engineering team to get
these rectified, and retry once they tell you the issue is addressed.
Beware of tests with the wrong `RAWREL` value. This variable records
the Rawhide release number. When running tests after branching it
should always be the new Rawhide release number; running tests after
branching with it set to the old number can cause various failures.
It's common for some tests scheduled around branching to fail and have
the old `RAWREL` value; you will find you cannot get these tests to
pass with regular restarts. Always check the `RAWREL` value of failed
tests, and if it's wrong, instead of just restarting the test through
the web UI, retrigger the tests for that update from the server:
```
fedora-openqa update -f (UPDATE ID)
```
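The restart-or-retrigger decision above can be sketched as a simple sort of failed jobs by their `RAWREL` value. The job dictionaries here are a hypothetical shape made up for illustration (they are not the openQA API response format), and the function name `plan_reruns` is likewise an assumption.

```python
def plan_reruns(failed_jobs, new_rawhide):
    """Split failed jobs into web UI restarts and updates to retrigger."""
    restart_job_ids = []
    retrigger_updates = set()
    for job in failed_jobs:
        if str(job["RAWREL"]) == str(new_rawhide):
            # Correct RAWREL: a normal restart can succeed.
            restart_job_ids.append(job["id"])
        else:
            # Stale RAWREL: restarting keeps the old value, so retrigger
            # the whole update from the server instead.
            retrigger_updates.add(job["update"])
    return restart_job_ids, sorted(retrigger_updates)


# Hypothetical data for illustration only.
jobs = [
    {"id": 1, "update": "FEDORA-0000-a", "RAWREL": "44"},
    {"id": 2, "update": "FEDORA-0000-b", "RAWREL": "43"},
    {"id": 3, "update": "FEDORA-0000-b", "RAWREL": "43"},
]
print(plan_reruns(jobs, 44))
```

Collecting updates in a set means each affected update is retriggered once, however many of its jobs carried the stale value.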
Your end goal, as always, is for all outstanding failures to be
definitely identified as genuine bugs in the update, with a comment
linking to a Bodhi comment or bug report that identifies the issue.
== Rebooting / restarting
The optimal approach to rebooting an entire openQA deployment is as
@ -468,12 +653,3 @@ but is run on the openQA servers as it seems like as good a place as any
to do it. As with all other message consumers, if making manual changes
or updates to the components, remember to restart the consumer service
afterwards.
== Autocloud ResultsDB forwarder (autocloudreporter)
An ansible role called `autocloudreporter` also runs on the openQA
production server. This has nothing to do with openQA at all, but is run
there for convenience. This role deploys a fedmsg consumer that listens
for fedmsgs indicating that Autocloud (a separate automated test system
which tests cloud images) has completed a test run, then forwards those
results to ResultsDB.