openQA and branching SOP changes related to openQA at branch time #360
2 changed files with 215 additions and 13 deletions
|
@ -4,12 +4,20 @@ include::_partials/attributes.adoc[]
|
|||
|
||||
== Description
|
||||
|
||||
At each alpha freeze we branch the pending release away from `devel/`
|
||||
At a certain point in each cycle, we branch the pending release away from rawhide,
|
||||
which allows rawhide (currently F{rawhide}) to move on while the pending release goes into
|
||||
bugfix and polish mode.
|
||||
|
||||
You will find below the list of steps to follow to branch a new Fedora release.
|
||||
|
||||
== Co-ordinate with Quality and CI teams for gating
|
||||
|
||||
Branching is a disruptive event for the openQA and Fedora CI systems.
|
||||
Before starting the branching process, contact the Quality and CI teams to alert them.
|
||||
Ideally, someone from the Quality team should follow along with the branching process in real time.
|
||||
The "Branching procedure" in the xref:sysadmin_guide:openqa.adoc[openQA infrastructure SOP] must be carried out alongside this SOP.
|
||||
If this is not done, there is a high chance that updates for the new branched release and Rawhide will fail gating due to unresolved test failures.
|
||||
|
||||
== Mass resigning
|
||||
|
||||
When we branch off of rawhide, the branched release packages are already signed by
|
||||
|
@ -190,6 +198,24 @@ infra ansible repo. This change includes, updating `koji-sync-listener.py`,
|
|||
|
||||
Please check these files from the https://pagure.io/fedora-infra/ansible/c/549e5d3ace41c04fdbef9d81f359f16c2fe0c2fa?branch=main[commit] for your reference.
|
||||
|
||||
== Fedfind metadata
|
||||
|
||||
The https://fedorapeople.org/groups/qa/metadata/release.json[metadata file] used by https://pagure.io/fedora-qa/fedfind[fedfind] needs to be updated for the new branch.
|
||||
The file is in the QA team's fedorapeople space. Log in to `fedorapeople.org` and edit it at `/srv/groups/qa/metadata/release.json`.
|
||||
It should be changed right around the time the new release is added to Bodhi.
|
||||
When you open it, the `branched` array should be empty.
|
||||
Make the new Branched release number the only entry in this array.
|
||||
So when branching Fedora 43, the line should change from:
|
||||
```
|
||||
"branched": [],
|
||||
```
|
||||
to:
|
||||
```
|
||||
"branched": [43],
|
||||
```
|
||||
This is the only change required.
|
||||
If you do not have privileges to edit the file, ask in the Fedora Quality chat for someone to edit it.
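
The edit is small enough to make by hand, but a minimal Python sketch may make the intended change concrete. It assumes only what this section states (a top-level `branched` array that is empty before branching). The helper name `set_branched` is hypothetical, and any other keys in the file are passed through untouched.

```python
import json


def set_branched(metadata: dict, release: int) -> dict:
    """Return a copy of the metadata with `release` as the only Branched entry.

    Assumes `metadata` is the parsed content of release.json and that its
    "branched" array is empty before branching, as this SOP describes.
    """
    current = metadata.get("branched", [])
    if current == [release]:
        # The edit has already been made: nothing to do.
        return dict(metadata)
    if current:
        # Anything else is unexpected; stop and ask the Quality team.
        raise ValueError(f"expected an empty 'branched' array, found {current}")
    updated = dict(metadata)
    updated["branched"] = [release]
    return updated


# Stand-in document for illustration; the real file contains other keys too.
before = json.loads('{"branched": []}')
after = set_branched(before, 43)
print(json.dumps(after))
```

Refusing to overwrite a non-empty array is a deliberate choice in this sketch: an unexpected value suggests the previous cycle was not cleaned up, which is worth a question in the Fedora Quality chat before changing anything.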
|
||||
|
||||
== Toddlers
|
||||
|
||||
=== Add new SLA to the toddlers App
|
||||
|
|
|
@ -115,15 +115,200 @@ should be scheduled and run automatically when new composes and critical
|
|||
path updates appear, and results should be reported to ResultsDB and
|
||||
Wikitcms (when appropriate). Dynamically generated assets should be
|
||||
regenerated regularly, including across release boundaries (see the
|
||||
section on createhdds below): no manual intervention should be required
|
||||
when a new Fedora release appears. If any of this does not happen,
|
||||
something is wrong, and manual inspection is needed.
|
||||
section on createhdds below). If any of this does not happen, something
|
||||
is wrong, and manual inspection is needed. However, at branching (when
|
||||
a new Fedora release branches from Rawhide), some manual intervention
|
||||
is usually required to ensure the smoothest possible transition. See
|
||||
<<Branching procedure>> below.
|
||||
|
||||
Our usual practice is to upgrade the openQA systems to new Fedora
|
||||
releases promptly as they appear, using `dnf system-upgrade`. This is
|
||||
done manually. We usually upgrade the staging instance first and watch
|
||||
for problems for a week or two before upgrading production.
|
||||
|
||||
== Updating 'needles'
|
||||
|
||||
Needles are the 'magic screenshots' openQA uses for testing. A needle
|
||||
consists of two files - `somefile.png` (the screenshot itself), and
|
||||
`somefile.json` (the metadata). The names must match. Needles are
|
||||
usually created using the openQA web UI. This will create the two
|
||||
files in the `/var/lib/openqa/share/tests/fedora/needles` directory
|
||||
on the server. In Fedora openQA we do not leave it like this. We keep
|
||||
the needles in a https://pagure.io/fedora-qa/os-autoinst-distri-fedora[git repository]. After creating a
|
||||
needle, you should copy the files out to a local checkout of that
|
||||
repository, place them in the appropriate subdirectory - we organize our
needles into subdirectories - commit them, push the commit, then update
|
||||
the checkout back on the server, and remove the "working copy" of the
|
||||
needle in the top-level `needles` directory. Also remember to update
|
||||
the checkout on the other instance. If the lab instance is on a
|
||||
different branch and you need it to have the new needle, rebase that
|
||||
branch on the updated `main` branch and force-push it back (but of
|
||||
course, make sure your local checkout of the feature branch is fully
|
||||
up to date before rebasing and force pushing).
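
Because a needle is only valid when its `.png` and `.json` names match, a quick check before committing can catch a file that was left behind. The following is a minimal sketch, not part of any openQA tooling; the function name `unpaired_needles` is hypothetical.

```python
from pathlib import PurePath


def unpaired_needles(filenames):
    """Return sorted needle base names that lack either the .png or the .json."""
    pngs = {PurePath(name).stem for name in filenames if name.endswith(".png")}
    jsons = {PurePath(name).stem for name in filenames if name.endswith(".json")}
    # Symmetric difference: names present in one set but not in the other.
    return sorted(pngs ^ jsons)


# Illustrative file names only.
files = ["somefile.png", "somefile.json", "orphan.png"]
print(unpaired_needles(files))
```

To use it on the server, pass it the file names found in the top-level `needles` directory before copying them into the git checkout.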
|
||||
|
||||
== Branching procedure
|
||||
|
||||
Branching is a disruptive time for openQA operation. Since openQA
|
||||
associates a release number with Rawhide, two things change from its
|
||||
perspective during branching: the release number associated with
|
||||
Rawhide changes, and the release number formerly associated with
|
||||
Rawhide is now 'taken over' by the new branched release. As the
|
||||
branching process takes some time, and there is no "perfect" point
|
||||
at which the transition can be done smoothly, it is normal that
|
||||
update tests for both Rawhide and the new branched release will fail
|
||||
for some hours around the branching process. The best we can do is to
|
||||
mitigate this as far as possible.
|
||||
|
||||
openQA's behavior around branching will depend on when fedfind's
|
||||
https://fedorapeople.org/groups/qa/metadata/release.json[release metadata] is updated. Until that is updated, openQA will
|
||||
continue to believe that Rawhide "owns" the "old" release number: the
|
||||
`RAWREL` variable will be set to that number, and tests of updates for
|
||||
that release number will behave as if it is Rawhide. If updates with
|
||||
the new release number are created before this metadata is updated,
|
||||
the openQA scheduler will be confused by them and ignore them.
|
||||
|
||||
Once the metadata is updated, openQA will act as if branching has
|
||||
happened - tests will be scheduled for updates with the "new" number, and
|
||||
tests for updates for the "old" number will act consistently with it
|
||||
being Branched, not Rawhide.
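
The before-and-after behavior described above can be summarized with a toy model. This is only a sketch of the outcomes this section describes, not openQA's actual scheduler logic; the function name and the returned labels are hypothetical.

```python
def classify_update(update_release: int, rawrel: int) -> str:
    """Toy model of how an update is treated for a given RAWREL value."""
    if update_release == rawrel:
        # openQA believes this release number belongs to Rawhide.
        return "rawhide"
    if update_release > rawrel:
        # A number newer than Rawhide is unknown, so the update is ignored.
        return "ignored"
    return "not-rawhide"


# Branching Fedora 43: RAWREL is 43 before the metadata edit and 44 after it.
for rawrel in (43, 44):
    for release in (43, 44):
        print(rawrel, release, classify_update(release, rawrel))
```

The model makes the timing problem visible: with the old `RAWREL`, updates for the new number are dropped, which is why the metadata edit should land just before the first such update is created.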
|
||||
|
||||
The key tasks to make Branching go as smoothly as possible in openQA
|
||||
are:
|
||||
|
||||
* Get the fedfind metadata updated as close as possible to 'the right'
|
||||
time, which should be just before the first update for the new number
|
||||
is created in Bodhi. This task is in the xref:release_guide:sop_mass_branching.adoc[Mass Branching SOP],
|
||||
but releng may contact us to do the edit if they don't have permissions
|
||||
* Build base disk images for the new release number as soon as the
|
||||
metadata is updated and a post-branching Rawhide compose exists
|
||||
* Rebuild base disk images for the old release number as soon as the
|
||||
first post-branching Branched compose exists
|
||||
* Trigger tests for any Rawhide updates for which they were missed
|
||||
* Create version identification needles for the new release number
|
||||
as soon as possible
|
||||
* Disable the desktop_background test for the Branched release if a new
|
||||
background image does not yet exist
|
||||
* Aggressively monitor and retry failures
|
||||
|
||||
=== Rebuilding base disk images
|
||||
|
||||
Base disk images can only be rebuilt successfully once a post-branch
|
||||
compose for the release has completed and synced to https://dl.fedoraproject.org/pub/fedora/linux/development/[dl.fedoraproject.org].
|
||||
Stay in touch with the release engineering team and monitor the chat
|
||||
channel to keep up with this process. Once you have verified that a
|
||||
post-branch Rawhide compose has synced to https://dl.fedoraproject.org/pub/fedora/linux/development/rawhide/[the rawhide repository],
|
||||
build new base images for Rawhide. Once you have verified that the
|
||||
first Branched compose has synced to the numbered directory under
|
||||
the development directory, rebuild base images for that release (most
|
||||
will already exist, but will be pre-branch Rawhide images, which will
|
||||
likely cause issues in the tests).
|
||||
|
||||
To (re)build images, log into the openQA worker hosts tasked with disk
|
||||
image builds for each arch on each openQA instance. These are listed
|
||||
in the `openqa_hdds_workers` group in the ansible inventory. Become
|
||||
root, then go to the correct directory, and run the command to rebuild
|
||||
all images for the release, where `NN` is the release number to build for:
|
||||
```
|
||||
cd /var/lib/openqa/share/factory/hdd/fixed
|
||||
/root/createhdds/createhdds.py all -r NN -f
|
||||
```
|
||||
So when branching Fedora 43 from Rawhide, we would pass `-r 44` to
|
||||
build the new Rawhide base disk images, and `-r 43` to rebuild the
|
||||
43 base disk images with the new Branched compose.
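
As a sketch of the two invocations, the following Python builds the command lines for both release numbers. It uses only the path and flags shown above, and it assumes the new Rawhide number is the Branched number plus one, as in the Fedora 43 example. The function names are hypothetical.

```python
import shlex

CREATEHDDS = "/root/createhdds/createhdds.py"
IMAGE_DIR = "/var/lib/openqa/share/factory/hdd/fixed"


def createhdds_command(release: int) -> list[str]:
    """Build the argument list that rebuilds all images for one release."""
    return [CREATEHDDS, "all", "-r", str(release), "-f"]


def branching_commands(branched_release: int) -> list[str]:
    """Return the command lines for new Rawhide first, then new Branched."""
    new_rawhide = branched_release + 1  # assumption: numbers are consecutive
    releases = (new_rawhide, branched_release)
    return [shlex.join(createhdds_command(r)) for r in releases]


print(f"cd {IMAGE_DIR}")
for line in branching_commands(43):
    print(line)
```

The sketch only prints the commands. The two builds are not run back to back: each one has to wait until the matching post-branch compose has synced.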
|
||||
|
||||
=== Triggering missed Rawhide tests
|
||||
|
||||
If any critical path Rawhide updates are created under the new release
|
||||
number before the fedfind metadata is updated, openQA will fail to
|
||||
schedule tests for them. Once the metadata is updated and new Rawhide
|
||||
base images are built, check in the Bodhi web UI for any Rawhide
|
||||
updates that have failed gating. Look on the automated tests page for
|
||||
each update. If they show tests as missing (rather than failed),
|
||||
check the openQA web UI and see if you can find any tests for the
|
||||
update. If not, you will need to trigger the tests for that update by
|
||||
running:
|
||||
```
|
||||
fedora-openqa update -f (UPDATE ID)
|
||||
```
|
||||
from the openQA server.
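
If several updates were missed, it may help to prepare the commands in one go. This is a minimal sketch that only builds the command lines shown above; the update IDs are placeholders, and the function name is hypothetical.

```python
import shlex


def retrigger_commands(update_ids):
    """Build one `fedora-openqa update -f` command line per update ID."""
    commands = []
    for update_id in update_ids:
        update_id = update_id.strip()
        if not update_id:
            continue  # skip blank entries from a pasted list
        commands.append(shlex.join(["fedora-openqa", "update", "-f", update_id]))
    return commands


# Placeholder IDs for illustration only; substitute real Bodhi update IDs.
pasted = ["FEDORA-2025-0123456789", "", " FEDORA-2025-abcdef0123 "]
for command in retrigger_commands(pasted):
    print(command)
```

Printing the commands rather than running them leaves room to review the list against the Bodhi and openQA web UIs first.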
|
||||
|
||||
=== Creating new version identification needles
|
||||
|
||||
The installer tests have a check that the installer shows the correct
|
||||
release number. Needles for the new Rawhide release number will not
|
||||
yet exist at the time of branching. The first time install tests run
|
||||
for the new Rawhide release number and reach the point where this
|
||||
check happens, they will fail looking for a needle with the tag
|
||||
`version_NN_ident`, where `NN` is the new Rawhide release number.
|
||||
On one of the openQA instances, use the web UI needle editor to create
|
||||
a new needle with the correct match area and tags (reference the
|
||||
existing needles for the previous release https://pagure.io/fedora-qa/os-autoinst-distri-fedora/blob/main/f/needles/anaconda/identification[here]). You will
|
||||
need to create two needles, one for the GTK UI and one for the web UI,
|
||||
so long as we test images with both installer UIs. Creating a needle
|
||||
adds a .json and a .png file in the
|
||||
`/var/lib/openqa/share/tests/fedora/needles` directory. These files
|
||||
need to be copied out and checked into the os-autoinst-distri-fedora
|
||||
git repo, in the `anaconda/identification` folder, then pushed back
|
||||
to both openQA instances, after which the 'working copy' in the
top-level `needles` directory can be removed.
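
Before committing a new identification needle, it can be worth confirming that its metadata carries the tag the test looks for. The sketch below assumes the needle `.json` file holds a top-level `tags` list, which is the usual openQA needle layout but is not stated in this document; the function names are hypothetical.

```python
def version_ident_tag(release: int) -> str:
    """Return the tag the installer version check looks for."""
    return f"version_{release}_ident"


def has_version_tag(needle: dict, release: int) -> bool:
    """Check a parsed needle .json for the expected version tag."""
    # Assumption: tags live in a top-level "tags" list.
    return version_ident_tag(release) in needle.get("tags", [])


# Stand-in needle metadata for illustration only.
needle = {"tags": ["version_44_ident"], "area": []}
print(has_version_tag(needle, 44), has_version_tag(needle, 43))
```

Run the check once for each of the two needles (GTK UI and web UI), after loading each `.json` file with `json.load`.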
|
||||
|
||||
=== Disabling the desktop_background test
|
||||
|
||||
The desktop_background test will initially fail for all updates for
|
||||
the newly-branched release. If the release already contains a new,
|
||||
unique background (different from that of the current stable release)
|
||||
you can create a new needle for it, following much the same process
|
||||
as for the version identification needles. Otherwise, the test must be
|
||||
disabled until the new background is ready.
|
||||
|
||||
Here is https://pagure.io/fedora-qa/os-autoinst-distri-fedora/c/f8810b67b4fc461d1e060c0c9449991c6b18b68d?branch=main[a sample commit] that disabled the test for
|
||||
Fedora 42. You can just follow that example with a new commit, push it
|
||||
out, and pull it to both openQA instances. Then re-run all failed
|
||||
instances of the test.
|
||||
|
||||
=== Restarting failures
|
||||
|
||||
The update gating configuration is updated during the main releng
|
||||
branch SOP, so update gating will be active for the newly-branched
|
||||
release and for Rawhide under its new release number almost
|
||||
immediately. It is therefore critical that we ensure all failed update
|
||||
tests are re-run until they pass or the failure is deemed 'genuine'
|
||||
(i.e. not due to the branching process, but a real bug in the update).
|
||||
|
||||
Throughout the branching process, constantly keep an eye on the web UI
|
||||
summary page for the Fedora Updates and Fedora AArch64 Updates groups.
|
||||
Also keep the web UI detail pages for one new-Rawhide and one
|
||||
new-Branched update open, and retry failures whenever you think their
cause may have been addressed.
|
||||
|
||||
Once you are sure tests are generally working for both new-Rawhide and
|
||||
new-Branched, systematically go through and restart all failed tests.
|
||||
Some of the restarts may fail (due to normal flakiness, or the heavy
|
||||
load of running so many tests at once) - keep an eye on these, and
|
||||
keep up the restarts until they pass or the failure appears 'genuine'.
|
||||
|
||||
If tests are failing and the cause looks like something to do with the
|
||||
branching process - classic symptoms are RPM signature errors or 404s
|
||||
from the mirror system - contact the release engineering team to get
|
||||
these rectified, and retry once they tell you the issue is addressed.
|
||||
|
||||
Beware of tests with the wrong `RAWREL` value. This variable records
|
||||
the Rawhide release number. When running tests after branching it
|
||||
should always be the new Rawhide release number; running tests after
|
||||
branching with it set to the old number can cause various failures.
|
||||
It's common for some tests scheduled around branching to fail and have
|
||||
the old `RAWREL` value; you will find you cannot get these tests to
|
||||
pass with regular restarts. Always check the `RAWREL` value of failed
|
||||
tests, and if it's wrong, instead of just restarting the test through
|
||||
the web UI, retrigger the tests for that update from the server:
|
||||
```
|
||||
fedora-openqa update -f (UPDATE ID)
|
||||
```
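
The restart-versus-retrigger decision can be sketched as a small helper. It assumes you have the failed job's settings as a dictionary with `RAWREL` stored as a string, which is how the openQA web UI displays settings; the function name and return labels are hypothetical.

```python
def next_step(job_settings: dict, new_rawrel: int) -> str:
    """Decide how to re-run a failed test after branching."""
    rawrel = str(job_settings.get("RAWREL", ""))
    if rawrel != str(new_rawrel):
        # A plain restart would reuse the stale value, so retrigger
        # the update from the server instead.
        return "retrigger"
    return "restart"


# Branching Fedora 43: the new Rawhide release number is 44.
print(next_step({"RAWREL": "43"}, 44))
print(next_step({"RAWREL": "44"}, 44))
```

A job with no `RAWREL` setting at all is treated as stale in this sketch, which errs on the side of retriggering.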
|
||||
|
||||
Your end goal, as always, is for all outstanding failures to be
|
||||
definitely identified as genuine bugs in the update, with a comment
|
||||
linking to a Bodhi comment or bug report that identifies the issue.
|
||||
|
||||
== Rebooting / restarting
|
||||
|
||||
The optimal approach to rebooting an entire openQA deployment is as
|
||||
|
@ -468,12 +653,3 @@ but is run on the openQA servers as it seems like as good a place as any
|
|||
to do it. As with all other message consumers, if making manual changes
|
||||
or updates to the components, remember to restart the consumer service
|
||||
afterwards.
|
||||
|
||||
== Autocloud ResultsDB forwarder (autocloudreporter)
|
||||
|
||||
An ansible role called `autocloudreporter` also runs on the openQA
|
||||
production server. This has nothing to do with openQA at all, but is run
|
||||
there for convenience. This role deploys a fedmsg consumer that listens
|
||||
for fedmsgs indicating that Autocloud (a separate automated test system
|
||||
which tests cloud images) has completed a test run, then forwards those
|
||||
results to ResultsDB.
|
||||
|
|