2 changed files with 215 additions and 13 deletions
--- a/modules/release_guide/pages/sop_mass_branching.adoc
+++ b/modules/release_guide/pages/sop_mass_branching.adoc
@@ -4,12 +4,20 @@ include::_partials/attributes.adoc[]

 == Description

-At each alpha freeze we branch the pending release away from `devel/`
+At a certain point in each cycle, we branch the pending release away from rawhide,
 which allows rawhide (currently F{rawhide}) to move on while the pending release goes into
 bugfix and polish mode.

 You will find below the list of steps to follow to branch a new Fedora release.

+== Co-ordinate with Quality and CI teams for gating
+
+Branching is a disruptive event for the openQA and Fedora CI systems.
+Before starting the branching process, contact the Quality and CI teams to alert them.
+Ideally, someone from the Quality team should follow along with the branching process in real time.
+The "Branching procedure" in the xref:sysadmin_guide:openqa.adoc[openQA infrastructure SOP] must be carried out alongside this SOP.
+If this is not done, there is a high chance that updates for the new branched release and Rawhide will fail gating due to unresolved test failures.
+
 == Mass resigning

 When we branch off of rawhide, the branched release packages are already signed by
@@ -190,6 +198,24 @@ infra ansible repo. This change includes, updating `koji-sync-listener.py`,

 Please check these files from the https://pagure.io/fedora-infra/ansible/c/549e5d3ace41c04fdbef9d81f359f16c2fe0c2fa?branch=main[commit] for your reference.

+== Fedfind metadata
+
+The https://fedorapeople.org/groups/qa/metadata/release.json[metadata file] used by https://pagure.io/fedora-qa/fedfind[fedfind] needs to be updated for the new branch.
+The file is in the QA team's fedorapeople space. Log in to `fedorapeople.org` and edit it at `/srv/groups/qa/metadata/release.json`.
+It should be changed right around the time the new release is added to Bodhi.
+When you open it, the `branched` array should be empty.
+Make the new Branched release number the only entry in this array.
+So when branching Fedora 43, the line should change from:
+```
+        "branched": [],
+```
+to:
+```
+        "branched": [43],
+```
+This is the only change required.
+If you do not have privileges to edit the file, ask in the Fedora Quality chat for someone to edit it.
+
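After saving the edit, a quick check can confirm that the file still parses and that the array holds exactly the new release number. The following is a minimal sketch only: it assumes nothing about the file beyond it being JSON with a `branched` key somewhere inside nested objects, and the function names are hypothetical.

```python
import json


def find_branched(node):
    """Return the first list stored under a "branched" key in nested dicts, or None."""
    if isinstance(node, dict):
        if isinstance(node.get("branched"), list):
            return node["branched"]
        for value in node.values():
            found = find_branched(value)
            if found is not None:
                return found
    return None


def branched_is_only(metadata_text, release):
    """True if the "branched" array contains exactly the given release number."""
    # json.loads raises ValueError if the edit left the file malformed.
    return find_branched(json.loads(metadata_text)) == [release]
```

To use it, pass in the text of `/srv/groups/qa/metadata/release.json` and the new Branched number, for example `branched_is_only(open(path).read(), 43)`.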
 == Toddlers

 === Add new SLA to the toddlers App
--- a/modules/sysadmin_guide/pages/openqa.adoc
+++ b/modules/sysadmin_guide/pages/openqa.adoc
@@ -115,15 +115,200 @@ should be scheduled and run automatically when new composes and critical
 path updates appear, and results should be reported to ResultsDB and
 Wikitcms (when appropriate). Dynamically generated assets should be
 regenerated regularly, including across release boundaries (see the
-section on createhdds below): no manual intervention should be required
-when a new Fedora release appears. If any of this does not happen,
-something is wrong, and manual inspection is needed.
+section on createhdds below). If any of this does not happen, something
+is wrong, and manual inspection is needed. However, at branching (when
+a new Fedora release branches from Rawhide), some manual intervention
+is usually required to ensure the smoothest possible transition. See
+<<Branching procedure>> below.

 Our usual practice is to upgrade the openQA systems to new Fedora
 releases promptly as they appear, using `dnf system-upgrade`. This is
 done manually. We usually upgrade the staging instance first and watch
 for problems for a week or two before upgrading production.

+== Updating 'needles'
+
+Needles are the 'magic screenshots' openQA uses for testing. A needle
+consists of two files - `somefile.png` (the screenshot itself), and
+`somefile.json` (the metadata). The names must match. Needles are
+usually created using the openQA web UI. This will create the two
+files in the `/var/lib/openqa/share/tests/fedora/needles` directory
+on the server. In Fedora openQA we do not leave them there; we keep
+the needles in a https://pagure.io/fedora-qa/os-autoinst-distri-fedora[git repository]. After creating a
+needle, copy both files out to a local checkout of that repository,
+place them in the appropriate subdirectory (we organize our needles
+into subdirectories), commit them, push the commit, then update the
+checkout on the server and remove the "working copy" of the needle
+from the top-level `needles` directory. Also remember to update the
+checkout on the other instance. If the lab instance is on a
+different branch and you need it to have the new needle, rebase that
+branch on the updated `main` branch and force-push it back (but of
+course, make sure your local checkout of the feature branch is fully
+up to date before rebasing and force pushing).
+
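Because a needle is only usable when its `.png` and `.json` files share a name, it can help to check for stray halves before committing or cleaning up working copies. The following is a minimal sketch, not part of the openQA tooling; the function name is hypothetical and it looks only at file names, not at needle contents.

```python
from pathlib import PurePath


def unpaired_needles(filenames):
    """Return sorted needle names that have a .png or a .json file, but not both."""
    pngs, jsons = set(), set()
    for name in filenames:
        path = PurePath(name)
        if path.suffix == ".png":
            pngs.add(path.stem)
        elif path.suffix == ".json":
            jsons.add(path.stem)
        # Files with any other extension are ignored.
    # Symmetric difference: names present in exactly one of the two sets.
    return sorted(pngs ^ jsons)
```

For a directory on disk, the names can be supplied with something like `unpaired_needles(p.name for p in pathlib.Path(needle_dir).iterdir())`.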
+== Branching procedure
+
+Branching is a disruptive time for openQA operation. Since openQA
+associates a release number with Rawhide, two things change from its
+perspective during branching: the release number associated with
+Rawhide changes, and the release number formerly associated with
+Rawhide is now 'taken over' by the new branched release. As the
+branching process takes some time, and there is no "perfect" point
+at which the transition can be done smoothly, it is normal that
+update tests for both Rawhide and the new branched release will fail
+for some hours around the branching process. The best we can do is to
+mitigate this as far as possible.
+
+openQA's behavior around branching will depend on when fedfind's
+https://fedorapeople.org/groups/qa/metadata/release.json[release metadata] is updated. Until that is updated, openQA will
+continue to believe that Rawhide "owns" the "old" release number: the
+`RAWREL` variable will be set to that number, and tests of updates for
+that release number will behave as if it is Rawhide. If updates with
+the new release number are created before this metadata is updated,
+the openQA scheduler will be confused by them and ignore them.
+
+Once the metadata is updated, openQA will act as if branching has
+happened: tests will be scheduled for updates with the "new" number,
+and tests for updates with the "old" number will treat that release
+as Branched, not Rawhide.
+
+The key tasks to make Branching go as smoothly as possible in openQA
+are:
+
+* Get the fedfind metadata updated as close as possible to 'the right'
+time, which should be just before the first update for the new number
+is created in Bodhi. This task is part of the xref:release_guide:sop_mass_branching.adoc[Mass Branching SOP],
+but releng may contact us to do the edit if they don't have permissions
+* Build base disk images for the new release number as soon as the
+metadata is updated and a post-branching Rawhide compose exists
+* Rebuild base disk images for the old release number as soon as the
+first post-branching Branched compose exists
+* Trigger tests for any Rawhide updates for which they were missed
+* Create version identification needles for the new release number
+as soon as possible
+* Disable the desktop_background test for the Branched release if a new
+background image does not yet exist
+* Aggressively monitor and retry failures
+
+=== Rebuilding base disk images
+
+Base disk images can only be rebuilt successfully once a post-branch
+compose for the release has completed and synced to https://dl.fedoraproject.org/pub/fedora/linux/development/[dl.fedoraproject.org].
+Stay in touch with the release engineering team and monitor the chat
+channel to keep up with this process. Once you have verified that a
+post-branch Rawhide compose has synced to https://dl.fedoraproject.org/pub/fedora/linux/development/rawhide/[the rawhide repository],
+build new base images for Rawhide. Once you have verified that the
+first Branched compose has synced to the numbered directory under
+the development directory, rebuild base images for that release (most
+will already exist, but will be pre-branch Rawhide images, which will
+likely cause issues in the tests).
+
+To (re)build images, log into the openQA worker hosts tasked with disk
+image builds for each arch on each openQA instance. These are listed
+in the `openqa_hdds_workers` group in the ansible inventory. Become
+root, then go to the correct directory, and run the command to rebuild
+all images for a release, where `NN` is that release's number:
+```
+cd /var/lib/openqa/share/factory/hdd/fixed
+/root/createhdds/createhdds.py all -r NN -f
+```
+So when branching Fedora 43 from Rawhide, we would pass `-r 44` to
+build the new Rawhide base disk images, and `-r 43` to rebuild the
+43 base disk images with the new Branched compose.
+
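The mapping from the branching release to the two `-r` values can be sketched as follows. This is an illustration only, assuming (as in the Fedora 43 example) that the new Rawhide number is one higher than the Branched number; the function name is hypothetical, and the command lines are built but not run.

```python
CREATEHDDS = "/root/createhdds/createhdds.py"


def createhdds_commands(branched_release):
    """Return the two createhdds invocations, as argv lists, for a branching event."""
    new_rawhide = branched_release + 1  # assumption: Rawhide moves up by one
    return [
        # Run once a post-branch Rawhide compose has synced.
        [CREATEHDDS, "all", "-r", str(new_rawhide), "-f"],
        # Run once the first Branched compose has synced.
        [CREATEHDDS, "all", "-r", str(branched_release), "-f"],
    ]
```

The two commands are not meant to run back to back: each waits on its own compose, and each is run from `/var/lib/openqa/share/factory/hdd/fixed` on the relevant worker host.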
+=== Triggering missed Rawhide tests
+
+If any critical path Rawhide updates are created under the new release
+number before the fedfind metadata is updated, openQA will fail to
+schedule tests for them. Once the metadata is updated and new Rawhide
+base images are built, check in the Bodhi web UI for any Rawhide
+updates that have failed gating. Look on the automated tests page for
+each update. If they show tests as missing (rather than failed),
+check the openQA web UI and see if you can find any tests for the
+update. If not, you will need to trigger the tests for that update by
+running:
+```
+fedora-openqa update -f (UPDATE ID)
+```
+from the openQA server.
+
+=== Creating new version identification needles
+
+The installer tests have a check that the installer shows the correct
+release number. Needles for the new Rawhide release number will not
+yet exist at the time of branching. The first time install tests run
+for the new Rawhide release number and reach the point where this
+check happens, they will fail looking for a needle with the tag
+`version_NN_ident`, where `NN` is the new Rawhide release number.
+On one of the openQA instances, use the web UI needle editor to create
+a new needle with the correct match area and tags (reference the
+existing needles for the previous release https://pagure.io/fedora-qa/os-autoinst-distri-fedora/blob/main/f/needles/anaconda/identification[here]). You will
+need to create two needles, one for the GTK UI and one for the web UI,
+so long as we test images with both installer UIs. Creating a needle
+adds a .json and a .png file in the
+`/var/lib/openqa/share/tests/fedora/needles` directory. These files
+need to be copied out and checked into the os-autoinst-distri-fedora
+git repo, in the `anaconda/identification` folder, then pushed back
+to both openQA instances, after which the 'working copy' in the
+top-level `needles` directory can be removed.
+
+=== Disabling the desktop_background test
+
+The desktop_background test will initially fail for all updates for
+the newly-branched release. If the release already contains a new,
+unique background (different from that of the current stable release)
+you can create a new needle for it, following much the same process
+as for the version identification needles. Otherwise, the test must be
+disabled until the new background is ready.
+
+Here is https://pagure.io/fedora-qa/os-autoinst-distri-fedora/c/f8810b67b4fc461d1e060c0c9449991c6b18b68d?branch=main[a sample commit] that disabled the test for
+Fedora 42. You can just follow that example with a new commit, push it
+out, and pull it to both openQA instances. Then re-run all failed
+instances of the test.
+
+=== Restarting failures
+
+The update gating configuration is updated during the main releng
+branch SOP, so update gating will be active for the newly-branched
+release and for Rawhide under its new release number almost
+immediately. It is therefore critical that we ensure all failed update
+tests are re-run until they pass or the failure is deemed 'genuine'
+(i.e. not due to the branching process, but a real bug in the update).
+
+Throughout the branching process, keep a constant eye on the web UI
+summary page for the Fedora Updates and Fedora AArch64 Updates groups.
+Also keep the web UI detail pages for one new-Rawhide and one
+new-Branched update open, and retry failures whenever you think their
+cause may have been addressed.
+
+Once you are sure tests are generally working for both new-Rawhide and
+new-Branched, systematically go through and restart all failed tests.
+Some of the restarts may fail (due to normal flakiness, or the heavy
+load of running so many tests at once) - keep an eye on these, and
+keep up the restarts until they pass or the failure appears 'genuine'.
+
+If tests are failing and the cause looks like something to do with the
+branching process - classic symptoms are RPM signature errors or 404s
+from the mirror system - contact the release engineering team to get
+these rectified, and retry once they tell you the issue is addressed.
+
+Beware of tests with the wrong `RAWREL` value. This variable records
+the Rawhide release number. When running tests after branching it
+should always be the new Rawhide release number; running tests after
+branching with it set to the old number can cause various failures.
+It's common for some tests scheduled around branching to fail and have
+the old `RAWREL` value; you will find you cannot get these tests to
+pass with regular restarts. Always check the `RAWREL` value of failed
+tests, and if it's wrong, instead of just restarting the test through
+the web UI, retrigger the tests for that update from the server:
+```
+fedora-openqa update -f (UPDATE ID)
+```
+
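The `RAWREL` check can be sketched as a filter over the settings of failed jobs. This is a minimal illustration, not existing tooling: the function names are hypothetical, the jobs are plain dictionaries of settings that you would have to collect yourself, and the `ADVISORY_OR_TASK` key used to find the update ID is an assumption that should be checked against real job settings.

```python
def updates_needing_retrigger(failed_jobs, new_rawhide, update_key="ADVISORY_OR_TASK"):
    """Return sorted update IDs whose failed jobs carry a stale RAWREL value."""
    stale = set()
    for settings in failed_jobs:
        rawrel = settings.get("RAWREL")
        update_id = settings.get(update_key)
        # Skip jobs that do not record RAWREL or an update ID at all.
        if rawrel is None or not update_id:
            continue
        # Compare as strings so 44 and "44" are treated the same.
        if str(rawrel) != str(new_rawhide):
            stale.add(update_id)
    return sorted(stale)


def retrigger_commands(failed_jobs, new_rawhide):
    """Build one `fedora-openqa update -f` argv list per stale update."""
    return [
        ["fedora-openqa", "update", "-f", update_id]
        for update_id in updates_needing_retrigger(failed_jobs, new_rawhide)
    ]
```

Each update appears once however many of its jobs failed, since retriggering from the server reschedules the tests for the whole update.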
+Your end goal, as always, is for all outstanding failures to be
+definitively identified as genuine bugs in the update, with a comment
+linking to a Bodhi comment or bug report that identifies the issue.
+
 == Rebooting / restarting

 The optimal approach to rebooting an entire openQA deployment is as
@@ -468,12 +653,3 @@ but is run on the openQA servers as it seems like as good a place as any
 to do it. As with all other message consumers, if making manual changes
 or updates to the components, remember to restart the consumer service
 afterwards.
-
-== Autocloud ResultsDB forwarder (autocloudreporter)
-
-An ansible role called `autocloudreporter` also runs on the openQA
-production server. This has nothing to do with openQA at all, but is run
-there for convenience. This role deploys a fedmsg consumer that listens
-for fedmsgs indicating that Autocloud (a separate automated test system
-which tests cloud images) has completed a test run, then forwards those
-results to ResultsDB.