QA SysAdmin SOP Refresh

František Zatloukal 2024-11-27 11:18:18 +01:00
parent a833c0f052
commit 886018403d
3 changed files with 273 additions and 112 deletions


@@ -10,9 +10,8 @@ freeze exception bugs in branched Fedora releases.
* <<_file_locations>>
* <<_building_for_infra>>
* <<_upgrading>>
** <<_upgrade_preparation_all_upgrades>>
** <<_minor_upgrades_no_database_changes>>
** <<_major_upgrades_with_database_changes>>
* <<_watchdog>>
* <<_sync>>
== Contact Information
@@ -20,137 +19,61 @@ Owner::
Fedora QA Devel
Contact::
#fedora-qa
Location::
iad2
Persons::
jskladan, kparal
Servers::
blockerbugs01.iad2, blockerbugs02.iad2, blockerbugs01.stg.iad2
* In OpenShift.
Purpose::
Hosting the https://pagure.io/fedora-qa/blockerbugs[blocker bug
tracking application] for QA
== File Locations
`/etc/blockerbugs/settings.py` - configuration for the app
`blockerbugs/cli.py` - CLI for the app
=== Node Roles
== Configuration
blockerbugs01.stg.iad2::
the staging instance; it is not load balanced
blockerbugs01.iad2::
one of the load-balanced production nodes; it is responsible for
running the bugzilla/bodhi/koji sync
blockerbugs02.iad2::
the other load-balanced production node; it does not do any sync
operations
Configuration is loaded from the environment in the pod. The default configuration is
set in the playbook: `roles/openshift-apps/blockerbugs/templates/deploymentconfig.yml`.
The configuration values that can be set are listed in `blockerbugs/config.py` inside
the `openshift_config` function. Apart from that, secrets, tokens, and API keys
are set in the secrets Ansible repository.
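For orientation, an env-driven `openshift_config` typically looks something like the sketch below; the environment variable names here are hypothetical placeholders, the real keys live in `blockerbugs/config.py`:
[source,python]
----
# Hypothetical sketch of env-driven configuration; the real keys live in
# blockerbugs/config.py inside the openshift_config() function.
import os

def openshift_config(config):
    # values come from the pod environment, set via the deploymentconfig template
    config["SQLALCHEMY_DATABASE_URI"] = os.environ["DATABASE_URL"]      # hypothetical variable name
    config["SECRET_KEY"] = os.environ["SECRET_KEY"]                     # comes from the secrets repo
    config["BUGZILLA_API_KEY"] = os.environ.get("BUGZILLA_API_KEY")     # hypothetical variable name
----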
== Building for Infra
=== Do not use mock
For whatever reason, the `epel7-infra` koji tag rejects SRPMs with the
`el7.centos` dist tag. Make sure that you build SRPMs with:
....
rpmbuild -bs --define='dist .el7' blockerbugs.spec
....
Also note that this expects the release tarball to be in
`~/rpmbuild/SOURCES/`.
=== Building with Koji
You'll need to ask someone who has rights to build into the `epel7-infra`
tag to make the build for you:
....
koji build epel7-infra blockerbugs-0.4.4.11-1.el7.src.rpm
....
[NOTE]
====
The fun bit of this is that `python-flask` is only available on `x86_64`
builders. If your build is routed to one of the non-x86_64 builders, it will
fail. The only solution available to us is to keep submitting the build
until it's routed to one of the x86_64 builders and doesn't fail.
====
Once the build is complete, it should be automatically tagged into
`epel7-infra-stg` (after a ~15 min delay), so that you can test it on the
blockerbugs staging instance. Once you've verified it's working well,
ask someone with infra rights to move it to the `epel7-infra` tag so that
you can update it in production.
The application leverages s2i containers. The production instance
tracks the `master` branch of the blockerbugs repository, the staging instance
tracks the `develop` branch. Builds don't happen automatically, but need
to be triggered manually from the OpenShift web console.
== Upgrading
Blockerbugs is currently configured through Ansible, and all
configuration changes need to be made through Ansible.
=== Upgrade Preparation (all upgrades)
The pod initialization is set up so that all database upgrades
happen automatically on startup. That means extra care is needed,
and all deployments that make database changes need to happen on stg first.
Blockerbugs is not packaged in EPEL, so the new build needs to exist in
the infrastructure stg repo for deployments to stg, or in the infrastructure
repo for deployments to production.
== Deployment WatchDog
See the blockerbugs documentation for instructions on building a
blockerbugs RPM.
The deployment is configured to perform automatic liveness testing.
The first phase runs `cli.py upgrade_db`, and the second
phase consists of the cluster trying to get an HTTP response
from the container on port `8080` in the pod.
=== Minor Upgrades (no database changes)
If any of these fail, the cluster automatically reverts
to the previous build; such a failure can be seen on the `Events` tab
in the DeploymentConfig details.
Run the following on *both* `blockerbugs01.iad2` and
`blockerbugs02.iad2` if updating in production.
Apart from that, the cluster regularly polls the pod
for liveness testing. If that fails or times out, a pod restart occurs.
Such an event can be seen in the `Events` tab of the DeploymentConfig.
[arabic]
. Update ansible with config changes, push changes to the ansible repo:
+
....
roles/blockerbugs/templates/blockerbugs-settings.py.j2
....
. Clear yum cache and update the blockerbugs RPM:
+
....
yum clean expire-cache && yum update blockerbugs
....
. Restart httpd to reload the application:
+
....
service httpd restart
....
== Periodic sync
=== Major Upgrades (with database changes)
The Blockerbugs app deployment consists of two pods. One serves as both the backend and
the frontend; the other is spawned every 30 minutes to execute `cli.py sync`.
This synchronizes the data from bugzilla and pagure into the blockerbugs db.
Run the following on *both* `blockerbugs01.phx2` and
`blockerbugs02.phx2` if updating in production.
[arabic]
. Update ansible with config changes, push changes to the ansible repo:
+
....
roles/blockerbugs/templates/blockerbugs-settings.py.j2
....
. Stop httpd on *all* relevant instances (if load balanced):
+
....
service httpd stop
....
. Clear yum cache and update the blockerbugs RPM on all relevant
instances:
+
....
yum clean expire-cache && yum update blockerbugs
....
. Upgrade the database schema:
+
....
blockerbugs upgrade_db
....
. Check the upgrade by running a manual sync to make sure that nothing
unexpected went wrong:
+
....
blockerbugs sync
....
. Start httpd back up:
+
....
service httpd start
....


@@ -0,0 +1,139 @@
= oraculum Infrastructure SOP
https://pagure.io/fedora-qa/oraculum[oraculum] is an app developed
by Fedora QA to aid packagers with maintenance and quality
in Fedora and EPEL releases.
As such, it serves as the backend for Packager Dashboard,
testcloud, Fedora Easy Karma, and Pagure dist-git (versions table).
== Contents
* <<_contact_information>>
* <<_file_locations>>
* <<_building_for_infra>>
* <<_upgrading>>
* <<_watchdog>>
* <<_components>>
== Contact Information
Owner::
Fedora QA Devel
Contact::
#fedora-qa
Persons::
jskladan, lbrabec
Servers::
* In OpenShift.
Purpose::
Hosting the https://pagure.io/fedora-qa/oraculum[oraculum] for packagers
== File Locations
`oraculum/cli.py` - CLI for the app
`oraculum/cli.py debug` - interactive debug interface for the app
== Configuration
Configuration is loaded from the environment in the pod. The default configuration is
set in the playbook: `roles/openshift-apps/oraculum/templates/deploymentconfig.yml`. Remember that the configuration needs
to be changed for each of the various pods (described later).
The configuration values that can be set are listed in `oraculum/config.py` inside
the `openshift_config` function. Apart from that, secrets, tokens, and API keys
are set in the secrets Ansible repository.
== Building for Infra
The application leverages s2i containers. Both the production
and staging instances track the `master` branch of the oraculum
repository. Builds don't happen automatically, but need
to be triggered manually from the OpenShift web console.
== Upgrading
Oraculum is currently configured through Ansible, and all
configuration changes need to be made through Ansible.
The pod initialization is set up so that all database upgrades
happen automatically on startup. That means extra care is needed,
and all deployments that make database changes need to happen on stg first.
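For orientation, the startup behaviour roughly amounts to the sketch below (illustrative only; the WSGI module path passed to gunicorn is a hypothetical placeholder, the real entrypoint is defined in the deployment config template):
[source,python]
----
# Illustrative sketch of the pod start-up sequence, not the actual entrypoint.
# "oraculum.app:app" is a hypothetical placeholder for the real WSGI module path.
import subprocess

# first: apply any pending database migrations
subprocess.run(["python", "oraculum/cli.py", "upgrade_db"], check=True)

# then: serve the API on port 8080 via gunicorn (see oraculum-api-endpoint below)
subprocess.run(["gunicorn", "--bind", "0.0.0.0:8080", "oraculum.app:app"], check=True)
----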
== Deployment WatchDog
The deployment is configured to perform automatic liveness testing.
The first phase runs `cli.py upgrade_db`, and the second
phase consists of the cluster trying to get an HTTP response
from the container on port `8080` on the `oraculum-api-endpoint` pod.
If any of these fail, the cluster automatically reverts
to the previous build; such a failure can be seen on the `Events` tab
in the DeploymentConfig details.
Apart from that, the cluster regularly polls the `oraculum-api-endpoint` pod
for liveness testing. If that fails or times out, a pod restart occurs.
Such an event can be seen in the `Events` tab of the DeploymentConfig.
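To manually verify the same condition the cluster probes for, you can query the API endpoint from inside the pod; a minimal sketch (the root path `/` is an assumption, any route returning a successful response would do):
[source,python]
----
# Minimal check mirroring the cluster's HTTP liveness probe on port 8080.
# The root path "/" is an assumption; any route returning 2xx works.
from urllib.request import urlopen

with urlopen("http://127.0.0.1:8080/", timeout=5) as resp:
    print(resp.status)  # a 2xx status means the api endpoint is answering
----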
== Cache clearing
oraculum doesn't handle any garbage collection in the cache. In some
situations, such as stale data in the cache (for example when
bugzilla data doesn't refresh due to bugs or optimization choices)
or a db cache that has grown too large, it can be beneficial or even necessary to clear the cache completely. That can be done by deleting all rows from the `cached_data` table:
`DELETE FROM cached_data;`
After that, to minimize downtime, it's recommended to manually re-sync
the generic providers via `CACHE._refresh`, in the following order
(in the pod terminal, via the debug shell):
[source,python]
----
# start the interactive debug shell in the pod terminal first:
#   python oraculum/cli.py debug
# then, inside the shell, refresh the generic providers in this order:
CACHE._refresh("fedora_releases")
CACHE._refresh("bodhi_updates")
CACHE._refresh("bodhi_overrides")
CACHE._refresh("package_versions_generic")
CACHE._refresh("pagure_groups")
CACHE._refresh("koschei_data")
CACHE._refresh("packages_owners_json")
----
and finally build up the static package caches manually via:
`oraculum.utils.celery_utils.celery_sync_static_package_caches()`
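A minimal sketch of that call from the same interactive debug shell (assuming the `oraculum` package is importable there, as it already is for the `CACHE._refresh` calls above):
[source,python]
----
# Rebuild the static package caches manually, run inside the debug shell.
from oraculum.utils import celery_utils

celery_utils.celery_sync_static_package_caches()
----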
For a more lightweight cleanup, removing just the PR, bug,
and abrt caches can do the trick:
`DELETE FROM cached_data WHERE provider LIKE 'packager-dashboard__all_package_bugs%';`
`DELETE FROM cached_data WHERE provider LIKE 'packager_dashboard_package_prs%';`
`DELETE FROM cached_data WHERE provider LIKE 'packager-dashboard_abrt_issues%';`
== Components of Deployment
The Oraculum deployment consists of several pods that run together.
=== oraculum-api-endpoint
Provides the API response rendering endpoint.
It runs via gunicorn in multiple threads.
=== oraculum-worker
Periodic and ad-hoc sync requests are processed by these workers,
managed via celery. The pods are replicated, and each pod spawns 4 workers.
=== oraculum-beat
Sends periodic sync requests to the workers.
=== oraculum-flower
Provides an overview of the celery/worker queues via HTTP.
The current state of the worker load can be seen in https://packager-dashboard.fedoraproject.org/_flower/[Flower].
=== oraculum-redis
Provides a deployment-local redis instance.


@@ -0,0 +1,99 @@
= testdays Infrastructure SOP
https://pagure.io/fedora-qa/testdays-web/[testdays] is an app developed
by Fedora QA to aid with managing testday events for the community.
== Contents
* <<_contact_information>>
* <<_file_locations>>
* <<_building_for_infra>>
* <<_upgrading>>
* <<_watchdog>>
* <<_components>>
== Contact Information
Owner::
Fedora QA Devel
Contact::
#fedora-qa
Persons::
jskladan, smukher
Servers::
* In OpenShift.
Purpose::
Hosting the https://pagure.io/fedora-qa/testdays-web/[testdays] app for QA and the community
== File Locations
`testdays/cli.py` - CLI for the app
`resultsdb/cli.py` - CLI for ResultsDB
== Configuration
Configuration is loaded from the environment in the pod. The default configuration is
set in the playbook: `roles/openshift-apps/testdays/templates/deploymentconfig.yml`. Remember that the configuration needs
to be changed for both pods (testdays and resultsdb).
The configuration values that can be set are listed in `testdays/config.py` and
`resultsdb/config.py` inside the `openshift_config` function.
Apart from that, secrets, tokens, and API keys are set
in the secrets Ansible repository.
== Building for Infra
The application leverages s2i containers. Both the production
and staging instances track the `master`
branch of the testdays-web repository; the resultsdb instance
tracks the `legacy_testdays` branch on both prod and stg.
Builds don't happen automatically, but need
to be triggered manually from the OpenShift web console.
== Upgrading
Testdays is currently configured through Ansible, and all
configuration changes need to be made through Ansible.
The pod initialization is set up so that all database upgrades
happen automatically on startup. That means extra care is needed,
and all deployments that make database changes need to happen on stg first.
== Deployment sanity test
The deployment is configured to perform automatic sanity testing.
The first phase runs `cli.py upgrade_db`, and the second
phase consists of the cluster trying to get an HTTP response
from the container on port `8080` on the `testdays` pod.
If any of these fail, the cluster automatically reverts
to the previous build; such a failure can be seen on the `Events` tab
in the DeploymentConfig details.
== Deployment WatchDog
The deployment is configured to perform automatic liveness testing.
The first phase runs `cli.py upgrade_db`, and the second
phase consists of the cluster trying to get an HTTP response
from the container on port `8080` on the `testdays` and `resultsdb` pods.
If any of these fail, the cluster automatically reverts
to the previous build; such a failure can be seen on the `Events` tab
in the DeploymentConfig details.
Apart from that, the cluster regularly polls the `testdays` and `resultsdb` pods
for liveness testing. If that fails or times out, a pod restart occurs.
Such an event can be seen in the `Events` tab of the DeploymentConfig.
== Components of Deployment
=== Testdays
The base testdays app, which provides both the backend and the frontend
inside a single deployment.
=== ResultsDB
A forked state of the upstream ResultsDB that has OpenShift changes
applied on top of it, without pulling in the other changes that
are in the upstream branch. Available on https://pagure.io/taskotron/resultsdb/tree/legacy_testdays[Pagure].