QA SysAdmin SOP Refresh

František Zatloukal 2024-11-27 11:18:18 +01:00
parent a833c0f052
commit 886018403d
3 changed files with 273 additions and 112 deletions


@@ -10,9 +10,8 @@ freeze exception bugs in branched Fedora releases.
* <<_file_locations>>
* <<_building_for_infra>>
* <<_upgrading>>
** <<_upgrade_preparation_all_upgrades>>
** <<_minor_upgrades_no_database_changes>>
** <<_major_upgrades_with_database_changes>>
* <<_watchdog>>
* <<_sync>>
== Contact Information
@@ -20,137 +19,61 @@ Owner::
Fedora QA Devel
Contact::
#fedora-qa
Location::
iad2
Persons::
jskladan, kparal
Servers::
blockerbugs01.iad2, blockerbugs02.iad2, blockerbugs01.stg.iad2
* In OpenShift.
Purpose::
Hosting the https://pagure.io/fedora-qa/blockerbugs[blocker bug
tracking application] for QA
== File Locations
`/etc/blockerbugs/settings.py` - configuration for the app
`blockerbugs/cli.py` - CLI for the app
=== Node Roles
== Configuration
blockerbugs01.stg.iad2::
the staging instance; it is not load balanced
blockerbugs01.iad2::
one of the load-balanced production nodes; it is responsible for
running the bugzilla/bodhi/koji sync
blockerbugs02.iad2::
the other load-balanced production node; it does not do any sync
operations
Configuration is loaded from the environment in the pod. The default configuration is
set in the playbook: `roles/openshift-apps/blockerbugs/templates/deploymentconfig.yml`.
The configuration values that can be set are listed in `blockerbugs/config.py` inside
the `openshift_config` function. Apart from that, secrets, tokens, and API keys
are set in the secrets Ansible repository.
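For orientation, an env-driven `openshift_config` typically looks something like the sketch below; the environment variable names here are hypothetical placeholders, the real keys live in `blockerbugs/config.py`:
[source,python]
----
# Hypothetical sketch of env-driven configuration; the real keys live in
# blockerbugs/config.py inside the openshift_config() function.
import os

def openshift_config(config):
    # values come from the pod environment, set via the deploymentconfig template
    config["SQLALCHEMY_DATABASE_URI"] = os.environ["DATABASE_URL"]      # hypothetical variable name
    config["SECRET_KEY"] = os.environ["SECRET_KEY"]                     # comes from the secrets repo
    config["BUGZILLA_API_KEY"] = os.environ.get("BUGZILLA_API_KEY")     # hypothetical variable name
----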
== Building for Infra
=== Do not use mock
For whatever reason, the `epel7-infra` koji tag rejects SRPMs with the
`el7.centos` dist tag. Make sure that you build SRPMs with:
....
rpmbuild -bs --define='dist .el7' blockerbugs.spec
....
Also note that this expects the release tarball to be in
`~/rpmbuild/SOURCES/`.
=== Building with Koji
You'll need to ask someone who has rights to build into the `epel7-infra`
tag to make the build for you:
....
koji build epel7-infra blockerbugs-0.4.4.11-1.el7.src.rpm
....
[NOTE]
====
The fun bit of this is that `python-flask` is only available on `x86_64`
builders. If your build is routed to one of the non-x86_64 builders, it will
fail. The only solution available to us is to keep submitting the build
until it's routed to one of the x86_64 builders and doesn't fail.
====
Once the build is complete, it should be automatically tagged into
`epel7-infra-stg` (after a ~15 min delay), so that you can test it on the
blockerbugs staging instance. Once you've verified it's working well,
ask someone with infra rights to move it to the `epel7-infra` tag so that
you can update it in production.
The application leverages s2i containers. The production instance
tracks the `master` branch of the blockerbugs repository, the staging instance
tracks the `develop` branch. Builds don't happen automatically, but need
to be triggered manually from the OpenShift web console.
== Upgrading
Blockerbugs is currently configured through Ansible, and all
configuration changes need to be made through Ansible.
=== Upgrade Preparation (all upgrades)
The pod initialization is set up so that all database upgrades
happen automatically on startup. That means extra care is needed,
and all deployments that make database changes need to happen on stg first.
Blockerbugs is not packaged in EPEL, so the new build needs to exist in
the infrastructure stg repo for deployments to stg, or in the infrastructure
repo for deployments to production.
== Deployment WatchDog
See the blockerbugs documentation for instructions on building a
blockerbugs RPM.
The deployment is configured to perform automatic liveness testing.
The first phase runs `cli.py upgrade_db`, and the second
phase consists of the cluster trying to get an HTTP response
from the container on port `8080` in the pod.
=== Minor Upgrades (no database changes)
If any of these fail, the cluster automatically reverts
to the previous build; such a failure can be seen on the `Events` tab
in the DeploymentConfig details.
Run the following on *both* `blockerbugs01.iad2` and
`blockerbugs02.iad2` if updating in production.
Apart from that, the cluster regularly polls the pod
for liveness testing. If that fails or times out, a pod restart occurs.
Such an event can be seen in the `Events` tab of the DeploymentConfig.
[arabic]
. Update ansible with config changes, push changes to the ansible repo:
+
....
roles/blockerbugs/templates/blockerbugs-settings.py.j2
....
. Clear yum cache and update the blockerbugs RPM:
+
....
yum clean expire-cache && yum update blockerbugs
....
. Restart httpd to reload the application:
+
....
service httpd restart
....
== Periodic sync
=== Major Upgrades (with database changes)
The Blockerbugs app deployment consists of two pods. One serves as both the backend and
the frontend; the other is spawned every 30 minutes to execute `cli.py sync`.
This synchronizes the data from bugzilla and pagure into the blockerbugs db.
Run the following on *both* `blockerbugs01.phx2` and
`blockerbugs02.phx2` if updating in production.
[arabic]
. Update ansible with config changes, push changes to the ansible repo:
+
....
roles/blockerbugs/templates/blockerbugs-settings.py.j2
....
. Stop httpd on *all* relevant instances (if load balanced):
+
....
service httpd stop
....
. Clear yum cache and update the blockerbugs RPM on all relevant
instances:
+
....
yum clean expire-cache && yum update blockerbugs
....
. Upgrade the database schema:
+
....
blockerbugs upgrade_db
....
. Check the upgrade by running a manual sync to make sure that nothing
unexpected went wrong:
+
....
blockerbugs sync
....
. Start httpd back up:
+
....
service httpd start
....


@@ -0,0 +1,139 @@
= oraculum Infrastructure SOP
https://pagure.io/fedora-qa/oraculum[oraculum] is an app developed
by Fedora QA to aid packagers with maintenance and quality
in Fedora and EPEL releases.
As such, it serves as the backend for Packager Dashboard,
testcloud, Fedora Easy Karma, and Pagure dist-git (versions table).
== Contents
* <<_contact_information>>
* <<_file_locations>>
* <<_building_for_infra>>
* <<_upgrading>>
* <<_watchdog>>
* <<_components>>
== Contact Information
Owner::
Fedora QA Devel
Contact::
#fedora-qa
Persons::
jskladan, lbrabec
Servers::
* In OpenShift.
Purpose::
Hosting the https://pagure.io/fedora-qa/oraculum[oraculum] for packagers
== File Locations
`oraculum/cli.py` - CLI for the app
`oraculum/cli.py debug` - interactive debug interface for the app
== Configuration
Configuration is loaded from the environment in the pod. The default configuration is
set in the playbook: `roles/openshift-apps/oraculum/templates/deploymentconfig.yml`. Remember that the configuration needs
to be changed for each of the various pods (described later).
The configuration values that can be set are listed in `oraculum/config.py` inside
the `openshift_config` function. Apart from that, secrets, tokens, and API keys
are set in the secrets Ansible repository.
== Building for Infra
The application leverages s2i containers. Both the production
and staging instances track the `master` branch of the oraculum
repository. Builds don't happen automatically, but need
to be triggered manually from the OpenShift web console.
== Upgrading
Oraculum is currently configured through Ansible, and all
configuration changes need to be made through Ansible.
The pod initialization is set up so that all database upgrades
happen automatically on startup. That means extra care is needed,
and all deployments that make database changes need to happen on stg first.
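For orientation, the startup behaviour roughly amounts to the sketch below (illustrative only; the WSGI module path passed to gunicorn is a hypothetical placeholder, the real entrypoint is defined in the deployment config template):
[source,python]
----
# Illustrative sketch of the pod start-up sequence, not the actual entrypoint.
# "oraculum.app:app" is a hypothetical placeholder for the real WSGI module path.
import subprocess

# first: apply any pending database migrations
subprocess.run(["python", "oraculum/cli.py", "upgrade_db"], check=True)

# then: serve the API on port 8080 via gunicorn (see oraculum-api-endpoint below)
subprocess.run(["gunicorn", "--bind", "0.0.0.0:8080", "oraculum.app:app"], check=True)
----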
== Deployment WatchDog
The deployment is configured to perform automatic liveness testing.
The first phase runs `cli.py upgrade_db`, and the second
phase consists of the cluster trying to get an HTTP response
from the container on port `8080` on the `oraculum-api-endpoint` pod.
If any of these fail, the cluster automatically reverts
to the previous build; such a failure can be seen on the `Events` tab
in the DeploymentConfig details.
Apart from that, the cluster regularly polls the `oraculum-api-endpoint` pod
for liveness testing. If that fails or times out, a pod restart occurs.
Such an event can be seen in the `Events` tab of the DeploymentConfig.
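To manually verify the same condition the cluster probes for, you can query the API endpoint from inside the pod; a minimal sketch (the root path `/` is an assumption, any route returning a successful response would do):
[source,python]
----
# Minimal check mirroring the cluster's HTTP liveness probe on port 8080.
# The root path "/" is an assumption; any route returning 2xx works.
from urllib.request import urlopen

with urlopen("http://127.0.0.1:8080/", timeout=5) as resp:
    print(resp.status)  # a 2xx status means the api endpoint is answering
----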
== Cache clearing
oraculum doesn't handle any garbage collection in the cache. In some
situations, such as stale data in the cache (for example when
bugzilla data doesn't refresh due to bugs or optimization choices)
or a db cache that has grown too large, it can be beneficial or even necessary to clear the cache completely. That can be done by deleting all rows from the `cached_data` table:
`DELETE FROM cached_data;`
After that, to minimize downtime, it's recommended to manually re-sync
the generic providers via `CACHE._refresh`, in the following order
(in the pod terminal, via the debug shell):
[source,python]
----
# start the interactive debug shell in the pod terminal first:
#   python oraculum/cli.py debug
# then, inside the shell, refresh the generic providers in this order:
CACHE._refresh("fedora_releases")
CACHE._refresh("bodhi_updates")
CACHE._refresh("bodhi_overrides")
CACHE._refresh("package_versions_generic")
CACHE._refresh("pagure_groups")
CACHE._refresh("koschei_data")
CACHE._refresh("packages_owners_json")
----
and finally build up the static package caches manually via:
`oraculum.utils.celery_utils.celery_sync_static_package_caches()`
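A minimal sketch of that call from the same interactive debug shell (assuming the `oraculum` package is importable there, as it already is for the `CACHE._refresh` calls above):
[source,python]
----
# Rebuild the static package caches manually, run inside the debug shell.
from oraculum.utils import celery_utils

celery_utils.celery_sync_static_package_caches()
----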
For a more lightweight cleanup, removing just the PR, bug,
and abrt caches can do the trick:
`DELETE FROM cached_data WHERE provider LIKE 'packager-dashboard__all_package_bugs%';`
`DELETE FROM cached_data WHERE provider LIKE 'packager_dashboard_package_prs%';`
`DELETE FROM cached_data WHERE provider LIKE 'packager-dashboard_abrt_issues%';`
== Components of Deployment
The Oraculum deployment consists of several pods that run together.
=== oraculum-api-endpoint
Provides the API response rendering endpoint.
It runs via gunicorn in multiple threads.
=== oraculum-worker
Periodic and ad-hoc sync requests are processed by these workers,
managed via celery. The pods are replicated, and each pod spawns 4 workers.
=== oraculum-beat
Sends periodic sync requests to the workers.
=== oraculum-flower
Provides an overview of the celery/worker queues via HTTP.
The current state of the worker load can be seen in https://packager-dashboard.fedoraproject.org/_flower/[Flower].
=== oraculum-redis
Provides a deployment-local redis instance.


@@ -0,0 +1,99 @@
= testdays Infrastructure SOP
https://pagure.io/fedora-qa/testdays-web/[testdays] is an app developed
by Fedora QA to aid with managing testday events for the community.
== Contents
* <<_contact_information>>
* <<_file_locations>>
* <<_building_for_infra>>
* <<_upgrading>>
* <<_watchdog>>
* <<_components>>
== Contact Information
Owner::
Fedora QA Devel
Contact::
#fedora-qa
Persons::
jskladan, smukher
Servers::
* In OpenShift.
Purpose::
Hosting the https://pagure.io/fedora-qa/testdays-web/[testdays] app for QA and the community
== File Locations
`testdays/cli.py` - CLI for the app
`resultsdb/cli.py` - CLI for ResultsDB
== Configuration
Configuration is loaded from the environment in the pod. The default configuration is
set in the playbook: `roles/openshift-apps/testdays/templates/deploymentconfig.yml`. Remember that the configuration needs
to be changed for both pods (testdays and resultsdb).
The configuration values that can be set are listed in `testdays/config.py` and
`resultsdb/config.py` inside the `openshift_config` function.
Apart from that, secrets, tokens, and API keys are set
in the secrets Ansible repository.
== Building for Infra
The application leverages s2i containers. Both the production
and staging instances track the `master`
branch of the testdays-web repository; the resultsdb instance
tracks the `legacy_testdays` branch on both prod and stg.
Builds don't happen automatically, but need
to be triggered manually from the OpenShift web console.
== Upgrading
Testdays is currently configured through Ansible, and all
configuration changes need to be made through Ansible.
The pod initialization is set up so that all database upgrades
happen automatically on startup. That means extra care is needed,
and all deployments that make database changes need to happen on stg first.
== Deployment sanity test
The deployment is configured to perform automatic sanity testing.
The first phase runs `cli.py upgrade_db`, and the second
phase consists of the cluster trying to get an HTTP response
from the container on port `8080` on the `testdays` pod.
If any of these fail, the cluster automatically reverts
to the previous build; such a failure can be seen on the `Events` tab
in the DeploymentConfig details.
== Deployment WatchDog
The deployment is configured to perform automatic liveness testing.
The first phase runs `cli.py upgrade_db`, and the second
phase consists of the cluster trying to get an HTTP response
from the container on port `8080` on the `testdays` and `resultsdb` pods.
If any of these fail, the cluster automatically reverts
to the previous build; such a failure can be seen on the `Events` tab
in the DeploymentConfig details.
Apart from that, the cluster regularly polls the `testdays` and `resultsdb` pods
for liveness testing. If that fails or times out, a pod restart occurs.
Such an event can be seen in the `Events` tab of the DeploymentConfig.
== Components of Deployment
=== Testdays
The base testdays app, which provides both the backend and the frontend
inside a single deployment.
=== ResultsDB
A forked state of the upstream ResultsDB that has OpenShift changes
applied on top of it, without pulling in the other changes that
are in the upstream branch. Available on https://pagure.io/taskotron/resultsdb/tree/legacy_testdays[Pagure].