QA SysAdmin SOP Refresh

parent a833c0f052, commit 886018403d
3 changed files with 273 additions and 112 deletions

@@ -10,9 +10,8 @@ freeze exception bugs in branched Fedora releases.
* <<_file_locations>>
* <<_building_for_infra>>
* <<_upgrading>>
** <<_upgrade_preparation_all_upgrades>>
** <<_minor_upgrades_no_database_changes>>
** <<_major_upgrades_with_database_changes>>
* <<_watchdog>>
* <<_sync>>

== Contact Information
@@ -20,137 +19,61 @@

Owner::
Fedora QA Devel
Contact::
#fedora-qa
Location::
iad2
Persons::
jskladan, kparal
Servers::
blockerbugs01.iad2, blockerbugs02.iad2, blockerbugs01.stg.iad2
* In OpenShift.
Purpose::
Hosting the https://pagure.io/fedora-qa/blockerbugs[blocker bug tracking application] for QA

== File Locations

`/etc/blockerbugs/settings.py` - configuration for the app

`blockerbugs/cli.py` - cli for the app

=== Node Roles

blockerbugs01.stg.iad2::
the staging instance; it is not load balanced
blockerbugs01.iad2::
one of the load balanced production nodes; it is responsible for
running the bugzilla/bodhi/koji sync
blockerbugs02.iad2::
the other load balanced production node. It does not do any sync
operations.

== Configuration

Configuration is loaded from the environment in the pod. The default configuration is
set in the playbook: `roles/openshift-apps/blockerbugs/templates/deploymentconfig.yml`.

The possible values to set up can be found in `blockerbugs/config.py` inside
the `openshift_config` function. Apart from that, secrets, tokens, and API keys
are set in the secrets Ansible repository.
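
The effective environment-based configuration of a running deployment can also be
inspected straight from the cluster. This is only a sketch: it assumes the
DeploymentConfig is named `blockerbugs` (check the real name with `oc get dc` in the
project first).

....
# list the environment variables injected into the pods of the DeploymentConfig
oc get dc
oc set env dc/blockerbugs --list
....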

== Building for Infra

=== Do not use mock

For whatever reason, the `epel7-infra` koji tag rejects SRPMs with the
`el7.centos` dist tag. Make sure that you build SRPMs with:

....
rpmbuild -bs --define='dist .el7' blockerbugs.spec
....

Also note that this expects the release tarball to be in
`~/rpmbuild/SOURCES/`.

=== Building with Koji

You'll need to ask someone who has rights to build into the `epel7-infra`
tag to make the build for you:

....
koji build epel7-infra blockerbugs-0.4.4.11-1.el7.src.rpm
....

[NOTE]
====
The fun bit of this is that `python-flask` is only available on `x86_64`
builders. If your build is routed to one of the non-x86_64 builders, it will
fail. The only solution available to us is to keep submitting the build
until it's routed to one of the x86_64 builders and doesn't fail.
====

Once the build is complete, it should be automatically tagged into
`epel7-infra-stg` (after a ~15 min delay), so that you can test it on
the blockerbugs staging instance. Once you've verified it's working well,
ask someone with infra rights to move it to the `epel7-infra` tag so that
you can update it in production.

The application leverages s2i containers. The production instance
tracks the `master` branch from the blockerbugs repository, and the staging instance
tracks the `develop` branch. Builds don't happen automatically, but need
to be triggered manually from the OpenShift web console.
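
If you prefer the CLI over the web console, the same manual build can usually be
triggered with `oc`. This is only a sketch and assumes the BuildConfig is called
`blockerbugs`; verify the actual name with `oc get bc` before starting anything.

....
# list BuildConfigs in the project, then start a new s2i build and follow its logs
oc get bc
oc start-build blockerbugs --follow
....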

== Upgrading

Blockerbugs is currently configured through ansible and all
configuration changes need to be done through ansible.

=== Upgrade Preparation (all upgrades)

The pod initialization is set up so that all database upgrades
happen automatically on startup. That means extra care is needed,
and all deployments that make database changes need to happen on stg first.

Blockerbugs is not packaged in EPEL, so the new build needs to exist in
the infrastructure stg repo for deployment to stg, or in the infrastructure
repo for deployments to production. See the blockerbugs documentation for
instructions on building a blockerbugs RPM.
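
Before deploying, it can help to confirm that the build actually landed in the
expected tag. A sketch, assuming your koji client is pointed at the koji instance
that hosts the infra tags (you may need a `--profile` option; the profile name is
not covered by this SOP):

....
# check which blockerbugs builds are tagged for the staging and production infra repos
koji list-tagged epel7-infra-stg blockerbugs
koji list-tagged epel7-infra blockerbugs
....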

== Deployment WatchDog

The deployment is configured to perform automatic liveness testing.
The first phase is running `cli.py upgrade_db`, and the second
phase consists of the cluster trying to get an HTTP response
from the container on port `8080` of the pod.

If any of these fail, the cluster automatically reverts
to the previous build, and such a failure can be seen on the `Events` tab
in the DeploymentConfig details.

Apart from that, the cluster regularly polls the pod
for liveness testing. If that fails or times out, a pod restart occurs.
Such an event can be seen in the `Events` tab of the DeploymentConfig.

=== Minor Upgrades (no database changes)

Run the following on *both* `blockerbugs01.iad2` and
`blockerbugs02.iad2` if updating in production.

[arabic]
. Update ansible with config changes, push changes to the ansible repo:
+
....
roles/blockerbugs/templates/blockerbugs-settings.py.j2
....
. Clear yum cache and update the blockerbugs RPM:
+
....
yum clean expire-cache && yum update blockerbugs
....
. Restart httpd to reload the application:
+
....
service httpd restart
....

== Periodic sync

The blockerbugs app deployment consists of two pods. One serves as both backend and
frontend; the other is spawned every 30 minutes with `cli.py sync` executed.
This synchronizes the data from bugzilla and pagure into the blockerbugs db
(a sketch of running this sync by hand follows the upgrade steps below).

=== Major Upgrades (with database changes)

Run the following on *both* `blockerbugs01.phx2` and
`blockerbugs02.phx2` if updating in production.

[arabic]
. Update ansible with config changes, push changes to the ansible repo:
+
....
roles/blockerbugs/templates/blockerbugs-settings.py.j2
....
. Stop httpd on *all* relevant instances (if load balanced):
+
....
service httpd stop
....
. Clear yum cache and update the blockerbugs RPM on all relevant instances:
+
....
yum clean expire-cache && yum update blockerbugs
....
. Upgrade the database schema:
+
....
blockerbugs upgrade_db
....
. Check the upgrade by running a manual sync to make sure that nothing
unexpected went wrong:
+
....
blockerbugs sync
....
. Start httpd back up:
+
....
service httpd start
....
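
In the OpenShift deployment, an ad-hoc sync can also be kicked off inside a running
pod instead of waiting for the 30-minute periodic job. This is only a sketch: the pod
name is a placeholder to be taken from `oc get pods`, and the `python blockerbugs/cli.py sync`
invocation is an assumption based on the `blockerbugs/cli.py` location noted above.

....
# open a shell in a running blockerbugs pod and run a manual sync
oc get pods
oc rsh <blockerbugs-pod>
python blockerbugs/cli.py sync
....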

139 modules/sysadmin_guide/pages/oraculum.adoc (new file)
@@ -0,0 +1,139 @@

= oraculum Infrastructure SOP

https://pagure.io/fedora-qa/oraculum[oraculum] is an app developed
by Fedora QA to aid packagers with maintenance and quality
in Fedora and EPEL releases.
As such, it serves as the backend for Packager Dashboard,
testcloud, Fedora Easy Karma, and Pagure dist-git (versions table).

== Contents

* <<_contact_information>>
* <<_file_locations>>
* <<_building_for_infra>>
* <<_upgrading>>
* <<_watchdog>>
* <<_components>>

== Contact Information

Owner::
Fedora QA Devel
Contact::
#fedora-qa
Persons::
jskladan, lbrabec
Servers::
* In OpenShift.
Purpose::
Hosting the https://pagure.io/fedora-qa/oraculum[oraculum] application for packagers

== File Locations

`oraculum/cli.py` - cli for the app

`oraculum/cli.py debug` - interactive debug interface for the app

== Configuration

Configuration is loaded from the environment in the pod. The default configuration is
set in the playbook: `roles/openshift-apps/oraculum/templates/deploymentconfig.yml`. Remember that the configuration needs
to be changed for each of the various pods (described later).

The possible values to set up can be found in `oraculum/config.py` inside
the `openshift_config` function. Apart from that, secrets, tokens, and API keys
are set in the secrets Ansible repository.

== Building for Infra

The application leverages s2i containers. Both the production
and staging instances track the `master` branch from the oraculum
repository. Builds don't happen automatically, but need
to be triggered manually from the OpenShift web console.

== Upgrading

Oraculum is currently configured through ansible and all
configuration changes need to be done through ansible.

The pod initialization is set up so that all database upgrades
happen automatically on startup. That means extra care is needed,
and all deployments that make database changes need to happen on stg first.

== Deployment WatchDog

The deployment is configured to perform automatic liveness testing.
The first phase is running `cli.py upgrade_db`, and the second
phase consists of the cluster trying to get an HTTP response
from the container on port `8080` on the `oraculum-api-endpoint` pod.

If any of these fail, the cluster automatically reverts
to the previous build, and such a failure can be seen on the `Events` tab
in the DeploymentConfig details.

Apart from that, the cluster regularly polls the `oraculum-api-endpoint` pod
for liveness testing. If that fails or times out, a pod restart occurs.
Such an event can be seen in the `Events` tab of the DeploymentConfig.
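
The same failure information that the web console shows on the `Events` tab can be
pulled from the CLI. This is a sketch and assumes the DeploymentConfig and the pod
label are named `oraculum-api-endpoint` as above; adjust to the real names in the project.

....
# rollout state and recent events for the DeploymentConfig
oc rollout status dc/oraculum-api-endpoint
oc describe dc/oraculum-api-endpoint
# liveness-probe failures and pod restarts show up here
oc get events --sort-by=.lastTimestamp | grep -i oraculum
oc get pods -l deploymentconfig=oraculum-api-endpoint
....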

== Cache clearing

oraculum doesn't handle any garbage collection in the cache. In some
situations, such as stale data in the cache (for example when
bugzilla data wouldn't refresh due to bugs or optimization choices)
or a db cache that has grown too large, it can be beneficial or even
necessary to clear the cache completely. That can be done by deleting
all rows in the `cached_data` table:

`DELETE FROM cached_data;`

After that, to minimize downtime, it's recommended to manually re-sync the
generic providers via `CACHE._refresh`, in the following order
(in the pod terminal via debug):

[source,python]
----
python oraculum/cli.py debug
CACHE._refresh("fedora_releases")
CACHE._refresh("bodhi_updates")
CACHE._refresh("bodhi_overrides")
CACHE._refresh("package_versions_generic")
CACHE._refresh("pagure_groups")
CACHE._refresh("koschei_data")
CACHE._refresh("packages_owners_json")
----
|
||||
and finally building up the static cache block manually via:
|
||||
`oraculum.utils.celery_utils.celery_sync_static_package_caches()`
|
||||
|
||||
To do a more lightweight cleanup, removing just PRs, bugs,
|
||||
and abrt cache can do the trick:
|
||||
|
||||
`DELETE FROM cached_data WHERE provider LIKE 'packager-dashboard__all_package_bugs%';`
|
||||
|
||||
`DELETE FROM cached_data WHERE provider LIKE 'packager_dashboard_package_prs%';`
|
||||
|
||||
`DELETE FROM cached_data WHERE provider LIKE 'packager-dashboard_abrt_issues%';`
|
||||
|
||||
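
These statements can be run from any SQL client with access to the oraculum database.
A minimal sketch using `psql`, with host, user, and database name as placeholders to be
replaced with the real connection details for the deployment:

....
psql -h <db-host> -U <db-user> -d <db-name> \
     -c "DELETE FROM cached_data WHERE provider LIKE 'packager-dashboard_abrt_issues%';"
....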

== Components of Deployment

The oraculum deployment consists of various pods that run together.

=== oraculum-api-endpoint

Provides the API response rendering endpoint.
Runs via gunicorn in multiple threads.

=== oraculum-worker

Managed via celery; periodic and ad-hoc sync requests are processed
by these pods. The pods are replicated, and each pod spawns 4 workers.

=== oraculum-beat

Sends periodic sync requests to the workers.

=== oraculum-flower

Provides an overview of the celery/worker queues via HTTP.
The current state of the workers' load can be seen in https://packager-dashboard.fedoraproject.org/_flower/[Flower].

=== oraculum-redis

Provides a deployment-local redis instance.

99 modules/sysadmin_guide/pages/testdays.adoc (new file)
@@ -0,0 +1,99 @@

= testdays Infrastructure SOP

https://pagure.io/fedora-qa/testdays-web/[testdays] is an app developed
by Fedora QA to aid with managing testday events for the community.

== Contents

* <<_contact_information>>
* <<_file_locations>>
* <<_building_for_infra>>
* <<_upgrading>>
* <<_watchdog>>
* <<_components>>

== Contact Information

Owner::
Fedora QA Devel
Contact::
#fedora-qa
Persons::
jskladan, smukher
Servers::
* In OpenShift.
Purpose::
Hosting the https://pagure.io/fedora-qa/testdays-web/[testdays] application for QA and the community

== File Locations

`testdays/cli.py` - cli for the app

`resultsdb/cli.py` - cli for ResultsDB

== Configuration

Configuration is loaded from the environment in the pod. The default configuration is
set in the playbook: `roles/openshift-apps/testdays/templates/deploymentconfig.yml`. Remember that the configuration needs
to be changed for both pods (testdays and resultsdb).

The possible values to set up can be found in `testdays/config.py` and
`resultsdb/config.py` inside the `openshift_config` function.
Apart from that, secrets, tokens, and API keys are set
in the secrets Ansible repository.

== Building for Infra

The application leverages s2i containers. Both the production
and staging instances of testdays track the `master`
branch from the testdays-web repository, while the resultsdb instance
tracks the `legacy_testdays` branch on both prod and stg.
Builds don't happen automatically, but need
to be triggered manually from the OpenShift web console.

== Upgrading

Testdays is currently configured through ansible and all
configuration changes need to be done through ansible.

The pod initialization is set up so that all database upgrades
happen automatically on startup. That means extra care is needed,
and all deployments that make database changes need to happen on stg first.

== Deployment sanity test

The deployment is configured to perform automatic sanity testing.
The first phase is running `cli.py upgrade_db`, and the second
phase consists of the cluster trying to get an HTTP response
from the container on port `8080` on the `testdays` pod.

If any of these fail, the cluster automatically reverts
to the previous build, and such a failure can be seen on the `Events` tab
in the DeploymentConfig details.

== Deployment WatchDog

The deployment is configured to perform automatic liveness testing.
The first phase is running `cli.py upgrade_db`, and the second
phase consists of the cluster trying to get an HTTP response
from the container on port `8080` on the `testdays` and `resultsdb` pods.

If any of these fail, the cluster automatically reverts
to the previous build, and such a failure can be seen on the `Events` tab
in the DeploymentConfig details.

Apart from that, the cluster regularly polls the `testdays` and `resultsdb` pods
for liveness testing. If that fails or times out, a pod restart occurs.
Such an event can be seen in the `Events` tab of the DeploymentConfig.
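
To check by hand the same HTTP endpoint that the liveness probe uses, the port can be
forwarded from a running pod. This is only a sketch: the pod name is a placeholder,
to be picked from `oc get pods`.

....
# forward local port 8080 to the pod and check that the app responds
oc get pods
oc port-forward <testdays-pod> 8080:8080 &
curl -I http://localhost:8080/
....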

== Components of Deployment

=== Testdays

The base testdays app that provides both backend and frontend
inside a single deployment.

=== ResultsDB

A forked state of the upstream ResultsDB that has OpenShift changes
applied on top of it, while not introducing any of the other changes that
are in the upstream branch. Available on https://pagure.io/taskotron/resultsdb/tree/legacy_testdays[Pagure].