Update Koschei SOP

Closes #78
Mikolaj Izdebski 2024-06-06 07:16:52 +02:00
parent d035653f65
commit fe02817e0e


Koschei is a continuous integration system that runs package scratch
builds after a dependency change or after a period of time elapses,
and reports package buildability status to interested parties.
Production instance::
https://koschei.fedoraproject.org/
Staging instance::
https://koschei.stg.fedoraproject.org/
== Contact Information
Owner::
mizdebsk
Contact::
#fedora-admin
Location::
Fedora infrastructure OpenShift
Purpose::
continuous integration system
== Description
Koschei consists of frontend and backend.
Frontend is a web application written in Python using the Flask
framework. It is run under Apache httpd with mod_wsgi as a WSGI
application. Frontend displays information to users and allows
editing package groups and changing priorities.
Backend consists of several loosely coupled microservices, including:
* `watcher` - listens to events on the Fedora messaging bus for
complete builds and changes build states in the database.
* `repo-resolver` - resolves package dependencies in a given repo
using hawkey and compares them with the previous iteration to get a
dependency diff. It resolves all packages in the newest repo available
in Koji. The output is the basis for scheduling new builds.
* `build-resolver` - resolves complete builds in the repo in which
they were done in Koji. Produces the dependency differences visible in
the frontend.
* `scheduler` - schedules new builds based on multiple criteria:
** dependency priority - dependency changes since the last build,
weighted by their distance in the dependency graph
** manual and static priorities - set manually in the frontend. Manual
priority is reset after each build; static priority persists
** time priority - time elapsed since the last build.
* `polling` - polls the same types of events as `watcher` without
reliance on the messaging bus. Additionally takes care of package list
synchronization and other regularly executed tasks.
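As an illustration, the scheduling criteria above amount to combining
several priority components per package. The following is a
hypothetical sketch only -- function and parameter names are
illustrative, not Koschei's actual code:

```python
# Hypothetical sketch of how the scheduler's priority components combine.
# Names and values are illustrative, not Koschei's actual implementation.

def effective_priority(dependency_priority: int,
                       manual_priority: int,
                       static_priority: int,
                       time_priority: int) -> int:
    """Sum the priority components described above; higher means build sooner."""
    return dependency_priority + manual_priority + static_priority + time_priority

# A package with a sizeable dependency diff and a manual priority bump:
print(effective_priority(300, 100, 0, 50))  # prints 450
```

Note that manual priority contributes only until the next build (it is
reset afterwards), while static priority keeps contributing.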
== Deployment
Koschei deployment is managed by an Ansible playbook:
....
sudo rbac-playbook openshift-apps/koschei.yml
....
The above playbook is idempotent, which means that running it has no
effect when everything is already configured as expected.
Koschei is fully containerized and is deployed on OpenShift.
Koschei is stateless. It doesn't use any persistent storage. All
non-volatile information is stored in a PostgreSQL database, which is
not part of Koschei, but an external service that Koschei depends on.
There is one common container image for the different Koschei
workloads -- frontend and backend containers are all run from the
same image.
Koschei images are built by upstream on Quay.io. Upstream implements
continuous delivery of container images to the Quay.io registry. Code
pushed to the fedora-prod or fedora-stage git branches in the upstream
GitHub repository is automatically built into container images and
pushed to the Quay.io registry with appropriate tags.
Pristine upstream Koschei images are then imported into the internal
OpenShift registry -- Fedora OpenShift does not build any Koschei
container images itself. Image import into OpenShift is always done
manually by a Koschei sysadmin, usually by running a manual Ansible
playbook. This ensures that developers who can push code to the
GitHub repository don't have any control over the Fedora
infrastructure deployment process.
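Conceptually, the manual import resembles an `oc import-image` call
like the following. The namespace, image stream name, tag, and
Quay.io repository path here are placeholders for illustration, not
the actual values used by the playbook:

....
oc -n koschei import-image koschei:fedora-prod \
    --from=quay.io/<upstream-org>/koschei:fedora-prod --confirm
....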
Upstream images don't contain any Fedora-specific configuration. Such
configuration is mounted into containers as read-only volumes backed
by Kubernetes Secrets.
Frontend is run as a Kubernetes Deployment with multiple replicas for
high availability. Frontend supports rolling updates, which allows it
to be updated with no user-visible downtime.
Each of the backend services has its own Kubernetes Deployment with a
single replica. Because backend downtime is not user-visible, rolling
updates are not used for the backend.
In addition to frontend and backend, there is also an `admin`
Deployment, which runs a container that does nothing but wait for a
sysadmin to `rsh` into it for running manual admin commands.
Besides the aforementioned Kubernetes Deployments, some ad-hoc tasks
are run as Kubernetes Jobs, either created on a time schedule from
CronJobs or created by Koschei sysadmins running manual Ansible
playbooks.
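To get an overview of these workloads, a sysadmin can list them with
standard `oc` commands (the `koschei` namespace is assumed here for
illustration):

....
oc -n koschei get deployments,cronjobs,jobs
....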
== Upgrade
Upgrading Koschei to a new upstream version is done by running one of
the manual Ansible playbooks:
....
sudo rbac-playbook manual/upgrade/koschei-rolling.yml
sudo rbac-playbook manual/upgrade/koschei-full.yml
....
The first, rolling update playbook should be used when the given
update is known not to change the database schema. In this case the
new upstream image is simply imported into the internal OpenShift
registry and all Deployments are restarted. OpenShift takes care of
doing a rolling update of the frontend, so that no downtime is
experienced by users. Backend Pods are also recreated with the new
image.
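Progress of the frontend rolling update can be watched with `oc`
(namespace and Deployment name here are assumptions for
illustration):

....
oc -n koschei rollout status deploy/frontend
....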
The second, full update playbook is used when the given update changes
the database schema. This playbook pauses all Deployments and
terminates all Pods. Users experience frontend downtime. When
everything is stopped, the playbook creates Kubernetes Jobs to run
database migrations and perform other maintenance tasks. Once the
Jobs are done, new Deployments are rolled out.
== Admin shell
Certain Koschei operational tasks are done with the `koschei-admin`
CLI tool. The container where the tool is available can be accessed
with:
....
oc project koschei
oc rsh deploy/admin
....
== Suspending Koschei operation
To stop builds from being scheduled, scaling down the `scheduler`
Deployment to zero replicas is enough. This is recommended for
planned Koji outages. It is not strictly necessary, as Koschei can
recover from Koji errors and network errors automatically, but when
Koji builders are stopped, they may cause unexpected build failures
that would be reported to users. Other backend services can be left
running as they automatically restart themselves on Koji and network
errors.
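Scaling the scheduler down and back up might look like this
(namespace and Deployment name are assumptions for illustration):

....
oc -n koschei scale deploy/scheduler --replicas=0
# later, to resume scheduling:
oc -n koschei scale deploy/scheduler --replicas=1
....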
== Limiting Koji usage
No builds are scheduled when Koji load is higher than a certain
threshold. That should prevent scheduling builds during mass
rebuilds, so it's not necessary to stop scheduling during those.
== Setting admin announcement
Koschei can display an announcement in the web UI. This is mostly
useful to inform users about outages or other problems.
To set an announcement, run:
....
koschei-admin set-notice "Koschei operation is currently suspended due to scheduled Koji outage"
....
or:
....
koschei-admin set-notice "Submitting scratch builds by Koschei is currently disabled due to Fedora 23 mass rebuild"
....
To clear the announcement, run:
....
koschei-admin clear-notice
....
Packages can be added to one or more groups.
To add a new group named `mynewgroup`, run:
....
koschei-admin add-group mynewgroup
....
To add a new group named `mynewgroup` and populate it with some
packages, run:
....
koschei-admin add-group mynewgroup pkg1 pkg2 pkg3
....
Users can influence the build order by changing package
priority. Any user can change manual priority, which is reset after
package is rebuilt. Admins can additionally set static priority, which
is not affected by package rebuilds.
To set static priority of package `foo` to value `100`, run:
....
koschei-admin --collection f27 set-priority --static foo 100
....

To branch a new collection, run:

....
koschei-admin branch-collection f27 f28 -d 'Fedora 27' -t f28 --bugzilla-version
....
Then you can optionally verify that the collection configuration is
correct by visiting https://koschei.fedoraproject.org/collections
and examining the configuration of the newly branched collection.
== Edit Koschei group to make it global
To turn `mygroup` group created by user `someuser` into a global group
`thegroup`, run:
....
koschei-admin edit-group someuser/mygroup --make-global --new-name thegroup
....