= Layered Image Build System

The
https://docs.pagure.org/releng/layered_image_build_service.html[Fedora
Layered Image Build System], often referred to as
https://github.com/projectatomic/osbs-client[OSBS] (OpenShift Build
Service) after the upstream project it is based on, is used to build
Layered Container Images in the Fedora Infrastructure via Koji.

== Contents

* <<_contact_information>>
* <<_overview>>
* <<_setup>>
* <<_outage>>

== Contact Information

Owner::
Clement Verna (cverna)
Contact::
#fedora-admin, #fedora-releng, #fedora-noc, sysadmin-main,
sysadmin-releng
Location::
osbs-control01, osbs-master01, osbs-node01, osbs-node02,
registry.fedoraproject.org, candidate-registry.fedoraproject.org
+
osbs-control01.stg, osbs-master01.stg, osbs-node01.stg,
osbs-node02.stg, registry.stg.fedoraproject.org,
candidate-registry.stg.fedoraproject.org
+
x86_64 koji buildvms
Purpose::
Layered Container Image Builds

== Overview

The build system is set up such that Fedora Layered Image maintainers
submit a build to Koji via the `fedpkg container-build` command in a
`container` namespace within
https://src.fedoraproject.org/projects/container/*[DistGit]. This
triggers the build to be scheduled in
https://www.openshift.org/[OpenShift] via the
https://github.com/projectatomic/osbs-client[osbs-client] tooling,
which creates a custom
https://docs.okd.io/latest/cicd/builds/understanding-image-builds.html[OpenShift Build]
that uses the pre-made buildroot container image that we have
created. The https://github.com/projectatomic/atomic-reactor[Atomic
Reactor] (`atomic-reactor`) utility runs within the buildroot and
preps the build container where the actual build action will execute. It
also handles uploading the
https://docs.pagure.org/koji/content_generators/[Content Generator]
metadata back to https://fedoraproject.org/wiki/Koji[Koji] and uploading
the built image to the candidate docker registry. This all runs on a
host with iptables rules restricting access to the docker bridge, which
is how we further limit the access of the buildroot to the outside
world, verifying that all sources of information come from Fedora.

Completed layered image builds are hosted in a candidate docker registry,
which is then used to pull the image and perform tests.

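As a sketch, a maintainer's submission looks roughly like the following
(the package name here is illustrative, not a real requirement):

....
$ fedpkg clone container/cockpit
$ cd cockpit
$ fedpkg container-build
....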
== Setup

The Layered Image Build System setup is currently as follows (more
detailed view available in the
https://docs.pagure.org/releng/layered_image_build_service.html[RelEng
Architecture Document]):

....
=== Layered Image Build System Overview ===

+--------------+                       +-----------+
|              |                       |           |
|  koji hub    +----+                  |  batcave  |
|              |    |                  |           |
+--------------+    |                  +----+------+
                    |                       |
                    V                       |
          +----------------+                V
          |                |        +----------------+
          |  koji builder  |        |                |       +-----------+
          |                |        | osbs-control01 +-------+           |
          +-+--------------+        |                |       |           |
            |                       +----------------+       |           |
            |                                                |           |
            V                                                |           |
+----------------+                                           |           |
|                |                                           |           |
| osbs-master01  +-------------------------------------------+ [ansible] |
|                |                                           |           |
+---+------------+                                           |           |
  ^ |                                                        |           |
  | +----------+---------------+                             |           |
  |            |               |                             |           |
  |            V               V                             |           |
  |   +-----------------+   +----------------+               |           |
  |   |                 |   |                |               |           |
  |   |   osbs-node01   |   |  osbs-node02   |               |           |
  |   |                 |   |                |               |           |
  |   +-----------------+   +----------------+               |           |
  |        ^                     ^                           |           |
  |        |                     |                           |           |
  |        |                     +---------------------------+           |
  |        |                                                 |           |
  |        +-------------------------------------------------+           |
  |                                                          |           |
  +----------------------------------------------------------+-----------+
....

=== Deployment

From batcave you can run the following:

....
$ sudo rbac-playbook groups/osbs/deploy-cluster.yml
....

This is going to deploy the OpenShift cluster used by OSBS. Currently
the playbook deploys two clusters (x86_64 and aarch64). Ansible tags can
be used to deploy only one of them if needed, for example
_osbs-x86-deploy-openshift_.

If the openshift-ansible playbook fails, it can be easier to run it
directly from osbs-control01 and use the verbose mode.

....
$ ssh osbs-control01.rdu3.fedoraproject.org
$ sudo -i
# cd /root/openshift-ansible
# ansible-playbook -i cluster-inventory playbooks/prerequisites.yml
# ansible-playbook -i cluster-inventory playbooks/deploy_cluster.yml
....

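Assuming the deployment succeeded, a quick sanity check is to list the
cluster nodes from the master and confirm they are `Ready` (this is a
standard OpenShift client call, shown here as a suggestion):

....
$ ssh osbs-master01.rdu3.fedoraproject.org
$ sudo oc get nodes
....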
Once these playbooks have been successful, you can configure OSBS on the
cluster. For that, use the following playbook:

....
$ sudo rbac-playbook groups/osbs/configure-osbs.yml
....

When this is done, we need to get the new koji service account token and
update its value in the private repository:

....
$ ssh osbs-master01.rdu3.fedoraproject.org
$ sudo -i
# oc -n osbs-fedora sa get-token koji
dsjflksfkgjgkjfdl ....
....

The token needs to be saved in the private ansible repo in
`files/osbs/production/x86-64-osbs-koji`. Once this is done,
you can run the builder playbook to update that token:

....
$ sudo rbac-playbook groups/buildvm.yml -t osbs
....

=== Operation

Koji Hub schedules the `containerBuild` on a koji builder via the
`koji-containerbuild-hub` plugin; the builder then submits the build
to OpenShift via the `koji-containerbuild-builder` plugin, which uses the
`osbs-client` python API that wraps the OpenShift API along with a custom
OpenShift Build JSON payload.

The Build is then scheduled in OpenShift and its logs are captured by
the koji plugins. Inside the buildroot, atomic-reactor will upload the
built container image as well as provide the metadata to koji's content
generator.

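To follow an individual container build from the koji side, the standard
koji CLI can be used (the task ID below is illustrative):

....
$ koji watch-task 12345678
$ koji watch-logs 12345678
....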
== Outage

If Koji is down, then builds can't be scheduled, but repairing Koji is
outside the scope of this document.

Builds will also be affected if either the
_candidate-registry.fedoraproject.org_ or _registry.fedoraproject.org_
container registry is unavailable, but repairing those is likewise
outside the scope of this document.

=== OSBS Failures

The OpenShift Build System itself can have various types of failures that
are known about, and the recovery procedures are listed below.

==== Ran out of disk space

Docker uses a lot of disk space, and while the osbs-nodes have been
allocated what is considered to be ample disk space for builds (since
builds are automatically cleaned up periodically), it is possible this
will run out.

To resolve this, run the following commands:

....
# These commands will clean up old/dead docker containers from old
# OpenShift Pods

$ for i in $(sudo docker ps -a | awk '/Exited/ { print $1 }'); do sudo docker rm $i; done
$ for i in $(sudo docker images -q -f 'dangling=true'); do sudo docker rmi $i; done


# This command should only be run on osbs-master01 (it won't work on the
# nodes)
#
# This command will clean up old builds and related artifacts in OpenShift
# that are older than 30 days (We can get more aggressive about this if
# necessary, the main reason these still exist is in the event we need to
# debug something. All build info we care about is stored in Koji.)

$ oadm prune builds --orphans --keep-younger-than=720h0m0s --confirm
....

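To see whether the cleanup freed enough space, check the docker storage
mount (the mount point may differ per host):

....
$ df -h /var/lib/docker
....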
==== A node is broken, how to remove it from the cluster?

If a node is having an issue, the following command will effectively
remove it from the cluster temporarily by marking it unschedulable.

In this example, we are removing osbs-node01:

....
$ oadm manage-node osbs-node01.phx2.fedoraproject.org --schedulable=false
....

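Once the node is healthy again, it can be returned to service by making
it schedulable once more:

....
$ oadm manage-node osbs-node01.phx2.fedoraproject.org --schedulable=true
....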
==== Container Builds are unable to access resources on the network

Sometimes the Container Builds will fail and the logs will show that the
buildroot is unable to access networked resources (docker registry, dnf
repos, etc.).

This is because of a bug in OpenShift v1.3.1 (current upstream release
at the time of this writing) where an OpenVSwitch flow is left behind
when a Pod is destroyed, instead of the flow being deleted along with
the Pod.

The method to confirm the issue is unfortunately multi-step, since it's
not a cluster-wide issue but isolated to the node experiencing the
problem.

First, in the koji createContainer task there is a log file called
openshift-incremental.log, and in there you will find a key:value pair
in some JSON output similar to the following:

....
'openshift_build_selflink': u'/oapi/v1/namespaces/default/builds/cockpit-f24-6'
....

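The build identifier is simply the last path component of that selflink
value; as a small shell sketch (using the example value above):

```shell
# Extract the OpenShift build id from the 'openshift_build_selflink'
# value found in openshift-incremental.log (example value shown above).
selflink='/oapi/v1/namespaces/default/builds/cockpit-f24-6'
build_id="${selflink##*/}"   # keep only the part after the last '/'
echo "$build_id"             # prints: cockpit-f24-6
```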
The last field of the value, in this example `cockpit-f24-6`, is the
OpenShift build identifier. We need to ssh into `osbs-master01` and get
information about which node that build ran on.

....
# On osbs-master01
# Note: the output won't be pretty, but it gives you the info you need

$ sudo oc get build cockpit-f24-6 -o yaml | grep osbs-node
....

Once you know which machine you need, ssh into it and run the following:

....
$ sudo docker run --rm -ti buildroot /bin/bash

# now attempt to run a curl command
$ curl https://google.com

# This should get refused, but if this node is experiencing the networking
# issue then this command will hang and eventually time out
....

How to fix:

Reboot the affected node that's experiencing the issue. When the node
comes back up, OpenShift will rebuild the flow tables on OpenVSwitch and
things will be back to normal.

....
systemctl reboot
....
|