= Layered Image Build System
The
https://docs.pagure.org/releng/layered_image_build_service.html[Fedora
Layered Image Build System] is used to build layered container images in
the Fedora Infrastructure via Koji. It is often referred to as
https://github.com/projectatomic/osbs-client[OSBS] (OpenShift Build
Service), after the upstream project it is based on.
== Contents
* <<_contact_information>>
* <<_overview>>
* <<_setup>>
* <<_outage>>
== Contact Information
Owner::
Clement Verna (cverna)
Contact::
#fedora-admin, #fedora-releng, #fedora-noc, sysadmin-main,
sysadmin-releng
Location::
osbs-control01, osbs-master01, osbs-node01, osbs-node02,
registry.fedoraproject.org, candidate-registry.fedoraproject.org
+
osbs-control01.stg, osbs-master01.stg, osbs-node01.stg,
osbs-node02.stg, registry.stg.fedoraproject.org,
candidate-registry.stg.fedoraproject.org
+
x86_64 koji buildvms
Purpose::
Layered Container Image Builds
== Overview
The build system is set up such that Fedora Layered Image maintainers
submit a build to Koji via the `fedpkg container-build` command from a
`container` namespace repository in
https://src.fedoraproject.org/projects/container/*[DistGit]. This
triggers the build to be scheduled in
https://www.openshift.org/[OpenShift] via the
https://github.com/projectatomic/osbs-client[osbs-client] tooling,
which creates a custom
https://docs.okd.io/latest/cicd/builds/understanding-image-builds.html[OpenShift Build]
that uses the pre-made buildroot container image we have created. The
https://github.com/projectatomic/atomic-reactor[Atomic Reactor]
(`atomic-reactor`) utility runs within the buildroot and preps the
build container where the actual build action will execute; it also
uploads the
https://docs.pagure.org/koji/content_generators/[Content Generator]
metadata back to https://fedoraproject.org/wiki/Koji[Koji] and pushes
the built image to the candidate docker registry. All of this runs on a
host with iptables rules restricting access to the docker bridge, which
further limits the buildroot's access to the outside world and ensures
that all sources of information come from Fedora.
Completed layered image builds are hosted in a candidate docker registry,
which is then used to pull the image and perform tests.
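
For example, a maintainer can submit a build from a checkout of the
container's DistGit repository (the `cockpit` repository below is just
an illustration):

....
$ fedpkg clone container/cockpit
$ cd cockpit
$ fedpkg container-build
....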
== Setup
The Layered Image Build System setup is currently as follows (more
detailed view available in the
https://docs.pagure.org/releng/layered_image_build_service.html[RelEng
Architecture Document]):
....
=== Layered Image Build System Overview ===
+--------------+ +-----------+
| | | |
| koji hub +----+ | batcave |
| | | | |
+--------------+ | +----+------+
| |
V |
+----------------+ V
| | +----------------+
| koji builder | | +-----------+
| | | osbs-control01 +--------+ |
+-+--------------+ | +-----+ | |
| +----------------+ | | |
| | | |
| | | |
| | | |
V | | |
+----------------+ | | |
| | | | |
| osbs-master01 +------------------------------+ [ansible]
| +-------+ | | | |
+----------------+ | | | | |
^ | | | | |
| | | | | |
| V V | | |
| +-----------------+ +----------------+ | | |
| | | | | | | |
| | osbs-node01 | | osbs-node02 | | | |
| | | | | | | |
| +-----------------+ +----------------+ | | |
| ^ ^ | | |
| | | | | |
| | +-----------+ | |
| | | |
| +------------------------------------------+ |
| |
+-------------------------------------------------------------+
....
=== Deployment
From batcave you can run the following:
....
$ sudo rbac-playbook groups/osbs/deploy-cluster.yml
....
This deploys the OpenShift cluster used by OSBS. Currently the playbook
deploys 2 clusters (x86_64 and aarch64). Ansible tags can be used to
deploy only one of these if needed, for example
_osbs-x86-deploy-openshift_, as shown below.
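
For example, to deploy only the x86_64 cluster (a sketch, assuming the
tag is passed straight through to ansible-playbook):

....
$ sudo rbac-playbook groups/osbs/deploy-cluster.yml -t osbs-x86-deploy-openshift
....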
If the openshift-ansible playbook fails, it can be easier to run it
directly from osbs-control01 in verbose mode.
....
$ ssh osbs-control01.rdu3.fedoraproject.org
$ sudo -i
# cd /root/openshift-ansible
# ansible-playbook -vv -i cluster-inventory playbooks/prerequisites.yml
# ansible-playbook -vv -i cluster-inventory playbooks/deploy_cluster.yml
....
Once these playbooks have completed successfully, you can configure OSBS
on the cluster. For that, use the following playbook:
....
$ sudo rbac-playbook groups/osbs/configure-osbs.yml
....
When this is done, we need to get the new koji service account token and
update its value in the private repository:
....
$ ssh osbs-master01.rdu3.fedoraproject.org
$ sudo -i
# oc -n osbs-fedora sa get-token koji
dsjflksfkgjgkjfdl ....
....
The token needs to be saved in the private ansible repo in
`files/osbs/production/x86-64-osbs-koji`. Once this is done,
you can run the builder playbook to update that token:
....
$ sudo rbac-playbook groups/buildvm.yml -t osbs
....
=== Operation
Koji Hub schedules the `containerBuild` task on a koji builder via the
`koji-containerbuild-hub` plugin. The builder then submits the build to
OpenShift via the `koji-containerbuild-builder` plugin, which uses the
`osbs-client` python API that wraps the OpenShift API along with a
custom OpenShift Build JSON payload.
The Build is then scheduled in OpenShift and its logs are captured by
the koji plugins. Inside the buildroot, atomic-reactor uploads the
built container image and provides the metadata to koji's content
generator.
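
To follow a container build from the koji side, you can inspect the task
and stream its logs with the standard koji CLI (the task ID below is
hypothetical):

....
$ koji taskinfo 12345678
$ koji watch-logs 12345678
....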
== Outage
If Koji is down, then builds can't be scheduled, but repairing Koji is
outside the scope of this document.
Builds will also fail if either the
_candidate-registry.fedoraproject.org_ or _registry.fedoraproject.org_
container registry is unavailable, but repairing those is also outside
the scope of this document.
=== OSBS Failures
OpenShift Build System itself can have various types of failures that
are known about and the recovery procedures are listed below.
==== Ran out of disk space
Docker uses a lot of disk space, and while the osbs-nodes have been
allocated what is considered ample disk space for builds (which are
automatically cleaned up periodically), it is possible this will run
out.
To resolve this, run the following commands:
....
# These commands will clean up old/dead docker containers from old OpenShift
# Pods
$ for i in $(sudo docker ps -a | awk '/Exited/ { print $1 }'); do sudo docker rm $i; done
$ for i in $(sudo docker images -q -f 'dangling=true'); do sudo docker rmi $i; done
# This command should only be run on osbs-master01 (it won't work on the
# nodes)
#
# This command will clean up old builds and related artifacts in OpenShift
# that are older than 30 days (We can get more aggressive about this if
# necessary, the main reason these still exist is in the event we need to
# debug something. All build info we care about is stored in Koji.)
$ oadm prune builds --orphans --keep-younger-than=720h0m0s --confirm
....
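
To verify how much space was reclaimed, you can compare disk usage
before and after the cleanup (assuming Docker's default storage
location):

....
$ df -h /var/lib/docker
....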
==== A node is broken, how to remove it from the cluster?
If a node is having an issue, the following command will effectively
remove it from the cluster temporarily by marking it unschedulable.
In this example, we are removing osbs-node01:
....
$ oadm manage-node osbs-node01.rdu3.fedoraproject.org --schedulable=false
....
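
To put the node back in service later, the same `oadm manage-node`
command can be used (and, if needed, running pods can be evacuated from
the broken node first):

....
# Move pods off the broken node (run on osbs-master01)
$ oadm manage-node osbs-node01.rdu3.fedoraproject.org --evacuate
# Re-enable scheduling once the node is healthy
$ oadm manage-node osbs-node01.rdu3.fedoraproject.org --schedulable=true
....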
==== Container Builds are unable to access resources on the network
Sometimes the Container Builds will fail and the logs will show that the
buildroot is unable to access networked resources (docker registry, dnf
repos, etc).
This is because of a bug in OpenShift v1.3.1 (current upstream release
at the time of this writing) where an OpenVSwitch flow is left behind
when a Pod is destroyed instead of the flow being deleted along with the
Pod.
The method to confirm the issue is unfortunately multi-step, since it's
not a cluster-wide issue but is isolated to the node experiencing the
problem.
First, in the koji createContainer task there is a log file called
`openshift-incremental.log`, and in it you will find a key:value pair in
some JSON output similar to the following:
....
'openshift_build_selflink': u'/oapi/v1/namespaces/default/builds/cockpit-f24-6'
....
The last field of the value, in this example `cockpit-f24-6`, is the
OpenShift build identifier. We need to ssh into `osbs-master01` and
find out which node the build ran on.
....
# On osbs-master01
# Note: the output won't be pretty, but it gives you the info you need
$ sudo oc get build cockpit-f24-6 -o yaml | grep osbs-node
....
Once you know which machine you need, ssh into it and run the following:
....
$ sudo docker run --rm -ti buildroot /bin/bash
# now attempt to run a curl command
$ curl https://google.com
# This should get refused, but if this node is experiencing the networking
# issue then this command will hang and eventually time out
....
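
To inspect the OpenVSwitch flow table directly on the affected node (an
optional check; OpenShift SDN's bridge is typically named `br0` and
speaks OpenFlow 1.3):

....
$ sudo ovs-ofctl -O OpenFlow13 dump-flows br0
....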
How to fix:
Reboot the affected node. When the node comes back up, OpenShift will
rebuild the flow tables on OpenVSwitch and things will be back to
normal.
....
$ sudo systemctl reboot
....