Added the infra SOPs ported to asciidoc.
This commit is contained in:
parent
8a7f111a12
commit
a0301e30f1
148 changed files with 18575 additions and 17 deletions
281
modules/sysadmin_guide/pages/layered-image-buildsys.adoc
Normal file
@@ -0,0 +1,281 @@

= Layered Image Build System

The
https://docs.pagure.org/releng/layered_image_build_service.html[Fedora
Layered Image Build System] is used to build Layered Container Images in
the Fedora Infrastructure via Koji. It is often referred to as
https://github.com/projectatomic/osbs-client[OSBS] (OpenShift Build
Service), the upstream project it is based on.

== Contents

[arabic]
. Contact Information
. Overview
. Setup
. Outage

== Contact Information

Owner::
Clement Verna (cverna)
Contact::
#fedora-admin, #fedora-releng, #fedora-noc, sysadmin-main,
sysadmin-releng
Location::
osbs-control01, osbs-master01, osbs-node01, osbs-node02,
registry.fedoraproject.org, candidate-registry.fedoraproject.org
+
osbs-control01.stg, osbs-master01.stg, osbs-node01.stg,
osbs-node02.stg, registry.stg.fedoraproject.org,
candidate-registry.stg.fedoraproject.org
+
x86_64 koji buildvms
Purpose::
Layered Container Image Builds

== Overview

The build system is set up so that Fedora Layered Image maintainers
submit a build to Koji with the `fedpkg container-build` command for a
repository in the `container` namespace within
https://src.fedoraproject.org/projects/container/*[DistGit]. This
triggers the build to be scheduled in
https://www.openshift.org/[OpenShift] via the
https://github.com/projectatomic/osbs-client[osbs-client] tooling, which
creates a custom
https://docs.openshift.org/latest/dev_guide/builds.html[OpenShift Build]
that uses the pre-made buildroot container image that we have
created. The https://github.com/projectatomic/atomic-reactor[Atomic
Reactor] (`atomic-reactor`) utility runs within the buildroot and
prepares the build container where the actual build executes. It
also uploads the
https://fedoraproject.org/wiki/Koji/ContentGenerators[Content Generator]
metadata back to https://fedoraproject.org/wiki/Koji[Koji] and uploads
the built image to the candidate docker registry. The build runs on a
host with iptables rules restricting access to the docker bridge. This
further limits the buildroot's access to the outside world and helps
ensure that all sources of information come from Fedora.

Completed layered image builds are hosted in a candidate docker registry,
from which the image is pulled to perform tests.

== Setup

The Layered Image Build System is currently set up as follows (a more
detailed view is available in the
https://docs.pagure.org/releng/layered_image_build_service.html[RelEng
Architecture Document]):

....
=== Layered Image Build System Overview ===

     +--------------+                        +-----------+
     |              |                        |           |
     |   koji hub   +----+                   |  batcave  |
     |              |    |                   |           |
     +--------------+    |                   +----+------+
                         |                        |
                         V                        |
             +----------------+                   V
             |                |           +----------------+
             |  koji builder  |           |                +-----------+
             |                |           | osbs-control01 +--------+  |
             +-+--------------+           |                +-----+  |  |
               |                          +----------------+     |  |  |
               |                                                 |  |  |
               |                                                 |  |  |
               |                                                 |  |  |
               V                                                 |  |  |
    +----------------+                                           |  |  |
    |                |                                           |  |  |
    |  osbs-master01 +------------------------------+           [ansible]
    |                +-------+                      |            |  |  |
    +----------------+       |                      |            |  |  |
         ^                   |                      |            |  |  |
         |                   |                      |            |  |  |
         |                   V                      V            |  |  |
         |        +-----------------+       +----------------+   |  |  |
         |        |                 |       |                |   |  |  |
         |        |   osbs-node01   |       |  osbs-node02   |   |  |  |
         |        |                 |       |                |   |  |  |
         |        +-----------------+       +----------------+   |  |  |
         |               ^                           ^           |  |  |
         |               |                           |           |  |  |
         |               |                           +-----------+  |  |
         |               |                                          |  |
         |               +------------------------------------------+  |
         |                                                             |
         +-------------------------------------------------------------+
....

=== Deployment

From batcave, run the following playbook:

....
$ sudo rbac-playbook groups/osbs/deploy-cluster.yml
....

This deploys the OpenShift cluster used by OSBS. Currently the playbook
deploys two clusters (x86_64 and aarch64). Ansible tags can be used to
deploy only one of them if needed, for example
[.title-ref]#osbs-x86-deploy-openshift#.

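As a rough sketch of limiting a run to one tag, the helper below only
assembles and prints the command line rather than running anything. It
assumes `rbac-playbook` takes `-t <tag>` for this playbook the same way
it does in the `groups/buildvm.yml -t osbs` invocation used elsewhere in
this document. `deploy_cluster_cmd` is a hypothetical name used for
illustration.

```shell
# Print (do not run) the deploy command, optionally limited to one Ansible tag.
# deploy_cluster_cmd is a hypothetical helper, not part of rbac-playbook.
deploy_cluster_cmd() {
    playbook="groups/osbs/deploy-cluster.yml"
    if [ -n "${1:-}" ]; then
        # Assumption: -t selects a tag, as in "groups/buildvm.yml -t osbs".
        printf 'sudo rbac-playbook %s -t %s\n' "$playbook" "$1"
    else
        printf 'sudo rbac-playbook %s\n' "$playbook"
    fi
}

# Example: print the command that deploys only the x86_64 cluster.
deploy_cluster_cmd osbs-x86-deploy-openshift
```

Printing the command first makes it easy to review the tag before
running it by hand on batcave.
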
If the openshift-ansible playbook fails, it can be easier to run it
directly from osbs-control01 in verbose mode (add `-v`, up to `-vvvv`,
to the `ansible-playbook` commands).

....
$ ssh osbs-control01.iad2.fedoraproject.org
$ sudo -i
# cd /root/openshift-ansible
# ansible-playbook -i cluster-inventory playbooks/prerequisites.yml
# ansible-playbook -i cluster-inventory playbooks/deploy_cluster.yml
....

Once these playbooks have succeeded, you can configure OSBS on the
cluster with the following playbook:

....
$ sudo rbac-playbook groups/osbs/configure-osbs.yml
....

When this is done, get the new koji service token so that its value can
be updated in the private repository:

....
$ ssh osbs-master01.iad2.fedoraproject.org
$ sudo -i
# oc -n osbs-fedora sa get-token koji
dsjflksfkgjgkjfdl ....
....

Save the token in the private ansible repo in
[.title-ref]#files/osbs/production/x86-64-osbs-koji#. Once this is done,
run the builder playbook to update the token on the builders.

....
$ sudo rbac-playbook groups/buildvm.yml -t osbs
....

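Saving the token could look like the following minimal sketch.
`save_koji_token` is a hypothetical helper, the location of the private
repository checkout is an assumption supplied by the caller, and whether
the file should end with a newline is also an assumption.

```shell
# Write a service-account token into a checkout of the private repository.
# save_koji_token is a hypothetical helper; the checkout location is an
# assumption and must be passed in by the caller.
save_koji_token() (
    token="$1"
    repo_dir="$2"
    target="$repo_dir/files/osbs/production/x86-64-osbs-koji"

    # Refuse to overwrite the stored token with an empty value.
    [ -n "$token" ] || { echo "empty token" >&2; exit 1; }

    mkdir -p "$(dirname "$target")"
    umask 077    # a newly created token file is readable by its owner only
    # Assumption: a trailing newline in the token file is acceptable.
    printf '%s\n' "$token" > "$target"
)

# Usage (placeholders, not real values):
#   save_koji_token '<token printed by oc>' /path/to/private-repo-checkout
```

The function body runs in a subshell, so the `umask` change and the
variables do not leak into the calling shell.
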
=== Operation

Koji Hub schedules the containerBuild on a koji builder via the
koji-containerbuild-hub plugin. The builder then submits the build
to OpenShift via the koji-containerbuild-builder plugin, which uses the
osbs-client Python API that wraps the OpenShift API along with a custom
OpenShift Build JSON payload.

The build is then scheduled in OpenShift and its logs are captured by
the koji plugins. Inside the buildroot, atomic-reactor uploads the
built container image and provides the metadata to koji's content
generator.

== Outage

If Koji is down, builds can't be scheduled, but repairing Koji is
outside the scope of this document.

If either the candidate-registry.fedoraproject.org or
registry.fedoraproject.org Container Registry is unavailable, builds
that push to or pull from it will fail, but repairing the registries is
also outside the scope of this document.

=== OSBS Failures

OSBS itself has several known failure modes. The recovery procedures
for them are listed below.

==== Ran out of disk space

Docker uses a lot of disk space. The osbs-nodes have been allotted what
is considered to be ample disk space for builds (since they are
automatically cleaned up periodically), but it is still possible for a
node to run out.

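Before cleaning up, it can help to confirm that disk space is really the
problem. The sketch below reports how full a filesystem is. It assumes
Docker keeps its data in its default location, `/var/lib/docker`, and
the 90% threshold is an arbitrary example rather than a value from this
SOP. `disk_usage_percent` is a hypothetical helper name.

```shell
# Print the used-space percentage (digits only) of the filesystem holding a path.
disk_usage_percent() {
    # df -P prints one POSIX-format line per filesystem;
    # column 5 is the capacity, for example "42%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Assumption: Docker stores its data under its default /var/lib/docker.
# Fall back to / so the sketch also runs on a machine without Docker.
docker_dir=/var/lib/docker
[ -d "$docker_dir" ] || docker_dir=/

usage=$(disk_usage_percent "$docker_dir")
# 90 is an arbitrary example threshold, not a value from this SOP.
if [ "$usage" -ge 90 ]; then
    echo "WARNING: $docker_dir is ${usage}% full"
else
    echo "OK: $docker_dir is ${usage}% full"
fi
```

Running this on each osbs-node shows which node, if any, actually needs
the cleanup.
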
To resolve this, run the following commands:

....
# These commands will clean up old/dead docker containers from old OpenShift
# Pods

$ for i in $(sudo docker ps -a | awk '/Exited/ { print $1 }'); do sudo docker rm $i; done

$ for i in $(sudo docker images -q -f 'dangling=true'); do sudo docker rmi $i; done


# This command should only be run on osbs-master01 (it won't work on the
# nodes)
#
# This command will clean up old builds and related artifacts in OpenShift
# that are older than 30 days (We can get more aggressive about this if
# necessary, the main reason these still exist is in the event we need to
# debug something. All build info we care about is stored in Koji.)

$ oadm prune builds --orphans --keep-younger-than=720h0m0s --confirm
....

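The `--keep-younger-than` value in the prune command, `720h0m0s`, is
simply 30 days expressed in hours. If a different retention period is
wanted, a small helper such as the hypothetical `days_to_duration` below
can produce the value in the same format.

```shell
# Convert a number of days into the hours-based duration string used by
# the prune command in this section (for example 720h0m0s for 30 days).
days_to_duration() {
    printf '%dh0m0s\n' "$(( $1 * 24 ))"
}

days_to_duration 30    # prints 720h0m0s
days_to_duration 14    # prints 336h0m0s
```

The result can be passed as `--keep-younger-than="$(days_to_duration 14)"`.
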
==== A node is broken, how to remove it from the cluster?

If a node is having an issue, the following command will effectively
remove it from the cluster temporarily by marking it unschedulable.

In this example, we are removing osbs-node01:

....
$ oadm manage-node osbs-node01.phx2.fedoraproject.org --schedulable=false
....

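Scheduling is a toggle: `--schedulable=false` stops new builds from
landing on the node, and `--schedulable=true` returns it to service once
it has been repaired. The sketch below is a dry-run helper that only
prints the command for either direction. `manage_node_cmd` is a
hypothetical name.

```shell
# Print the command that marks a node schedulable or unschedulable.
# manage_node_cmd is a hypothetical dry-run helper; it runs nothing itself.
manage_node_cmd() {
    case "$2" in
        true|false) ;;
        *) echo "usage: manage_node_cmd <node> <true|false>" >&2; return 2 ;;
    esac
    printf 'oadm manage-node %s --schedulable=%s\n' "$1" "$2"
}

manage_node_cmd osbs-node01.phx2.fedoraproject.org false   # take the node out
manage_node_cmd osbs-node01.phx2.fedoraproject.org true    # put it back
```
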
==== Container Builds are unable to access resources on the network

Sometimes Container Builds fail and the logs show that the
buildroot is unable to access networked resources (docker registry, dnf
repos, etc).

This is caused by a bug in OpenShift v1.3.1 (the current upstream release
at the time of this writing) where an OpenVSwitch flow is left behind
when a Pod is destroyed instead of being deleted along with the
Pod.

Confirming the issue unfortunately takes several steps, since it is not a
cluster-wide issue but isolated to the node experiencing the problem.

First, the koji createContainer task has a log file called
openshift-incremental.log. In it you will find a key:value pair in some
JSON output similar to the following:

....
'openshift_build_selflink': u'/oapi/v1/namespaces/default/builds/cockpit-f24-6'
....

The last field of the value, in this example `cockpit-f24-6`, is the
OpenShift build identifier. Next, ssh into `osbs-master01` and find out
which node the build ran on.

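Extracting the identifier from the selflink can be done with plain shell
parameter expansion. This is a minimal sketch, and
`build_id_from_selflink` is a hypothetical helper name.

```shell
# Print the last path component of an OpenShift selflink, i.e. the build name.
build_id_from_selflink() {
    # ${1##*/} removes the longest prefix that ends in "/".
    printf '%s\n' "${1##*/}"
}

build_id_from_selflink /oapi/v1/namespaces/default/builds/cockpit-f24-6
# prints cockpit-f24-6
```
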
....
# On osbs-master01
# Note: the output won't be pretty, but it gives you the info you need

$ sudo oc get build cockpit-f24-6 -o yaml | grep osbs-node
....

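If the `grep` output is too noisy, a small filter can reduce it to just
the host names. This sketch assumes node host names start with
`osbs-node`, as they do in this document. `osbs_nodes_in` is a
hypothetical name.

```shell
# Read build YAML on stdin and print each distinct osbs-node host name once.
# Assumption: node host names start with "osbs-node", as in this document.
osbs_nodes_in() {
    grep -o 'osbs-node[0-9A-Za-z.-]*' | sort -u
}

# Usage on osbs-master01 (the build name is the example used in this section):
#   sudo oc get build cockpit-f24-6 -o yaml | osbs_nodes_in
```
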
Once you know which machine you need, ssh into it and run the following:

....
$ sudo docker run --rm -ti buildroot /bin/bash

# now attempt to run a curl command

$ curl https://google.com
# This should get refused, but if this node is experiencing the networking
# issue then this command will hang and eventually time out
....

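Because a hang can take a long time to surface, it may be convenient to
give `curl` a time limit and interpret its exit status. The sketch below
relies on the documented curl exit codes 7 (failed to connect) and 28
(operation timed out). The 20 second limit in the usage comment and the
name `classify_curl_status` are illustrative choices, not part of this
SOP.

```shell
# Map a curl exit status to what it suggests about the node.
# 7 = failed to connect (refused), 28 = operation timed out (see curl(1)).
classify_curl_status() {
    case "$1" in
        7)  echo "refused: node networking looks healthy" ;;
        28) echo "timed out: node likely has the stale-flow issue" ;;
        0)  echo "connected: outbound access is not being restricted" ;;
        *)  echo "inconclusive: curl exit status $1" ;;
    esac
}

# Usage inside the buildroot container (20 seconds is an arbitrary limit):
#   curl --max-time 20 -sS https://google.com; classify_curl_status $?
```

Any other exit status, such as a DNS failure, is reported as
inconclusive and needs a closer look by hand.
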
How to fix:

Reboot the affected node. When the node
comes back up, OpenShift will rebuild the flow tables on OpenVSwitch and
things will be back to normal.

....
systemctl reboot
....