Also includes updates to reflect the new deployment style, and a brief guide on testing in staging. Signed-off-by: Jeremy Cline <jeremycline@linux.microsoft.com>
= cloud-image-uploader SOP

Upload Cloud images to public clouds after they are built in Koji.

Source code: https://pagure.io/cloud-image-uploader

== Contact Information

Owner::
  Cloud SIG, Jeremy Cline (jcline)
Contact::
  #cloud:fedoraproject.org (Matrix)
Servers::
  - https://console-openshift-console.apps.ocp.stg.fedoraproject.org/project-details/ns/cloud-image-uploader[Stage]
  - https://console-openshift-console.apps.ocp.fedoraproject.org/project-details/ns/cloud-image-uploader[Production]

Purpose::
  Upload Cloud images to public clouds.

== Description

cloud-image-uploader is an AMQP message consumer (run via `fedora-messaging
consume`) that processes Pungi compose messages published on the
`org.fedoraproject.*.pungi.compose.status.change` AMQP topic. When a compose
enters the `FINISHED` or `FINISHED_INCOMPLETE` state, the service downloads
any images in the compose and uploads them to the relevant cloud provider.

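The filtering described above can be sketched in a few lines of Python. This is an illustration only, not the consumer's actual code: the topic pattern and the two states come from this SOP, while the helper names and the assumption that the compose state lives under a `status` key in the message body are hypothetical.

```python
# Sketch of the message filter; helper names are hypothetical.
TOPIC_PATTERN = "org.fedoraproject.*.pungi.compose.status.change"
UPLOAD_STATES = {"FINISHED", "FINISHED_INCOMPLETE"}


def topic_matches(pattern: str, topic: str) -> bool:
    """Match an AMQP topic where `*` stands for exactly one dot-separated word."""
    pattern_words = pattern.split(".")
    topic_words = topic.split(".")
    if len(pattern_words) != len(topic_words):
        return False
    return all(p in ("*", t) for p, t in zip(pattern_words, topic_words))


def should_upload(topic: str, body: dict) -> bool:
    """Return True when a compose message should trigger an upload."""
    # Assumption: the compose state is stored under the "status" key.
    return topic_matches(TOPIC_PATTERN, topic) and body.get("status") in UPLOAD_STATES
```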
The service does not accept any incoming connections and only depends on the
RabbitMQ message broker and the relevant cloud provider's APIs.

It requires a few gigabytes of temporary space to download the images before
uploading them to the cloud provider. It is heavily I/O bound and the most
computationally expensive thing it does is decompress the images.

== General Configuration

The Fedora Ansible repository contains the
https://pagure.io/fedora-infra/ansible/blob/main/f/roles/openshift-apps/cloud-image-uploader[OpenShift
application definition]. The playbook to create the OpenShift application is
located at `playbooks/openshift-apps/cloud-image-uploader.yml`.

The Ansible playbook creates multiple fedora-messaging configuration files from
the `config.toml` template. All application configuration is either in the
fedora-messaging configuration file or in environment variables. The
environment variables are used for secrets and vary based on which service the
container handles.

The fedora-messaging configuration file in use by a container is defined in the
`FEDORA_MESSAGING_CONF` environment variable.

== Deploying

The OpenShift deployment consists of a single image and multiple containers
using that image, one container for each content type (containers, azure, aws,
and gcp). The only variations between the containers are the secrets volumes
mounted, the secrets injected via environment variables, and the
`FEDORA_MESSAGING_CONF` environment variable, which points to one of the
fedora-messaging configurations in `/etc/fedora-messaging/`.

=== Staging

The staging BuildConfig builds a container from
https://pagure.io/cloud-image-uploader/tree/main[the main branch]. You need to
trigger a build manually, either from the web UI or the CLI.

Although composes are not done in staging, it's still possible to test in
staging manually. First, start a debug terminal to enter a running container.
Next, find an AMQP message for a
https://apps.fedoraproject.org/datagrepper/v2/search?topic=org.fedoraproject.prod.pungi.compose.status.change[production
compose] in the `FINISHED` or `FINISHED_INCOMPLETE` state. You can trigger the
fedora-messaging consumer to process the message by running:

....
FEDORA_MESSAGING_CONF=/etc/fedora-messaging/service-config.toml fedora-messaging reconsume <message-id>
....

=== Production

The production BuildConfig builds a container from
https://pagure.io/cloud-image-uploader/tree/prod[the prod branch]. Just like
staging, you need to trigger a build manually. After deploying to staging, the
main branch can be merged into the production branch to "promote" it:

....
$ git checkout prod && git merge --ff-only main
....

=== Azure

Images are uploaded whenever a compose contains `vhd-compressed` images.
Images are first uploaded to a container in the storage account and then
imported into an Image Gallery.

Credentials for Azure are provided using environment variables and are
discovered by the Azure Python SDK automatically.

==== Image Cleanup

Image clean-up is automated.

The storage account is configured to delete any blob in the container older
than 1 week and should require no manual attention. Nothing in the container is
required after the VHD is imported to the Image Gallery.

Images in the Gallery are cleaned up by the image uploader after a new image
has been uploaded. For complete details on the image cleanup policy refer to
the consumer code, but at the time of this writing the policy is as follows:

- Any image that has an end-of-life field that is in the past is removed.

- Only the latest 7 images that are marked as "excluded from latest = True"
  within an image definition are retained. When an image is marked as "excluded
  from latest = False", new virtual machines that don't reference an explicit
  image version will boot using the newest such image (following semver). All
  images are uploaded with "excluded from latest = True" and are only marked as
  "excluded from latest = False" after testing.

- Only the latest 7 images in the Rawhide image definitions are retained,
  regardless of whether they are marked "excluded from latest = False".

At the moment, testing and promotion to "excluded from latest = False" is a
manual process, but in the future it will be automated to happen regularly
(weekly, perhaps).

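The Gallery retention policy listed in this section can be modelled roughly as follows. This is a sketch under stated assumptions, not the consumer's implementation: the `GalleryImage` type and `images_to_remove` helper are hypothetical, versions are assumed to be already parsed into comparable tuples, and expired images are assumed to be dropped before the "latest 7" count is taken. The consumer code remains the authority on the real behaviour.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

KEEP = 7  # retention count stated in the policy


@dataclass(frozen=True)
class GalleryImage:
    """Hypothetical stand-in for one image version in an image definition."""
    version: tuple  # parsed semver, e.g. (42, 20250101, 0)
    end_of_life: Optional[datetime]
    excluded_from_latest: bool


def images_to_remove(images, now, is_rawhide=False):
    """Return the images in one image definition that the policy would delete."""
    # Rule 1: anything whose end-of-life date has passed is removed.
    expired = [i for i in images if i.end_of_life is not None and i.end_of_life < now]
    live = [i for i in images if i not in expired]
    # Rules 2 and 3: Rawhide counts every image, other definitions only
    # count images still marked "excluded from latest = True".
    candidates = [i for i in live if is_rawhide or i.excluded_from_latest]
    newest_first = sorted(candidates, key=lambda i: i.version, reverse=True)
    return expired + newest_first[KEEP:]
```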
==== Authentication

The following environment variables are used:

....
AZURE_SUBSCRIPTION_ID - Identifies the subscription within an Azure tenant (our tenant only has 1)
AZURE_CLIENT_ID - The application ID used during authentication.
AZURE_SECRET - The application secret used during authentication.
AZURE_TENANT - Identifies the Azure tenant.
....

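When debugging authentication failures from a debug terminal, it can help to confirm that all four variables are actually set in the container. A minimal sketch, assuming only the variable names listed in this section; the `missing_azure_vars` helper is hypothetical and not part of the uploader.

```python
import os

# Variable names as documented in this SOP.
REQUIRED_AZURE_VARS = (
    "AZURE_SUBSCRIPTION_ID",
    "AZURE_CLIENT_ID",
    "AZURE_SECRET",
    "AZURE_TENANT",
)


def missing_azure_vars(environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_AZURE_VARS if not env.get(name)]


if __name__ == "__main__":
    # Prints only variable names, never the secret values themselves.
    print("missing:", missing_azure_vars())
```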
If you have access to the Fedora Project tenant, these values are available in
the https://portal.azure.com[web portal] under the Microsoft Entra ID service
in the "App registrations" tab. To manage things via the CLI you can do `dnf
install azure-cli`. All commands below assume you've logged in with `az login`.

There are two app registrations, `fedora-cloud-image-uploader` and
`fedora-cloud-image-uploader-staging`. These were created by running:

....
$ az ad app create --display-name fedora-cloud-image-uploader
....

==== Authorization

Images are placed in two resource groups (containers for arbitrary resources).
`fedora-cloud-staging` is used for the staging deployment, and `fedora-cloud`
is used for the production deployment.

The app registrations are granted access to their respective resource group by
assigning them a role on the resource group. The role definition can be seen with:

....
$ az role definition list --name "Image Uploader"
....

This role is then assigned to the app registration with:

....
$ az role assignment create --assignee "fedora-cloud-image-uploader" \
    --role "Image Uploader" \
    --scope "/subscriptions/{subscription_id}/resourceGroups/fedora-cloud"
....

In the event that additional permissions are required, the role definition can
be updated to include them.

==== Credential rotation

At the moment, credentials are set to expire and will need to be periodically
rotated. To do so via the CLI:

....
$ az ad app list -o table  # Find the application to issue new secrets for and set CLIENT_ID to its "Id" field
$ touch azure_secret
$ chmod 600 azure_secret
$ SECRET_NAME="Some useful name for the secret"
$ az ad app credential reset --id "$CLIENT_ID" --append --display-name "$SECRET_NAME" --years 1 --query password --output tsv > azure_secret
....

=== AWS

AWS images are uploaded by this service to the Fedora AWS account. Cleanup is
handled by the general Fedora AWS resource cleaner, which uses the tags applied
to a resource to determine when to remove it.

Images are first uploaded to the `fedora-s3-bucket-fedimg` S3 bucket, and then
imported as EC2 snapshots to the region configured in the `base_region` setting
of the `consumer_config.aws` section. The snapshot is then replicated to all
the regions listed in the `ami_regions` setting.

==== New Regions

In the event that a new region becomes available and users want Fedora Cloud
images there, simply add the new region to the `ami_regions` list.

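To illustrate how `base_region` and `ami_regions` relate, the sketch below computes which regions would receive a copy once the `consumer_config.aws` section has been loaded into a dict. It is an assumption-laden illustration: the `replication_targets` helper is hypothetical, the region names are examples rather than the production configuration, and skipping the base region when it also appears in `ami_regions` is an assumption rather than documented behaviour.

```python
def replication_targets(aws_config: dict) -> list:
    """Regions the snapshot would be copied to after the initial import."""
    base = aws_config["base_region"]
    targets = []
    for region in aws_config.get("ami_regions", []):
        # Assumption: the base region already holds the import, so it is
        # skipped, and duplicate entries are ignored.
        if region != base and region not in targets:
            targets.append(region)
    return targets


# Region names are illustrative, not the production configuration.
example = {"base_region": "us-east-1", "ami_regions": ["us-east-1", "eu-west-1"]}
example["ami_regions"].append("ap-south-2")  # enabling a newly available region
print(replication_targets(example))
```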
=== Containers

Containers are pushed to the `registry.fedoraproject.org` and `quay.io/fedora/`
registries. These include the Fedora Toolbox, Fedora and Fedora Minimal, ELN,
and Atomic Desktop images.

==== Adding New Container Images

The configuration contains a mapping of variants to registry repositories in
the `consumer_config.container.repos` configuration section. In order to handle
a new container image, a new mapping should be added to this dictionary.

=== Google Compute Engine

Google Compute Engine images are published under the `fedora-cloud` project in
Google Cloud Platform. The flow is similar to other clouds, as the tarball is
uploaded to the `fedora-cloud-image-upload` bucket and then imported as a
machine image. The bucket has a lifecycle configuration to delete an object 3
days after it has been created so old tarballs are cleaned up automatically
after being imported.

==== Credentials

The service uses the
`fedora-image-uploader@fedora-cloud.iam.gserviceaccount.com` service account.
New credentials can be issued for that account under the IAM & Admin panel,
although the current credentials do not expire.

==== Permissions

The service account is assigned the `Fedora Image Uploader` role, which should
grant it the minimal permissions required to manage images. The current
permission list is as follows:

- compute.globalOperations.get
- compute.images.create
- compute.images.createTagBinding
- compute.images.delete
- compute.images.deleteTagBinding
- compute.images.deprecate
- compute.images.get
- compute.images.getFromFamily
- compute.images.list
- compute.images.listEffectiveTags
- compute.images.listTagBindings
- compute.images.setLabels
- compute.images.update
- compute.images.useReadOnly
- resourcemanager.projects.get

In the event that the application requires new permissions, edit the `Fedora
Image Uploader` role to include the new permissions.

==== Cleanup

Machine images are labeled to include their `end-of-life` date. After this date
is reached, the image is removed. Images are uploaded as "deprecated" by
default. Every two weeks an image in an Image Family is promoted and marked as
not deprecated. Deprecated images are removed after two weeks.

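The removal rules in this section can be expressed as a small predicate. This is a sketch only: the `should_remove` helper is hypothetical, the `end-of-life` label is assumed to have been parsed into a timezone-aware `datetime` already, and the two-week window for deprecated images is assumed to be measured from the image's creation time, which this SOP does not state.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DEPRECATED_GRACE = timedelta(weeks=2)  # "removed after two weeks"


def should_remove(
    now: datetime,
    end_of_life: Optional[datetime],
    deprecated: bool,
    created: datetime,
) -> bool:
    """Decide whether a machine image would be cleaned up."""
    # An image past its end-of-life date is removed regardless of status.
    if end_of_life is not None and end_of_life <= now:
        return True
    # Assumption: the grace period for deprecated images starts at creation.
    return deprecated and now - created >= DEPRECATED_GRACE
```

An image that was promoted (marked not deprecated) is kept by this predicate until its end-of-life date passes.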