Communishift: project deletion and notifications SOP

Openshift: WIP best practices

Signed-off-by: David Kirwan <davidkirwanirl@gmail.com>
David Kirwan 2024-12-11 13:20:33 +00:00
parent 25d3f58d7a
commit 5141624ed5
5 changed files with 265 additions and 0 deletions


@@ -12,6 +12,8 @@ If you've never used OpenShift before a good place to start is with
https://www.openshift.org/minishift/[MiniShift], which deploys OpenShift
Origin in a virtual machine.
See the following for some best practices: xref:openshift_bestpractices.adoc[Openshift Best Practices]
=== OpenShift in Fedora Infrastructure
Fedora has two OpenShift deployments:


@@ -0,0 +1,207 @@
== Fedora Infra Openshift Best Practices
This document aims to encourage the use of best practices related to application development and deployment of containerised applications on Kubernetes/Openshift.
NOTE: This is a large topic and we can't possibly cover every element in detail, but it should be enough to act as a primer.
NOTE: Should these best practices be maintained by the KubeDev SIG [4]? If so, should we attempt to resurrect it?
=== References/Resources/Further Reading
- [1] Fedora Infra Flock Hackfest: https://hackmd.io/HxpzTNpITfu0OYmOGRApiw
- [2] Kubernetes health checks: https://blog.kubecost.com/blog/kubernetes-health-check/
- [3] Prometheus metrics format: https://github.com/prometheus/docs/blob/main/content/docs/instrumenting/exposition_formats.md#text-based-format
- [4] Fedora KubeDev SIG: https://fedoraproject.org/wiki/SIGs/KubeDev
- [5] Openshift oauth-proxy: https://github.com/openshift/oauth-proxy
- [6] Fedora Infra migration tracker, DeploymentConfig to Deployment: https://pagure.io/fedora-infrastructure/issue/12142
- [7] Fedora Infra ticket tracker: https://pagure.io/fedora-infrastructure/issues
- [8] 42 Prod Best Practices, The Complete Guide for Developers: https://medium.com/@mahernaija/docker-2024-docker-compose-2024-master-best-practices-the-complete-guide-for-developers-aaf851349240
- [9] Semantic Versioning: https://semver.org/
- [10] Compute resource quotas: https://docs.openshift.com/container-platform/4.17/scalability_and_performance/compute-resource-quotas.html
- [11] Ansible Operator Tutorial: https://sdk.operatorframework.io/docs/building-operators/ansible/tutorial/
- [12] How pods with resource limits are run: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#how-pods-with-resource-limits-are-run
- [13] Enabling monitoring for user-defined projects: https://docs.openshift.com/container-platform/4.8/monitoring/enabling-monitoring-for-user-defined-projects.html
=== Fedora Infra Clusters
Fedora Infra currently manages the following three Openshift clusters:
- Staging (Self Hosted in IAD2, deploy apps via ansible): https://console-openshift-console.apps.ocp.stg.fedoraproject.org/
- Production (Self Hosted in IAD2, deploy apps via ansible): https://console-openshift-console.apps.ocp.fedoraproject.org/
- Communishift (RH Openshift Dedicated deployed in AWS, apps deployed by individual app maintainers in various ways): https://console-openshift-console.apps.fedora.cj14.p1.openshiftapps.com/
Access to the clusters is managed via the Fedora account system (FAS). All Fedora users may authenticate, but access to each project is managed on a per-app basis. Open a ticket at [7] requesting access to a particular app, but ensure you first get approval from the existing app owners.
=== Building containers
- Use Podman over Docker when developing locally.
- How are containers currently built and updated inside Fedora Infra? Since the retirement of OSBS, the builds aren't automated (iirc).
- Use a service to build the containers: Konflux? Imagebuilder? quay.io? The plan (iirc) is that we will use Konflux to do our container building going forward; we're starting off by looking at configuring the Konflux instance to build artifacts. (If you're interested in working on that, reach out to dkirwan to look at it together.)
- Don't consume an image built directly via a BuildConfig with S2I (source to image); instead:
-- Use Fedora as the base image! `quay.io/fedora/fedora:latest`
-- Build and push the built container image to a registry like quay.io.
-- If the application is an official app image, use the fedora namespace: `quay.io/fedora/appname`.
-- Inside Openshift create an ImageStream which points at `quay.io/fedora/appname:v1.0.0.releasename`.
-- For staging, you could possibly use `quay.io/fedora/appname:latest`.
-- When the image changes inside quay.io, the ImageStream will pull down the latest version of the image.
-- Applications should consume the container image via the ImageStream within a Deployment (see the sketch after this list).
-- This prevents problems which only surface during a build, such as missing dependencies.
-- Doing it this way prevents outages or service degradation, as the existing version remains operational should a build run into issues.
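A minimal sketch of this pattern, assuming a hypothetical `appname` image: the ImageStream tracks the tag in quay.io and re-imports it on a schedule, so a push to quay.io rolls out to consumers of the stream.
```
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: appname
spec:
  lookupPolicy:
    local: true # let Deployments in this namespace resolve the stream by name
  tags:
  - name: v1.0.0
    from:
      kind: DockerImage
      name: quay.io/fedora/appname:v1.0.0 # the image pushed to quay.io
    importPolicy:
      scheduled: true # periodically re-import, picking up new pushes to this tag
```
A Deployment can then consume the stream either by referencing `appname:v1.0.0` directly (thanks to `lookupPolicy.local`) or via the `image.openshift.io/triggers` annotation, which updates the pod template whenever the tag is re-imported.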
- Minimise the number of layers. Each `RUN`, `COPY`, and `ADD` instruction in the Containerfile/Dockerfile adds a new layer, which can quickly increase the build time and size of the final container. To combat this, make use of `&&` to chain commands together, so the whole chain counts as a single layer. eg:
```
FROM busybox
RUN echo This is the 1 > 1 \
    && rm -f 1 \
    && echo This is the 2 > 2 \
    && rm -f 2
# ... for about 70 commands, all within a single layer

# rather than one layer per RUN instruction:
FROM busybox
RUN echo This is the 1 > 1
RUN rm -f 1
RUN echo This is the 2 > 2
RUN rm -f 2
# ... for about 70 commands
```
- Use specific build tags eg: `v1.0.2` which follow semantic versioning [9].
- Limit container privileges. By default, containers which run as root cannot run in Openshift: unless the ServiceAccount has been granted elevated privileges (eg: via a security context constraint such as `anyuid`), the container will not start. If you need root access, don't run that part of the application in Openshift at all (if possible).
=== ImageStream
- Changes to the image which an ImageStream points to will automatically cause a roll out of the applications which use that ImageStream.
- This means a single change to the base Fedora image can roll out all applications on the clusters onto the latest image.
=== Handling Dependencies
- All application dependencies should be version pinned and locked within the container to aid reproducible builds. Make use of a dependency management system as per the language's best practices; see the sketch after this list.
- If a container image is vital to Fedora, perhaps its dependencies could also be stored in a local pip/gem/nodejs/rpm repository so that it can always be rebuilt?
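For example, for a Python app (the package names and versions here are purely illustrative), exact pins in `requirements.txt` consumed during the container build:
```
# requirements.txt -- exact pins, so every build resolves the same dependencies
flask==3.0.3
requests==2.32.3

# In the Containerfile -- install only the pinned versions
RUN pip install --no-cache-dir -r requirements.txt
```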
=== DeploymentConfig migration to Deployments
- DeploymentConfig is deprecated and is being phased out (very) soonish; we should replace all DeploymentConfigs with Deployments (see the sketch after this list).
- This is being tracked with a board on pagure: https://pagure.io/fedora-infrastructure/issue/12142
- We should consider breaking this epic up into smaller tickets, creating an individual ticket for each DeploymentConfig-deployed app in Fedora Infra.
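As a rough sketch of what the migration involves (the app name is hypothetical; trigger and strategy details vary per app):
```
# Before: a DeploymentConfig
apiVersion: apps.openshift.io/v1
kind: DeploymentConfig
spec:
  selector:
    app: darwin-app      # plain map selector
  strategy:
    type: Rolling
  triggers:
  - type: ConfigChange   # DC-only; image triggers move to the
  - type: ImageChange    # image.openshift.io/triggers annotation

# After: a Deployment
apiVersion: apps/v1
kind: Deployment
spec:
  selector:
    matchLabels:         # selector gains the matchLabels wrapper
      app: darwin-app
  strategy:
    type: RollingUpdate  # "Rolling" becomes "RollingUpdate"
```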
=== Container security scanning
- Consider deploying ACS (Red Hat Advanced Cluster Security), which looks inside containers and reports what is in them and what the security issues are.
- The registry quay.io has such features already, perhaps use this instead? One less service we need to run and maintain.
=== Security
- Do secure access to applications using something like the `oauth-proxy` [5], especially if working with user data.
- When hosting an app within Openshift, the oauth-proxy may be a better way to secure the app than application-level systems like flask-oidc; see the sketch after this list.
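A minimal sketch of the sidecar pattern from the oauth-proxy README [5]; the image tag, service account name, ports, and certificate/secret paths below are illustrative assumptions:
```
containers:
- name: oauth-proxy
  image: quay.io/openshift/origin-oauth-proxy:latest # assumption: pick a suitable image/tag
  args:
  - --provider=openshift
  - --openshift-service-account=darwin-sa # SA annotated with the OAuth redirect reference
  - --upstream=http://localhost:8080      # the app container in the same pod
  - --https-address=:8443
  - --tls-cert=/etc/tls/private/tls.crt
  - --tls-key=/etc/tls/private/tls.key
  - --cookie-secret-file=/etc/proxy/secrets/session_secret
  ports:
  - containerPort: 8443
- name: darwin-container
  image: quay.io/fedora/appname:v1.0.0
  ports:
  - containerPort: 8080 # only reachable via the proxy
```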
=== Monitoring applications
- Do expose endpoints in the application to aid in monitoring [2].
-- Liveness probes to detect a non-responsive application
-- Readiness probes to ensure that a service is ready to receive traffic
-- Startup probes for identifying slow application startup and delaying traffic until it is prepared to handle requests
- Do expose a metrics endpoint in the application to display current metrics indicating health and application metadata, eg: `/metrics` in the Prometheus text format [3]. See the example:
```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: darwin-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: darwin-app
  template:
    metadata:
      labels:
        app: darwin-app
    spec:
      containers:
      - name: darwin-container
        image: nginx:latest
        ports:
        - containerPort: 443
        readinessProbe:
          httpGet:
            path: /darwin-probes
            port: 443
            scheme: HTTPS
          initialDelaySeconds: 5
          periodSeconds: 10
          timeoutSeconds: 5
        livenessProbe:
          httpGet:
            path: /healthz
            port: 443
            scheme: HTTPS
          initialDelaySeconds: 5
          periodSeconds: 10
          timeoutSeconds: 5
        startupProbe:
          httpGet:
            path: /healthz
            port: 443
            scheme: HTTPS
          failureThreshold: 30
          periodSeconds: 10
```
- Use the monitoring stack for user-defined projects [13] (the released name of the user workload monitoring stack, now that it is out of tech preview).
- Hook into the Openshift monitoring stack [13] and then use Prometheus exporters to push metrics and alerts to Zabbix, maybe (blocked until Zabbix is more widely used within Fedora Infra). See the sketch after this list.
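Once monitoring for user-defined projects is enabled [13], a `ServiceMonitor` tells the stack to scrape the app's metrics endpoint. A minimal sketch (the names and port are assumptions matching the examples in this document):
```
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: darwin-app
spec:
  selector:
    matchLabels:
      app: darwin-app # matches the labels on the app's Service
  endpoints:
  - port: web         # named port on the Service exposing the app
    path: /metrics    # the Prometheus-format endpoint [3]
    interval: 30s
```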
=== Preferred source control managers
- GitHub
- Pagure
- GitLab
- Forgejo
=== Preferred methods of deploying applications within Fedora Infra
- Fedora Infra uses ansible playbooks/roles as the primary means to deploy applications.
- An ansible role should be developed to deploy the app within Fedora Infra; see the sketch after this list.
- Private variables should be stored in the ansible-private repo.
- Ensure sane defaults are available within the `defaults` directory of the role.
- An alternative is to develop a Helm chart or an Ansible-based Kubernetes Operator [11] to do this work.
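A minimal sketch of such a role, assuming the `kubernetes.core` collection is available on the control host and a hypothetical `darwin-app` role layout:
```
# roles/darwin-app/defaults/main.yml -- sane defaults, overridable per environment
darwin_app_namespace: darwin-app
darwin_app_replicas: 1

# roles/darwin-app/tasks/main.yml -- apply the templated manifests to the cluster
- name: Deploy the darwin-app Deployment
  kubernetes.core.k8s:
    state: present
    namespace: "{{ darwin_app_namespace }}"
    definition: "{{ lookup('template', 'deployment.yml.j2') | from_yaml }}"
```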
=== Limits, requests
- When deploying the application, ensure you add resource requests and limits to the Deployment [10]; see the example:
```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: darwin-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: darwin-app
  template:
    metadata:
      labels:
        app: darwin-app
    spec:
      containers:
      - name: darwin-container
        image: nginx:latest
        ports:
        - containerPort: 443
        resources:
          requests:
            cpu: 100m # millicores
            memory: 100Mi
            ephemeral-storage: "2Gi"
          limits:
            cpu: 1
            memory: 200Mi
            ephemeral-storage: "4Gi"
...
```
- Do set limits
- Do set resource requests
- If a Container exceeds its memory limit, it will probably be terminated.
- If a Container exceeds its memory request, it is likely that its Pod will be evicted whenever the node runs out of memory.
=== Scaling
- When designing an app ensure the following:
-- It is capable of recovering from a restart/crash (eg: killed, moved, and/or crashed containers).
-- Build the ability to scale the app into the architecture and design, eg: multiple instances behind a load balancer; in a Kubernetes Deployment this is the `replicas` field. See the sketch after this list.
-- Ensure the app includes high availability in its design, eg: 3 instances, so the application stays up even if one instance is down.
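Where the app is stateless, a HorizontalPodAutoscaler can manage the replica count automatically. A minimal sketch targeting the example Deployment above (the thresholds are assumptions):
```
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: darwin-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: darwin-app
  minReplicas: 3 # keep 3 instances for high availability
  maxReplicas: 6
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80 # scale up when average CPU passes 80% of requests
```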


@@ -9,4 +9,5 @@ The following SOPs are related to the administration of the Communishift Cluster
- xref:sop_communishift_onboard_tenant.adoc[Onboarding a Communishift tenant]
- xref:sop_communishift_tenant_quota.adoc[Configuring the Resourcequota for a tenant]
- xref:sop_communishift_create_sharedvolume.adoc[Create the SharedVolume object which manages tenant storage]
- xref:sop_communishift_cleanup_script.adoc[Run the Communishift Clean Up Script]


@@ -0,0 +1,41 @@
= Run the Communishift Clean Up Script
== Resources
- [1] Playbook: https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/manual/communishift_send_email_notifications.yml
- [2] Role: https://pagure.io/fedora-infra/ansible/blob/main/f/roles/communishift
- [3] Cluster: https://console-openshift-console.apps.fedora.cj14.p1.openshiftapps.com
=== Add project name to variables
Members of `sysadmin-openshift` can run the playbook at [1]. It contains the list of Communishift projects. When onboarding, add the new project's name to the `communishift_projects` dictionary in `inventory/group_vars/all`.
If needed, resource quotas can be overridden from the defaults in the same dictionary. Attaching the `do_not_delete: true` variable to a project will prevent it from receiving notifications and from being cleaned up by the cleanup scripts.
=== Run the playbook to send notifications
Run the playbook [1] on the batcave in order to send notifications to project administrators.
----
sudo rbac-playbook manual/communishift_send_email_notifications.yml
----
=== Cleaning up projects
The system for actually deleting the projects is not automated. Please manually delete each one.
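For example, as a user with sufficient permissions on the cluster [3], for each project to be removed (the project name here is hypothetical):
----
oc delete project communishift-example
----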
=== Finally, update the all group vars with the remaining list of projects
Update the `communishift_projects` dictionary in `inventory/group_vars/all` to include only the remaining projects which were not removed as part of this process, eg:
----
communishift_projects:
  communishift-fedora-review-service:
    name: communishift-fedora-review-service
    do_not_delete: true # Marked do not delete 2024-10-21
  communishift-log-detective:
    name: communishift-log-detective
    do_not_delete: true # Marked do not delete 2024-10-21
    memory_requests: 4Gi
    memory_limits: 6Gi
    storage_requests: 10Gi
----
Please also disable the FAS group corresponding to each pruned project. It should match the name of the project listed in the `communishift_projects` dictionary in `inventory/group_vars/all`.


@@ -16,6 +16,20 @@ If needed, resource quotas can be overridden from defaults in the same dictionary
Note: Projects *must* start with `communishift-` eg `communishift-dev-test`.
See the following example of the `communishift-eventbot` project and the `communishift-fedora-review-service` project being added:
----
communishift_projects:
  communishift-eventbot:
    name: communishift-eventbot
  communishift-fedora-review-service:
    name: communishift-fedora-review-service
    do_not_delete: true # Marked do not delete 2024-10-21
...
----
NOTE: To mark a project as one which should _NOT_ be cleaned up as part of the Communishift clean up script, mark it with the boolean like so: `do_not_delete: true # Marked do not delete YYYY-MM-DD`. It is helpful to include the date so we can see at a glance when a project was granted this special status.
=== Add new project group to IPA
A group must be created in IPA which matches the name of the group added to the playbook in the previous step. Please ensure that the community member requesting access to the cluster is also added to this group in IPA, and made a sponsor. This way they can administer members in their group in a self-service fashion later.