Communishift: project deletion and notifications SOP

Openshift: WIP best practices

Signed-off-by: David Kirwan <davidkirwanirl@gmail.com>
David Kirwan 2024-12-11 13:20:33 +00:00
parent 25d3f58d7a
commit 5141624ed5
5 changed files with 265 additions and 0 deletions


@@ -12,6 +12,8 @@ If you've never used OpenShift before a good place to start is with
https://www.openshift.org/minishift/[MiniShift], which deploys OpenShift
Origin in a virtual machine.
See the following for some best practices: xref:openshift_bestpractices.adoc[Openshift Best Practices]
=== OpenShift in Fedora Infrastructure
Fedora has two OpenShift deployments:


@@ -0,0 +1,207 @@
== Fedora Infra Openshift Best Practices
This document aims to encourage the use of best practices related to application development and deployment of containerised applications on Kubernetes/Openshift.
NOTE: This is a large topic and we can't possibly cover every element in detail, but it should be enough to act as a primer.
NOTE: Should these best practices be maintained by the KubeDev SIG [4]? If so, should we attempt to resurrect it?
=== References/Resources/Further Reading
- [1] Fedora Infra Flock Hackfest: https://hackmd.io/HxpzTNpITfu0OYmOGRApiw
- [2] Kubernetes health checks: https://blog.kubecost.com/blog/kubernetes-health-check/
- [3] Prometheus metrics format: https://github.com/prometheus/docs/blob/main/content/docs/instrumenting/exposition_formats.md#text-based-format
- [4] Fedora KubeDev SIG: https://fedoraproject.org/wiki/SIGs/KubeDev
- [5] Openshift oauth-proxy: https://github.com/openshift/oauth-proxy
- [6] Fedora Infra migration tracker, DeploymentConfig to Deployment: https://pagure.io/fedora-infrastructure/issue/12142
- [7] Fedora Infra ticket tracker: https://pagure.io/fedora-infrastructure/issues
- [8] 42 Prod Best Practices, The Complete Guide for Developers: https://medium.com/@mahernaija/docker-2024-docker-compose-2024-master-best-practices-the-complete-guide-for-developers-aaf851349240
- [9] Semantic Versioning: https://semver.org/
- [10] Compute resource quotas: https://docs.openshift.com/container-platform/4.17/scalability_and_performance/compute-resource-quotas.html
- [11] Ansible Operator Tutorial: https://sdk.operatorframework.io/docs/building-operators/ansible/tutorial/
- [12] How pods with resource limits are run: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#how-pods-with-resource-limits-are-run
- [13] Enabling monitoring for user-defined projects: https://docs.openshift.com/container-platform/4.8/monitoring/enabling-monitoring-for-user-defined-projects.html
=== Fedora Infra Clusters
Fedora Infra currently manages the following three Openshift clusters:
- Staging (Self Hosted in IAD2, deploy apps via ansible): https://console-openshift-console.apps.ocp.stg.fedoraproject.org/
- Production (Self Hosted in IAD2, deploy apps via ansible): https://console-openshift-console.apps.ocp.fedoraproject.org/
- Communishift (RH Openshift Dedicated deployed in AWS, apps deployed by individual app maintainers in various ways): https://console-openshift-console.apps.fedora.cj14.p1.openshiftapps.com/
Access to the clusters is managed via the Fedora account system (FAS). All Fedora users may authenticate, but access to each project is managed on a per-app basis. Open a ticket at [7] requesting access to a particular app, but ensure you first get approval from the existing app owners.
=== Building containers
- Use Podman over Docker when developing locally.
- How are containers currently built and updated inside Fedora Infra? Since the retirement of OSBS, the builds aren't automated (iirc).
- Use a service to build the containers: Konflux? Imagebuilder? quay.io? The plan (iirc) is that we will use Konflux to do our container building going forward; we're starting off by looking at configuring the Konflux instance to build artifacts. (If you're interested in working on that, reach out to dkirwan to look at it together.)
- Don't consume an image built directly via a BuildConfig with S2I (source to image); instead:
-- Use Fedora as the base image! `quay.io/fedora/fedora:latest`
-- Build and push the built container image to a registry like quay.io.
-- If the application is an official app image, use the fedora namespace: `quay.io/fedora/appname`.
-- Inside Openshift create an ImageStream which points at `quay.io/fedora/appname:v1.0.0.releasename`.
-- For staging, you could possibly use `quay.io/fedora/appname:latest`.
-- When the image changes inside quay.io, the ImageStream will pull down the latest version of the image.
-- Applications should consume the container image via the ImageStream within a Deployment (see the sketch after this list).
-- This prevents problems which only surface during a build, such as missing dependencies.
-- Doing it this way prevents outages or service degradation, as the existing version remains operational should a build run into issues.
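A minimal sketch of this pattern, assuming a hypothetical `appname` image: the ImageStream tracks the tag in quay.io and re-imports it on a schedule, so a push to quay.io rolls out to consumers of the stream.
```
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: appname
spec:
  lookupPolicy:
    local: true # let Deployments in this namespace resolve the stream by name
  tags:
  - name: v1.0.0
    from:
      kind: DockerImage
      name: quay.io/fedora/appname:v1.0.0 # the image pushed to quay.io
    importPolicy:
      scheduled: true # periodically re-import, picking up new pushes to this tag
```
A Deployment can then consume the stream either by referencing `appname:v1.0.0` directly (thanks to `lookupPolicy.local`) or via the `image.openshift.io/triggers` annotation, which updates the pod template whenever the tag is re-imported.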
- Minimise the number of layers. Each `RUN`, `COPY`, and `ADD` instruction in the Containerfile/Dockerfile adds a new layer, which can quickly increase the build time and size of the final container. To combat this, make use of `&&` to chain commands together, so the whole chain counts as a single layer. eg:
```
FROM busybox
RUN echo This is the 1 > 1 \
    && rm -f 1 \
    && echo This is the 2 > 2 \
    && rm -f 2
# ... for about 70 commands, all within a single layer

# rather than one layer per RUN instruction:
FROM busybox
RUN echo This is the 1 > 1
RUN rm -f 1
RUN echo This is the 2 > 2
RUN rm -f 2
# ... for about 70 commands
```
- Use specific build tags eg: `v1.0.2` which follow semantic versioning [9].
- Limit container privileges. By default, containers which run as root cannot run in Openshift: unless the ServiceAccount has been granted elevated privileges (eg: via a security context constraint such as `anyuid`), the container will not start. If you need root access, don't run that part of the application in Openshift at all (if possible).
=== ImageStream
- Changes to the image which an ImageStream points to will automatically cause a roll out of the applications which use that ImageStream.
- This means a single change to the base Fedora image can roll out all applications on the clusters onto the latest image.
=== Handling Dependencies
- All application dependencies should be version pinned and locked within the container to aid reproducible builds. Make use of a dependency management system as per the language's best practices; see the sketch after this list.
- If a container image is vital to Fedora, perhaps its dependencies could also be stored in a local pip/gem/nodejs/rpm repository so that it can always be rebuilt?
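For example, for a Python app (the package names and versions here are purely illustrative), exact pins in `requirements.txt` consumed during the container build:
```
# requirements.txt -- exact pins, so every build resolves the same dependencies
flask==3.0.3
requests==2.32.3

# In the Containerfile -- install only the pinned versions
RUN pip install --no-cache-dir -r requirements.txt
```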
=== DeploymentConfig migration to Deployments
- DeploymentConfig is deprecated and is being phased out (very) soonish; we should replace all DeploymentConfigs with Deployments (see the sketch after this list).
- This is being tracked with a board on pagure: https://pagure.io/fedora-infrastructure/issue/12142
- We should consider breaking this epic up into smaller tickets, creating an individual ticket for each DeploymentConfig-deployed app in Fedora Infra.
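As a rough sketch of what the migration involves (the app name is hypothetical; trigger and strategy details vary per app):
```
# Before: a DeploymentConfig
apiVersion: apps.openshift.io/v1
kind: DeploymentConfig
spec:
  selector:
    app: darwin-app      # plain map selector
  strategy:
    type: Rolling
  triggers:
  - type: ConfigChange   # DC-only; image triggers move to the
  - type: ImageChange    # image.openshift.io/triggers annotation

# After: a Deployment
apiVersion: apps/v1
kind: Deployment
spec:
  selector:
    matchLabels:         # selector gains the matchLabels wrapper
      app: darwin-app
  strategy:
    type: RollingUpdate  # "Rolling" becomes "RollingUpdate"
```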
=== Container security scanning
- Consider deploying ACS (Red Hat Advanced Cluster Security), which looks inside containers and reports what is in them and what the security issues are.
- The registry quay.io has such features already, perhaps use this instead? One less service we need to run and maintain.
=== Security
- Do secure access to applications using something like the `oauth-proxy` [5], especially if working with user data.
- When hosting an app within Openshift, the oauth-proxy may be a better way to secure the app than application-level systems like flask-oidc; see the sketch after this list.
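A minimal sketch of the sidecar pattern from the oauth-proxy README [5]; the image tag, service account name, ports, and certificate/secret paths below are illustrative assumptions:
```
containers:
- name: oauth-proxy
  image: quay.io/openshift/origin-oauth-proxy:latest # assumption: pick a suitable image/tag
  args:
  - --provider=openshift
  - --openshift-service-account=darwin-sa # SA annotated with the OAuth redirect reference
  - --upstream=http://localhost:8080      # the app container in the same pod
  - --https-address=:8443
  - --tls-cert=/etc/tls/private/tls.crt
  - --tls-key=/etc/tls/private/tls.key
  - --cookie-secret-file=/etc/proxy/secrets/session_secret
  ports:
  - containerPort: 8443
- name: darwin-container
  image: quay.io/fedora/appname:v1.0.0
  ports:
  - containerPort: 8080 # only reachable via the proxy
```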
=== Monitoring applications
- Do expose endpoints in the application to aid in monitoring [2].
-- Liveness probes to detect a non-responsive application
-- Readiness probes to ensure that a service is ready to receive traffic
-- Startup probes for identifying slow application startup and delaying traffic until it is prepared to handle requests
- Do expose a metrics endpoint in the application to display current metrics indicating health and application metadata, eg: `/metrics` in the Prometheus text format [3]. See the example:
```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: darwin-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: darwin-app
  template:
    metadata:
      labels:
        app: darwin-app
    spec:
      containers:
      - name: darwin-container
        image: nginx:latest
        ports:
        - containerPort: 443
        readinessProbe:
          httpGet:
            path: /darwin-probes
            port: 443
            scheme: HTTPS
          initialDelaySeconds: 5
          periodSeconds: 10
          timeoutSeconds: 5
        livenessProbe:
          httpGet:
            path: /healthz
            port: 443
            scheme: HTTPS
          initialDelaySeconds: 5
          periodSeconds: 10
          timeoutSeconds: 5
        startupProbe:
          httpGet:
            path: /healthz
            port: 443
            scheme: HTTPS
          failureThreshold: 30
          periodSeconds: 10
```
- Use the monitoring stack for user-defined projects [13] (the released name of the user workload monitoring stack, now that it is out of tech preview).
- Hook into the Openshift monitoring stack [13] and then use Prometheus exporters to push metrics and alerts to Zabbix, maybe (blocked until Zabbix is more widely used within Fedora Infra). See the sketch after this list.
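Once monitoring for user-defined projects is enabled [13], a `ServiceMonitor` tells the stack to scrape the app's metrics endpoint. A minimal sketch (the names and port are assumptions matching the examples in this document):
```
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: darwin-app
spec:
  selector:
    matchLabels:
      app: darwin-app # matches the labels on the app's Service
  endpoints:
  - port: web         # named port on the Service exposing the app
    path: /metrics    # the Prometheus-format endpoint [3]
    interval: 30s
```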
=== Preferred source control managers
- GitHub
- Pagure
- GitLab
- Forgejo
=== Preferred methods of deploying applications within Fedora Infra
- Fedora Infra uses ansible playbooks/roles as the primary means to deploy applications.
- An ansible role should be developed to deploy the app within Fedora Infra; see the sketch after this list.
- Private variables should be stored in the ansible-private repo.
- Ensure sane defaults are available within the `defaults` directory of the role.
- An alternative is to develop a Helm chart or an Ansible-based Kubernetes Operator [11] to do this work.
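A minimal sketch of such a role, assuming the `kubernetes.core` collection is available on the control host and a hypothetical `darwin-app` role layout:
```
# roles/darwin-app/defaults/main.yml -- sane defaults, overridable per environment
darwin_app_namespace: darwin-app
darwin_app_replicas: 1

# roles/darwin-app/tasks/main.yml -- apply the templated manifests to the cluster
- name: Deploy the darwin-app Deployment
  kubernetes.core.k8s:
    state: present
    namespace: "{{ darwin_app_namespace }}"
    definition: "{{ lookup('template', 'deployment.yml.j2') | from_yaml }}"
```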
=== Limits, requests
- When deploying the application, ensure you add resource requests and limits to the Deployment [10]; see the example:
```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: darwin-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: darwin-app
  template:
    metadata:
      labels:
        app: darwin-app
    spec:
      containers:
      - name: darwin-container
        image: nginx:latest
        ports:
        - containerPort: 443
        resources:
          requests:
            cpu: 100m # millicores
            memory: 100Mi
            ephemeral-storage: "2Gi"
          limits:
            cpu: 1
            memory: 200Mi
            ephemeral-storage: "4Gi"
...
```
- Do set limits
- Do set resource requests
- If a Container exceeds its memory limit, it will probably be terminated.
- If a Container exceeds its memory request, it is likely that its Pod will be evicted whenever the node runs out of memory.
=== Scaling
- When designing an app ensure the following:
-- It is capable of recovering from a restart/crash (eg: killed, moved, and/or crashed containers).
-- Build the ability to scale the app into the architecture and design, eg: multiple instances behind a load balancer; in a Kubernetes Deployment this is the `replicas` field. See the sketch after this list.
-- Ensure the app includes high availability in its design, eg: 3 instances, so the application stays up even if one instance is down.
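Where the app is stateless, a HorizontalPodAutoscaler can manage the replica count automatically. A minimal sketch targeting the example Deployment above (the thresholds are assumptions):
```
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: darwin-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: darwin-app
  minReplicas: 3 # keep 3 instances for high availability
  maxReplicas: 6
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80 # scale up when average CPU passes 80% of requests
```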


@@ -9,4 +9,5 @@ The following SOPs are related to the administration of the Communishift Cluster
- xref:sop_communishift_onboard_tenant.adoc[Onboarding a Communishift tenant]
- xref:sop_communishift_tenant_quota.adoc[Configuring the Resourcequota for a tenant]
- xref:sop_communishift_create_sharedvolume.adoc[Create the SharedVolume object which manages tenant storage]
- xref:sop_communishift_cleanup_script.adoc[Run the Communishift Clean Up Script]


@@ -0,0 +1,41 @@
= Run the Communishift Clean Up Script
== Resources
- [1] Playbook: https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/manual/communishift_send_email_notifications.yml
- [2] Role: https://pagure.io/fedora-infra/ansible/blob/main/f/roles/communishift
- [3] Cluster: https://console-openshift-console.apps.fedora.cj14.p1.openshiftapps.com
=== Add project name to variables
Members of `sysadmin-openshift` can run the playbook at [1]. It contains the list of Communishift projects. When onboarding, add the new project's name to the `communishift_projects` dictionary in `inventory/group_vars/all`.
If needed, resource quotas can be overridden from the defaults in the same dictionary. Attaching the `do_not_delete: true` variable to a project will prevent it from receiving notifications and from being cleaned up by the cleanup scripts.
=== Run the playbook to send notifications
Run the playbook [1] on the batcave in order to send notifications to project administrators.
----
sudo rbac-playbook manual/communishift_send_email_notifications.yml
----
=== Cleaning up projects
The system for actually deleting the projects is not automated. Please manually delete each one.
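For example, as a user with sufficient permissions on the cluster [3], for each project to be removed (the project name here is hypothetical):
----
oc delete project communishift-example
----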
=== Finally, update the all group vars with the remaining list of projects
Update the `communishift_projects` dictionary in `inventory/group_vars/all` to include only the remaining projects which were not removed as part of this process, eg:
----
communishift_projects:
  communishift-fedora-review-service:
    name: communishift-fedora-review-service
    do_not_delete: true # Marked do not delete 2024-10-21
  communishift-log-detective:
    name: communishift-log-detective
    do_not_delete: true # Marked do not delete 2024-10-21
    memory_requests: 4Gi
    memory_limits: 6Gi
    storage_requests: 10Gi
----
Please also disable the FAS group corresponding to each pruned project. It should match the name of the project listed in the `communishift_projects` dictionary in `inventory/group_vars/all`.


@@ -16,6 +16,20 @@ If needed, resource quotas can be overridden from defaults in the same dictionary
Note: Projects *must* start with `communishift-` eg `communishift-dev-test`.
See the following example of the `communishift-eventbot` project and the `communishift-fedora-review-service` project being added:
----
communishift_projects:
  communishift-eventbot:
    name: communishift-eventbot
  communishift-fedora-review-service:
    name: communishift-fedora-review-service
    do_not_delete: true # Marked do not delete 2024-10-21
...
----
NOTE: To mark a project as one which should _NOT_ be cleaned up as part of the Communishift clean up script, mark it with the boolean like so: `do_not_delete: true # Marked do not delete YYYY-MM-DD`. It is helpful to include the date so we can see at a glance when a project was granted this special status.
=== Add new project group to IPA
A group must be created in IPA which matches the name of the group added to the playbook in the previous step. Please ensure that the community member requesting access to the cluster is also added to this group in IPA, and made a sponsor. This way they can administer members in their group in a self-service fashion later.