ocp4 sops moved into sysadmin_guide
Signed-off-by: David Kirwan <davidkirwanirl@gmail.com>
This commit is contained in:
parent 1d17fd8610
commit c0d6947dba
35 changed files with 1 additions and 1 deletions

@@ -151,7 +151,7 @@ xref:developer_guide:sops.adoc[Developing Standard Operating Procedures].
 * xref:netapp.adoc[Netapp Infrastructure]
 * xref:new-virtual-hosts.adoc[Virtual Host Addition]
 * xref:nonhumanaccounts.adoc[Non-human Accounts Infrastructure]
-* xref:ocp4:sops.adoc[Openshift SOPs]
+* xref:openshift_sops.adoc[Openshift SOPs]
 * xref:odcs.adoc[On Demand Compose Service]
 * xref:openqa.adoc[OpenQA Infrastructure]
 * xref:openvpn.adoc[OpenVPN]
modules/sysadmin_guide/pages/openshift_sops.adoc  (new file, 24 lines)
@@ -0,0 +1,24 @@
= SOPs

- xref:sop_configure_baremetal_pxe_uefi_boot.adoc[SOP Configure Baremetal PXE-UEFI Boot]
- xref:sop_configure_image_registry_operator.adoc[SOP Configure the Image Registry Operator]
- xref:sop_configure_local_storage_operator.adoc[SOP Configure the Local Storage Operator]
- xref:sop_configure_oauth_ipa.adoc[SOP Configure oauth Authentication via IPA/Noggin]
- xref:sop_configure_openshift_container_storage.adoc[SOP Configure the Openshift Container Storage Operator]
- xref:sop_configure_userworkload_monitoring_stack.adoc[SOP Configure the Userworkload Monitoring Stack]
- xref:sop_cordoning_nodes_and_draining_pods.adoc[SOP Cordoning and Draining Nodes]
- xref:sop_create_machineconfigs.adoc[SOP Create MachineConfigs to Configure RHCOS]
- xref:sop_disable_provisioners_role.adoc[SOP Disable the Provisioners Role]
- xref:sop_graceful_shutdown_ocp_cluster.adoc[SOP Graceful Cluster Shutdown]
- xref:sop_graceful_startup_ocp_cluster.adoc[SOP Graceful Cluster Startup]
- xref:sop_installation.adoc[SOP Openshift 4 Installation on Fedora Infra]
- xref:sop_retrieve_ocp4_cacert.adoc[SOP Retrieve OCP4 CACERT]
- xref:sop_upgrade.adoc[SOP Upgrade OCP4 Cluster]
- xref:sop_etcd_backup.adoc[SOP Create etcd backup]
- xref:sop_configure_openshift_virtualization_operator.adoc[SOP Configure the Openshift Virtualization Operator]
- xref:sop_add_node.adoc[SOP Add an OCP4 Node to an Existing Cluster]
- xref:sop_add_odf_storage.adoc[SOP Add new capacity to the OCP4 ODF Storage Cluster]
- xref:sop_velero.adoc[SOP Velero]
- xref:sop_aws_efs_operator.adoc[SOP AWS EFS Operator]
- xref:sop_communishift.adoc[SOP Communishift Cluster Administration]
- xref:sop_fas2discourse_operator.adoc[SOP fas2discourse operator]
modules/sysadmin_guide/pages/sop_add_node.adoc  (new file, 133 lines)
@@ -0,0 +1,133 @@
= SOP Add an OCP4 Node to an Existing Cluster
This SOP should be used in the following scenario:

- A Red Hat OpenShift Container Platform 4.x cluster was installed some time ago (1+ days) and additional worker nodes are required to increase the capacity of the cluster.


== Resources
- [1] https://access.redhat.com/solutions/4246261[How to add OpenShift 4 RHCOS worker nodes in UPI within the first 24 hours]
- [2] https://access.redhat.com/solutions/4799921[How to add OpenShift 4 RHCOS worker nodes to UPI after the first 24 hours]
- [3] https://docs.openshift.com/container-platform/4.8/post_installation_configuration/node-tasks.html[Adding RHCOS worker nodes]


== Steps
1. Add the new nodes to the Ansible inventory file in the appropriate group.
+
e.g.:
+
----
[ocp_workers]
worker01.ocp.iad2.fedoraproject.org
worker02.ocp.iad2.fedoraproject.org
worker03.ocp.iad2.fedoraproject.org


[ocp_workers_stg]
worker01.ocp.stg.iad2.fedoraproject.org
worker02.ocp.stg.iad2.fedoraproject.org
worker03.ocp.stg.iad2.fedoraproject.org
worker04.ocp.stg.iad2.fedoraproject.org
worker05.ocp.stg.iad2.fedoraproject.org
----

2. Add the hostvars for each new host being added; see the following examples for `VM` vs `baremetal` hosts.
+
----
# control plane VM
inventory/host_vars/ocp01.ocp.iad2.fedoraproject.org

# compute baremetal
inventory/host_vars/worker01.ocp.iad2.fedoraproject.org
----

3. If the nodes are `compute`/`worker` nodes, they must also be added to the following group_vars: `proxies` for prod, `proxies_stg` for staging.
+
----
inventory/group_vars/proxies:ocp_nodes:
inventory/group_vars/proxies_stg:ocp_nodes_stg:
----

4. Changes must be made to the `roles/dhcp_server/files/dhcpd.conf.noc01.iad2.fedoraproject.org` file for DHCP, to ensure that the node will receive an IP address based on its MAC address and is told to reach out to the `next-server`, where it can find the UEFI boot configuration.
+
----
host worker01-ocp {                        # UPDATE THIS
  hardware ethernet 68:05:CA:CE:A3:C9;     # UPDATE THIS
  fixed-address 10.3.163.123;              # UPDATE THIS
  filename "uefi/grubx64.efi";
  next-server 10.3.163.10;
  option routers 10.3.163.254;
  option subnet-mask 255.255.255.0;
}
----

5. Changes must be made to DNS. To do this one must be a member of `sysadmin-main`; if you are not, send a patch request to the Fedora Infra mailing list for review, and it will be merged by the `sysadmin-main` members.
+
See the following examples for the `worker01.ocp` nodes for production and staging.
+
----
master/163.3.10.in-addr.arpa:123 IN PTR worker01.ocp.iad2.fedoraproject.org.
master/166.3.10.in-addr.arpa:118 IN PTR worker01.ocp.stg.iad2.fedoraproject.org.
master/iad2.fedoraproject.org:worker01.ocp IN A 10.3.163.123
master/stg.iad2.fedoraproject.org:worker01.ocp IN A 10.3.166.118
----

6. Run the playbooks to update the haproxy config to monitor the new nodes and add them to the load balancer.
+
----
sudo rbac-playbook groups/noc.yml -t "tftp_server,dhcp_server"
sudo rbac-playbook groups/proxies.yml -t 'haproxy,httpd'
----

7. DHCP instructs the node to reach out to the `next-server` when it is handed an IP address. The `next-server` runs a tftp server which contains the kernel, initramfs and UEFI boot configuration (`uefi/grub.cfg`). The following entries in this `grub.cfg` relate to the OCP4 nodes:
+
----
menuentry 'RHCOS 4.8 worker staging' {
    linuxefi images/RHCOS/4.8/x86_64/rhcos-4.8.2-x86_64-live-kernel-x86_64 ip=dhcp nameserver=10.3.163.33 coreos.inst.install_dev=/dev/sda coreos.live.rootfs_url=http://10.3.166.50/rhcos/rhcos-4.8.2-x86_64-live-rootfs.x86_64.img coreos.inst.ignition_url=http://10.3.166.50/rhcos/worker.ign
    initrdefi images/RHCOS/4.8/x86_64/rhcos-4.8.2-x86_64-live-initramfs.x86_64.img
}
menuentry 'RHCOS 4.8 worker production' {
    linuxefi images/RHCOS/4.8/x86_64/rhcos-4.8.2-x86_64-live-kernel-x86_64 ip=dhcp nameserver=10.3.163.33 coreos.inst.install_dev=/dev/sda coreos.live.rootfs_url=http://10.3.163.65/rhcos/rhcos-4.8.2-x86_64-live-rootfs.x86_64.img coreos.inst.ignition_url=http://10.3.163.65/rhcos/worker.ign
    initrdefi images/RHCOS/4.8/x86_64/rhcos-4.8.2-x86_64-live-initramfs.x86_64.img
}
----
+
When a node boots and reads this UEFI boot configuration, the menu option must be selected manually:
+
- To add a node to the staging cluster choose: `RHCOS 4.8 worker staging`
- To add a node to the production cluster choose: `RHCOS 4.8 worker production`

8. Connect to the `os-control01` node which corresponds with the ENV the new node is being added to.
+
Verify that you are authenticated correctly to the OpenShift cluster as the `system:admin` user.
+
----
oc whoami
system:admin
----

9. Contained within the UEFI boot menu configuration are links to the web server running on the `os-control01` host specific to the ENV. This server should only run when we wish to reinstall an existing node or install a new node. Start it manually using systemctl:
+
----
systemctl start httpd.service
----

10. Boot up the node and select the appropriate menu entry to install the node into the correct cluster.
Wait until the node displays an SSH login prompt with the node's name. It may reboot several times during the process.

11. As the new nodes are provisioned, they will attempt to join the cluster. They must first be accepted.
From the `os-control01` node run the following:
+
----
# List the CSRs. Any with status Pending are worker/compute nodes attempting to join the cluster; they must be approved.
oc get csr

# Accept all pending node CSRs in one go
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
----
+
This process usually needs to be repeated twice for each new node.

To see more information about adding new worker/compute nodes to a user-provisioned infrastructure (UPI) based OCP4 cluster, see the detailed steps at [1], [2].
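
Once the CSRs have been approved, a quick check with standard `oc` commands confirms the new workers have registered and eventually report `Ready` (a minimal sketch; substitute the hostname of the node you just added):

----
# The new workers should appear in the node list
oc get nodes -o wide

# Watch a specific node come up (Ctrl-C to stop)
oc get node worker01.ocp.iad2.fedoraproject.org -w
----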
modules/sysadmin_guide/pages/sop_add_odf_storage.adoc  (new file, 98 lines)
@@ -0,0 +1,98 @@
= SOP Add new capacity to the OCP4 ODF Storage Cluster
This SOP should be used in the following scenario:

- A Red Hat OpenShift Container Platform 4.x cluster has been installed
- Additional worker nodes are being added to increase the capacity of the cluster
- These additional worker nodes have storage resources which we wish to add to the Openshift Data Foundation Storage Cluster
- We are adding enough storage to meet the minimum of 3 replicas, e.g. 3 nodes, or enough storage devices that the number is divisible by 3

== Resources
- [1] https://access.redhat.com/solutions/4246261[How to add OpenShift 4 RHCOS worker nodes in UPI within the first 24 hours]
- [2] https://access.redhat.com/solutions/4799921[How to add OpenShift 4 RHCOS worker nodes to UPI after the first 24 hours]
- [3] https://docs.openshift.com/container-platform/4.8/post_installation_configuration/node-tasks.html[Adding RHCOS worker nodes]
- [4] https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.9[Openshift Data Foundation Product Notes]

== Steps

1. Once a new node has been added to the Openshift cluster, we can manage the extra local storage devices on this node from within Openshift itself, provided that they do not contain partitions/filesystems. In the case of a node being repurposed, please first ensure that all storage devices except `/dev/sda` are partition and filesystem free before starting, as in the sketch below.
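+
A minimal sketch for checking and clearing a spare device from the node itself. This assumes `/dev/sdb` is the disk being handed over to ODF; double-check the device name before wiping anything:
+
----
# Show existing partitions/filesystems on the node's disks
lsblk -f

# Remove any leftover filesystem/partition signatures from the spare disk
wipefs -a /dev/sdb
----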

2. From within the Openshift web console, or via the CLI, search for all "LocalVolumeDiscovery" objects.
+
----
[root@os-control01 ~][PROD-IAD2]# oc get localvolumediscovery --all-namespaces
NAMESPACE                 NAME                    AGE
openshift-local-storage   auto-discover-devices   167d
----
+
There should be only a single LocalVolumeDiscovery object, called `auto-discover-devices`, in the `openshift-local-storage` namespace/project.
+
Edit this object:
+
----
oc edit localvolumediscovery auto-discover-devices -n openshift-local-storage
----
+
Add the hostname for the new node to the list that is already there, like:
+
----
...
spec:
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: kubernetes.io/hostname
        operator: In
        values:
        - worker01.ocp.iad2.fedoraproject.org
        - worker02.ocp.iad2.fedoraproject.org
        - worker03.ocp.iad2.fedoraproject.org
        - worker04.ocp.iad2.fedoraproject.org
        - worker05.ocp.iad2.fedoraproject.org
...
----
+
Write and save the change.

3. From within the Openshift web console, or via the CLI, search for all "LocalVolumeSet" objects.
+
There should be only a single LocalVolumeSet object, called `local-block`, in the `openshift-local-storage` namespace/project.
+
Edit this object:
+
----
oc edit localvolumeset local-block -n openshift-local-storage
----
+
Add the hostname for the new node to the list that is already there, like:
+
----
...
spec:
  ...
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: kubernetes.io/hostname
        operator: In
        values:
        - worker01.ocp.iad2.fedoraproject.org
        - worker02.ocp.iad2.fedoraproject.org
        - worker03.ocp.iad2.fedoraproject.org
        - worker04.ocp.iad2.fedoraproject.org
        - worker05.ocp.iad2.fedoraproject.org
...
----
+
Write and save the change.

4. Add the `cluster.ocs.openshift.io/openshift-storage` label to the new node:
+
----
oc label no worker05.ocp.iad2.fedoraproject.org cluster.ocs.openshift.io/openshift-storage=''
----

5. From the Openshift web console visit `Storage, OpenShift Data Foundation`, then in the `Storage Systems` sub menu click the 3-dot menu on the right beside the `ocs-storagecluster-storage` object and choose the `Add Capacity` option. From the popup menu that appears, ensure that the storage class `local-block` is selected in the list. Finally confirm with Add.

.Note:
For best results, only perform this step once, and only after all nodes have been added to the cluster.
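
After the capacity has been added, the state of the storage cluster can be checked from the CLI (standard commands; the `openshift-storage` namespace is assumed, matching the usual OCS/ODF install):

----
oc get storagecluster -n openshift-storage
oc get cephcluster -n openshift-storage
oc get pv | grep local-block
----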
modules/sysadmin_guide/pages/sop_aws_efs_operator.adoc  (new file, 23 lines)
@@ -0,0 +1,23 @@
= Configure the AWS EFS Operator

== Resources
- [1] https://github.com/openshift/aws-efs-operator
- [2] https://access.redhat.com/articles/5025181


== Installation
For installation instructions visit the official docs at [1], [2].

- From the web console, click on the `Operators` option, then `OperatorHub`
- Search for `efs`
- Click the operator named `AWS EFS Operator`
- Click install
- Make sure the `Update Channel` matches `stable`
- Ensure that the `Update Approval` is set to automatic

== Configuration

- No configuration is required for this operator.
- Users must be given permissions to perform CRUD operations on `SharedVolume` objects (see the sketch after this list).
- Using the Fedora Ansible role `communishift` to create the AWS EFS filesystems and access points will also create a Secret containing these details in the user's namespace.
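
As a sketch of the second point, tenant permissions on `SharedVolume` objects can be granted with a namespaced Role and RoleBinding along these lines. The role, namespace and group names here are illustrative, not the ones used in Fedora Infra; only the API group and resource name come from the operator itself:

----
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: sharedvolume-editor          # illustrative name
  namespace: communishift-dev-test   # the tenant's namespace
rules:
- apiGroups: ["aws-efs.managed.openshift.io"]
  resources: ["sharedvolumes"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: sharedvolume-editor          # illustrative name
  namespace: communishift-dev-test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: sharedvolume-editor
subjects:
- kind: Group
  apiGroup: rbac.authorization.k8s.io
  name: communishift-dev-test        # illustrative tenant group
----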
modules/sysadmin_guide/pages/sop_communishift.adoc  (new file, 12 lines)
@@ -0,0 +1,12 @@
= Communishift
The following SOPs are related to the administration of the Communishift Cluster.

== Resources
- https://console-openshift-console.apps.fedora.cj14.p1.openshiftapps.com[Cluster]
- xref:sop_communishift_authorization_operator.adoc[Install the CommunishiftAuthorization operator]
- xref:sop_communishift_authorization_operator_testing.adoc[Testing the CommunishiftAuthorization operator]
- xref:sop_communishift_authorization_operator_build.adoc[Building/releasing the CommunishiftAuthorization operator]
- xref:sop_communishift_onboard_tenant.adoc[Onboarding a Communishift tenant]
- xref:sop_communishift_tenant_quota.adoc[Configuring the Resourcequota for a tenant]
- xref:sop_communishift_create_sharedvolume.adoc[Create the SharedVolume object which manages tenant storage]

@@ -0,0 +1,26 @@
= Configure the CommunishiftAuthorization Operator

== Resources
- [1] Code: https://pagure.io/cpe/communishift/blob/main/f/CommunishiftAuthorization

== Installation
There is a Makefile bundled with the code [1] of this operator.

To install the operator:

- From a terminal, be logged into the Communishift cluster with cluster-admin privileges.
- Create a project `communishift-authorization-operator`
- Run `make deploy`

To activate the operator we need to create a `CommunishiftAuthorization` custom resource. An example of one exists in `CommunishiftAuthorization/config/samples/_v1alpha1_communishiftauthorization.yaml`.

Create it with the following:

----
oc apply -f CommunishiftAuthorization/config/samples/_v1alpha1_communishiftauthorization.yaml
----
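
To confirm the operator came up after `make deploy`, list the workloads in its namespace (a minimal check; the controller pod name is generated, so just list the whole namespace). If the custom resource appears to have no effect, the controller pod's logs are the first place to look:

----
oc get pods -n communishift-authorization-operator
----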


== Configuration

- No other configuration is required for this operator.

@@ -0,0 +1,23 @@
= Build/release the CommunishiftAuthorization Operator

== Resources
- [1] Code: https://pagure.io/cpe/communishift/blob/main/f/CommunishiftAuthorization
- [2] Quay: https://quay.io/repository/fedora/communishift-authorization-operator

== Build and release
To build the operator and tag it with version `v0.0.30` as an example:

- First ensure that you are logged into quay.io and have access to the repository at [2].
- Check out the code at [1], and change directory into the `CommunishiftAuthorization` directory.
- Update the version mentioned in the Deployment for the operator at `config/manager/manager.yml`

----
podman build -t quay.io/fedora/communishift-authorization-operator:v0.0.30 .
----

Then push the operator to quay.io with the following:

----
podman push quay.io/fedora/communishift-authorization-operator:v0.0.30
----
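
If the push is rejected with an authentication error, log in to quay.io first with an account that has write access to [2] (standard podman usage):

----
podman login quay.io
----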
@@ -0,0 +1,14 @@
= Test the CommunishiftAuthorization Operator

== Resources
- [1] Code: https://pagure.io/cpe/communishift/blob/main/f/CommunishiftAuthorization
- [2] Molecule: https://molecule.readthedocs.io/en/latest/

== Running the tests
There is a molecule directory bundled with the code [1] of this operator. The tests are currently designed to run only against the Communishift cluster itself, as they need access to secrets for the keytab used to authenticate against fasjson.

To run the operator molecule tests:

- Ensure that the molecule utility is installed: `dnf install python3-molecule`
- From a terminal, be logged into the Communishift cluster with cluster-admin privileges.
- Run `molecule test`
@@ -0,0 +1,78 @@
= Create SharedVolume

== Resources
- [1] AWS EFS Operator: https://github.com/openshift/aws-efs-operator
- [2] AWS EFS Operator Installation/Configuration: https://access.redhat.com/articles/5025181

=== Creating the SharedVolume
The `communishift` ansible role will create the AWS EFS filesystem and access point, and then create a Secret called `communishift-project-name-efs-credentials` in the tenant's project. The structure of the secret is as follows:

----
data:
  efs_filesystem_id: "fs-xxxxxxxxxx"
  efs_accesspoint_id: "fsap-xxxxxxxx"
----

The values are base64 encoded; to retrieve them do the following:

----
oc get secret communishift-project-name-efs-credentials -o jsonpath="{.data['efs_accesspoint_id']}" | base64 -d
oc get secret communishift-project-name-efs-credentials -o jsonpath="{.data['efs_filesystem_id']}" | base64 -d
----

Next create a yaml file and populate the values for the `accessPointID` and the `fileSystemID`.

----
apiVersion: aws-efs.managed.openshift.io/v1alpha1
kind: SharedVolume
metadata:
  name: PROJECTNAME-sharedvolume
  namespace: PROJECTNAME
spec:
  accessPointID: fsap-xxxxx
  fileSystemID: fs-xxxxx
----

Then create the `SharedVolume` object:

----
oc apply -f project-name-sharedvolume.yml
----

Once created, the AWS EFS Operator should automatically create a PersistentVolume, then a PersistentVolumeClaim in the project namespace. Tenants can then mount this volume as normal.

The following Pod definition may be used to verify the storage is working correctly.

----
apiVersion: v1
kind: Pod
metadata:
  name: volume-test
  namespace: communishift-dev-test
spec:
  securityContext:
    runAsUser: 1001
    runAsGroup: 1001
    fsGroup: 1001
    fsGroupChangePolicy: "OnRootMismatch"
  serviceAccount: volume-test
  restartPolicy: Always
  volumes:
    - name: test-volume
      persistentVolumeClaim:
        claimName: pvc-communishift-dev-test-sharedvolume
  containers:
    - image: quay.io/operator-framework/ansible-operator:v1.23.0
      command:
        - /bin/sh
        - "-c"
        - "sleep 60m"
      imagePullPolicy: IfNotPresent
      name: alpine
      volumeMounts:
        - name: test-volume
          mountPath: /tmp/volume_test
      resources:
        requests:
          memory: "2Gi"
----
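
Once the pod above is Running, a simple write/read round trip through the mount is a reasonable smoke test (standard `oc exec`; the names match the Pod definition above):

----
oc exec -n communishift-dev-test volume-test -- sh -c 'echo ok > /tmp/volume_test/smoke-test && cat /tmp/volume_test/smoke-test'
----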
@@ -0,0 +1,51 @@
= Onboard a tenant to the Communishift Cluster

== Resources
- [1] Playbook: https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/manual/communishift.yml
- [2] Role: https://pagure.io/fedora-infra/ansible/blob/main/f/roles/communishift
- [3] Cluster: https://console-openshift-console.apps.fedora.cj14.p1.openshiftapps.com
- [4] CAO: https://pagure.io/cpe/communishift/blob/main/f/CommunishiftAuthorization

== Onboarding
To onboard a tenant, perform the following steps:


=== Add project name to Playbook
Members of `sysadmin-openshift` can run the playbook at [1]. It contains the list of communishift projects. When onboarding, add the name of the new project to the `communishift_projects` dictionary in `inventory/group_vars/all`.
If needed, resource quotas can be overridden from the defaults in the same dictionary.

Note: Projects *must* start with `communishift-`, e.g. `communishift-dev-test`.


=== Add new project group to IPA
A group must be created in IPA which matches the project name added to the playbook in the previous step. Please ensure that the community member requesting access to the cluster is also added to this group in IPA, and made a sponsor. This way they can administer members of their group in a self-service fashion later.


=== Run the playbook
Run the playbook on the batcave.

----
sudo rbac-playbook manual/communishift.yml
----

This will create the project, create the EFS storage in AWS, and then create a Secret in the project which contains the credentials needed to create a `SharedVolume` object.

e.g.:

----
apiVersion: aws-efs.managed.openshift.io/v1alpha1
kind: SharedVolume
metadata:
  name: communishift-dev-test-sharedvolume
  namespace: communishift-dev-test
spec:
  accessPointID: fsap-xxxxx
  fileSystemID: fs-xxxx
----

This also applies a ResourceQuota to the project, which sets an upper limit on the amount of resources that may be consumed within it. It is low on purpose and can be changed later on an individual basis based on the tenant's needs.
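
The quota that was applied can be inspected afterwards (a quick check, using the `communishift-dev-test` example project):

----
oc get resourcequota -n communishift-dev-test
oc describe resourcequota -n communishift-dev-test
----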


=== Authorizing the project members to access the cluster
The CommunishiftAuthorization operator [4] handles granting permissions to access the cluster, based on the group name being present in IPA. Every 20 minutes the CAO retrieves the list of groups matching the `communishift-*` pattern from IPA via fasjson, ensures each group exists in Openshift, and synchronises the users between the IPA group and Openshift. This process is automatic.
@@ -0,0 +1,26 @@
= Configure the tenant ResourceQuota

== Resources
- [1] ResourceQuota Openshift Docs: https://docs.openshift.com/container-platform/4.11/applications/quotas/quotas-setting-per-project.html


=== Config
The ResourceQuota is contained within the tenant's namespace and is named like `communishift-project-name-quota`.

By default the following quota is assigned:

----
spec:
  hard:
    cpu: "1"          # requests.cpu
    memory: "1Gi"     # requests.memory
    limits.cpu: "1"
    limits.memory: "2Gi"
    requests.storage: "5Gi"
    persistentvolumeclaims: "1"
    pods: "2"
    replicationcontrollers: 1
----

This object can be modified in order to increase or restrict the resources available to a tenant after the fact. Refer to the official docs for instructions [1].
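
As a sketch, individual limits can also be raised with `oc patch` rather than editing the whole object (the project name and values below are illustrative):

----
oc patch resourcequota communishift-dev-test-quota -n communishift-dev-test \
  --type merge -p '{"spec":{"hard":{"pods":"10","limits.memory":"4Gi"}}}'
----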
@@ -0,0 +1,40 @@
= Configure Baremetal PXE-UEFI Boot
A high level overview of how a baremetal node in the Fedora Infra gets booted via UEFI is as follows:

- Server is powered on
- Gets an IP via DHCP
- The DHCP server uses the `next-server` option to point the server at the tftpboot server, from which it retrieves `grub.cfg`
- tftpboot serves `grub.cfg`
- A sysadmin manually chooses the correct UEFI menu entry to boot
- tftpboot serves the kernel and initramfs to the server
- The server boots with the kernel and initramfs, and retrieves its ignition file from `os-control01`

== Resources

- [1] https://pagure.io/fedora-infra/ansible/blob/main/f/roles/dhcp_server[Ansible Role DHCP Server]
- [2] https://pagure.io/fedora-infra/ansible/blob/main/f/roles/tftp_server[Ansible Role tftpboot server]

== UEFI Configuration
The configuration for UEFI booting is contained in the `grub.cfg` config, which is not currently under source control. It is located on `batcave01` at: `/srv/web/infra/bigfiles/tftpboot2/uefi/grub.cfg`.

The following is a sample configuration to install a baremetal OCP4 worker in the Staging cluster.

----
menuentry 'RHCOS 4.8 worker staging' {
    linuxefi images/RHCOS/4.8/x86_64/rhcos-4.8.2-x86_64-live-kernel-x86_64 ip=dhcp nameserver=10.3.163.33 coreos.inst.install_dev=/dev/sda coreos.live.rootfs_url=http://10.3.166.50/rhcos/rhcos-4.8.2-x86_64-live-rootfs.x86_64.img coreos.inst.ignition_url=http://10.3.166.50/rhcos/worker.ign
    initrdefi images/RHCOS/4.8/x86_64/rhcos-4.8.2-x86_64-live-initramfs.x86_64.img
}
----

Any new changes must be made here. Writing to this file requires one to be a member of the `sysadmin-main` group, so it is best to instead create a ticket in the Fedora Infra issue tracker with a patch request. See the following https://pagure.io/fedora-infrastructure/issue/10213[PR] for inspiration.

== Pushing new changes out to the tftpboot server
To push out changes made to the `grub.cfg`, the following playbook should be run, which requires `sysadmin-noc` group permissions:

----
sudo rbac-playbook groups/noc.yml -t 'tftp_server,dhcp_server'
----

On the `noc01` instance the `grub.cfg` file is located at `/var/lib/tftpboot/uefi/grub.cfg`.

If changes to OS images, for example, are required, they should be made on the `noc01` instance directly at `/var/lib/tftpboot/images/`. This requires users to be in the `sysadmin-noc` group.
@@ -0,0 +1,59 @@
= SOP Configure the Image Registry Operator

== Resources
- [1] https://docs.openshift.com/container-platform/4.8/registry/configuring_registry_storage/configuring-registry-storage-baremetal.html#configuring-registry-storage-baremetal[Configuring Registry Storage Baremetal]


== Enable the image registry operator
For detailed instructions please refer to the official documentation for the particular version of Openshift [1].

From the `os-control01` node we can enable the Image Registry Operator and set it to a `Managed` state via the CLI like so:

----
oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"managementState":"Managed"}}'
----

Next edit the configuration for the Image Registry operator like so:

----
oc edit configs.imageregistry.operator.openshift.io
----

Add the following, replacing `storage: {}`:

----
...
  storage:
    pvc:
      claim:
...
----

Save the config.

The Image Registry will automatically claim a 100G PV if one is available. It is best to open a ticket with Fedora Infra and have a 100G NFS share created.

Use the following template for inspiration; populate the values to match the newly created NFS share.

----
kind: PersistentVolume
apiVersion: v1
metadata:
  name: ocp-image-registry-volume
spec:
  capacity:
    storage: 100Gi
  nfs:
    server: 10.3.162.11
    path: /ocp_prod_registry
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  volumeMode: Filesystem
----

To create this new PV, create a persistent volume template file like the above and apply it using the Openshift client tool like so:

----
oc apply -f image-registry-pv.yaml
----
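
Once the PV exists and the operator is `Managed`, the rollout can be confirmed with a few standard checks:

----
oc get clusteroperator image-registry
oc get pvc -n openshift-image-registry
oc get pods -n openshift-image-registry
----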
@@ -0,0 +1,27 @@
= Configure the Local Storage Operator

== Resources
- [1] https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.7/html/deploying_openshift_container_storage_using_bare_metal_infrastructure/deploy-using-local-storage-devices-bm
- [2] https://github.com/centosci/ocp4-docs/blob/master/sops/localstorage/installation.md


== Installation
For installation instructions visit the official docs at [1]. The CentOS CI SOP at [2] also has more context, but it is now slightly dated.

- From the web console, click on the `Operators` option, then `OperatorHub`
- Search for `Local Storage`
- Click install
- Make sure the `Update Channel` matches the major.minor version of your OCP4 install
- Choose `A specific namespace on this cluster`
- Choose `Operator recommended namespace`
- Set `Update approval` to automatic
- Click install

== Configuration
A prerequisite for this step is to have all volumes on the nodes already formatted and available. This can be done via a machineconfig/ignition file at installation time, or alternatively by SSHing onto the boxes and manually creating/formatting the volumes.

- Create a `LocalVolumeDiscovery` and configure it to target the disks on all nodes.
- When that process is complete, it creates `LocalVolumeDiscoveryResult` objects, which you can list and examine to see whether the correct disks have been found and are showing as available.
- Create a `LocalVolumeSet`: name `local-block`, storage class `local-block`, type all, device types disk and part, filter disks by the selected nodes (worker01-03), volume mode block. Create.
- After a period of time check the newly created LocalVolumeSet `local-block` object's yaml definition; it should show the correct number of volumes in the `totalProvisionedDeviceCount` field. The same checks can be made from the CLI, as in the sketch after this list.
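
A minimal CLI sketch of those checks (resource names follow the examples used elsewhere in these SOPs):

----
oc get localvolumediscoveryresults -n openshift-local-storage
oc get localvolumeset local-block -n openshift-local-storage -o yaml | grep totalProvisionedDeviceCount
oc get pv | grep local-block
----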
modules/sysadmin_guide/pages/sop_configure_oauth_ipa.adoc  (new file, 48 lines)
@@ -0,0 +1,48 @@
= SOP Configure oauth Authentication via IPA/Noggin


== Resources

- [1] https://pagure.io/fedora-infra/ansible/blob/main/f/files/communishift/objects[Example Config from Communishift]


== OIDC Setup
The first step is to request that a secret be created for this environment; please open a ticket with Fedora Infra. Once the secret has been made available we can add it to an Openshift Secret in the cluster like so:

----
oc create secret generic fedoraidp-clientsecret --from-literal=clientSecret=<client-secret> -n openshift-config
----

Next we can update the oauth configuration on the cluster and add the config for ipa/noggin/ipsilon. See the following snippet for inspiration:

----
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  ...
  - name: fedoraidp
    login: true
    challenge: false
    mappingMethod: claim
    type: OpenID
    openID:
      clientID: ocp
      clientSecret:
        name: fedoraidp-clientsecret
      extraScopes:
      - email
      - profile
      claims:
        preferredUsername:
        - nickname
        name:
        - name
        email:
        - email
      issuer: https://id.fedoraproject.org
----

This config already exists in the cluster, so you need to edit or patch it; you can't just `oc apply -f template.yaml`.
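
For example, open the existing object with `oc edit` and add the identity provider entry from the snippet above under `spec.identityProviders`:

----
oc edit oauth cluster
----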
@@ -0,0 +1,37 @@
= Configure the Openshift Container Storage Operator


== Resources

- [1] https://docs.openshift.com/container-platform/4.8/storage/persistent_storage/persistent-storage-ocs.html[Official Docs]
- [2] https://github.com/red-hat-storage/ocs-operator[Github]

== Installation
Important: before following this SOP, please ensure that you have already followed the SOP to install the Local Storage Operator, as it is a requirement for the OCS operator.

For full detailed instructions please refer to the official docs at [1]. For general instructions see below:

- In the web console, click the Operators menu
- Click the OperatorHub menu
- Search for `OpenShift Container Storage`
- Click install
- Choose the update channel to match the major.minor version of the cluster itself
- Installation mode: A specific namespace on the cluster
- Installed namespace: Operator Recommended
- Update approval: automatic
- Click install


== Configuration
When the operator has finished installing we can continue; please ensure that a minimum of 3 nodes are available.

- A `StorageCluster` is required to complete this installation; click Create StorageCluster.
- At the top, choose the `internal - attached devices` mode.
- In the storage class dropdown choose `local-block` from the list.
- The compute/worker nodes with available storage appear in the list.
- It automatically calculates the possible storage amount.
- Click next.
- On the `Security and Network` section just click next.
- Click create.
@@ -0,0 +1,56 @@
= Installation of the Openshift Virtualization Operator

== Resources
- [1] https://alt.fedoraproject.org/cloud/[Fedora Images]
- [2] https://github.com/kubevirt/kubevirt/blob/main/docs/container-register-disks.md[Kubevirt Importing Containers of VMI Images]


== Installation
From the web console, choose the `Operators` menu, and choose `OperatorHub`.

Search for `Openshift Virtualization`.

Click install.

When the installation of the Operator is complete, create a `HyperConverged` object and follow the wizard; the default options should be fine, click next through the menus.

Next create a `HostPathProvisioner` object; the default options should be fine, click next through the menus.


== Verification
To verify that the installation of the Operator was successful, we can attempt to create a VM.

From [1], download the Fedora 34 `Cloud Base image for Openstack` image in `qcow2` format locally.

Create a `Dockerfile` with the following contents:

----
FROM scratch
ADD fedora34.qcow2 /disk/
----

Build the container:

----
podman build -t fedora34:latest .
----

Push the container to your username at quay.io.

----
podman push quay.io/<USER>/fedora34:latest
----

In the web console, visit the Workloads, then Virtualization menu.

Create a VirtualMachine with Wizard.

Choose Fedora and click next.

From the boot source dropdown menu, select import via Registry.

In the container image field, add the one prepared earlier, e.g. `quay.io/dkirwan/fedora34`.

Click the `Advanced Storage settings`, change the storage class to `ocs-storagecluster-ceph-rbd`, then click next and done.

Once the VM is created and booted, the console is available from the top right drop down menu.
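
The VM can also be checked from the CLI using the KubeVirt resource types (a quick sketch; substitute the project the VM was created in):

----
oc get vm,vmi -n <project>
----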
@@ -0,0 +1,95 @@
= Enable User Workload Monitoring Stack

== Resources
- [1] https://docs.openshift.com/container-platform/4.8/monitoring/enabling-monitoring-for-user-defined-projects.html[Official Docs]
- [2] https://docs.openshift.com/container-platform/4.8/monitoring/enabling-monitoring-for-user-defined-projects.html#granting-users-permission-to-monitor-user-defined-projects_enabling-monitoring-for-user-defined-projects[Providing Access to the UWMS features]
- [3] https://docs.openshift.com/container-platform/4.8/monitoring/enabling-monitoring-for-user-defined-projects.html#granting-user-permissions-using-the-web-console_enabling-monitoring-for-user-defined-projects[Providing Access to the UWMS dashboard]
- [4] https://docs.openshift.com/container-platform/4.8/monitoring/configuring-the-monitoring-stack.html#configuring-persistent-storage[Configure Monitoring Stack]

== Configuration
To enable the stack, edit the `cluster-monitoring-config` ConfigMap like so:

----
oc -n openshift-monitoring edit configmap cluster-monitoring-config
----

Set `enableUserWorkload` to `true` like so:

----
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
    prometheusK8s:
      retention: 30d
      volumeClaimTemplate:
        spec:
          storageClassName: ocs-storagecluster-ceph-rbd
          resources:
            requests:
              storage: 100Gi
    alertmanagerMain:
      volumeClaimTemplate:
        spec:
          storageClassName: ocs-storagecluster-ceph-rbd
          resources:
            requests:
              storage: 50Gi
----

Save the configmap changes. Monitor the rollout progress of the User Workload Monitoring Stack with the following:

----
oc -n openshift-user-workload-monitoring get pod
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-6f7b748d5b-t7nbg   2/2     Running   0          3h
prometheus-user-workload-0             4/4     Running   1          3h
prometheus-user-workload-1             4/4     Running   1          3h
thanos-ruler-user-workload-0           3/3     Running   0          3h
thanos-ruler-user-workload-1           3/3     Running   0          3h
----

At this point we can create a `ConfigMap` to configure the User Workload Monitoring stack in the `openshift-user-workload-monitoring` namespace.

----
oc create configmap user-workload-monitoring-config -n openshift-user-workload-monitoring
----

Then edit this ConfigMap:

----
oc -n openshift-user-workload-monitoring edit configmap user-workload-monitoring-config
----

Save the following configuration:

----
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      retention: 30d
      volumeClaimTemplate:
        spec:
          storageClassName: ocs-storagecluster-ceph-rbd
          resources:
            requests:
              storage: 100Gi
    thanosRuler:
      volumeClaimTemplate:
        spec:
          storageClassName: ocs-storagecluster-ceph-rbd
          resources:
            requests:
              storage: 50Gi
----

To give users access to create `PrometheusRule`, `ServiceMonitor` and `PodMonitor` objects see [2]. To allow access to the User Workload Monitoring Stack dashboards see [3].
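
For reference, a minimal `ServiceMonitor` a tenant might create in their own namespace looks like this (names, port and namespace are illustrative; the point is that user-defined monitoring picks it up automatically once the stack above is enabled):

----
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app            # illustrative
  namespace: my-project        # a user namespace
spec:
  selector:
    matchLabels:
      app: example-app         # must match the Service's labels
  endpoints:
  - port: web                  # name of the Service port exposing /metrics
    interval: 30s
----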
@@ -0,0 +1,53 @@
= Cordoning Nodes and Draining Pods
This SOP should be followed in the following scenarios:

- If maintenance is scheduled to be carried out on an Openshift node.


== Steps

1. Connect to the `os-control01` host associated with this ENV. Become root: `sudo su -`.

2. Mark the nodes as unschedulable (the loop below cordons every node in the cluster; adjust the `nodes` list if only a subset is affected):
+
----
nodes=$(oc get nodes -o name | sed -E "s/node\///")
echo $nodes

for node in ${nodes[@]}; do oc adm cordon $node; done
node/<node> cordoned
----

3. Check that the node status is `NotReady,SchedulingDisabled`
+
----
oc get node <node1>
NAME      STATUS                        ROLES    AGE   VERSION
<node1>   NotReady,SchedulingDisabled   worker   1d    v1.18.3
----
+
Note: It might not switch to `NotReady` immediately; there may be many pods still running.

4. Evacuate the Pods from **worker nodes** as follows.
This will drain node `<node1>`, delete any local data, ignore daemonsets, and give a grace period of 15 seconds for pods to terminate gracefully.
+
----
oc adm drain <node1> --delete-emptydir-data=true --ignore-daemonsets=true --grace-period=15
----
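+
To see which pods are still running on the node while it drains (a standard field selector; substitute the node name):
+
----
oc get pods --all-namespaces -o wide --field-selector spec.nodeName=<node1>
----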

5. Perform the scheduled maintenance on the node
Do whatever is required in the scheduled maintenance window.

6. Once the node is ready to be added back into the cluster
We must uncordon the node. This allows it to be marked schedulable once more.
+
----
nodes=$(oc get nodes -o name | sed -E "s/node\///")
echo $nodes

for node in ${nodes[@]}; do oc adm uncordon $node; done
----

== Resources

- [1] https://docs.openshift.com/container-platform/4.8/nodes/nodes/nodes-nodes-working.html[Nodes - working with nodes]
modules/sysadmin_guide/pages/sop_create_machineconfigs.adoc  (new file, 38 lines)
@@ -0,0 +1,38 @@
= Create MachineConfigs to Configure RHCOS

== Resources

- [1] https://coreos.github.io/butane/getting-started/[Butane Getting Started]
- [2] https://docs.openshift.com/container-platform/4.8/post_installation_configuration/machine-configuration-tasks.html#installation-special-config-chrony_post-install-machine-configuration-tasks[OCP4 Post Installation Configuration]

== Butane
"Butane (formerly the Fedora CoreOS Config Transpiler) is a tool that consumes a Butane Config and produces an Ignition Config, which is a JSON document that can be given to a Fedora CoreOS machine when it first boots." [1]

Butane is available as a container image; we can pull the latest version locally like so:

----
# Pull the latest release
podman pull quay.io/coreos/butane:release

# Run butane using standard in and standard out
podman run -i --rm quay.io/coreos/butane:release --pretty --strict < your_config.bu > transpiled_config.ign

# Run butane using files.
podman run --rm -v /path/to/your_config.bu:/config.bu:z quay.io/coreos/butane:release --pretty --strict /config.bu > transpiled_config.ign
----

We can create a CLI alias to make running the Butane container much easier, like so:

----
alias butane='podman run --rm --tty --interactive \
              --security-opt label=disable        \
              --volume ${PWD}:/pwd --workdir /pwd \
              quay.io/coreos/butane:release'
----

For more detailed information on how to structure your Butane file see [1]. Once created, you can convert the Butane config into a MachineConfig manifest (which embeds the Ignition config) like so:

----
butane master_chrony_machineconfig.bu -o master_chrony_machineconfig.yaml
butane worker_chrony_machineconfig.bu -o worker_chrony_machineconfig.yaml
----
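
For reference, a minimal Butane config along the lines of the chrony examples above might look like this. It is a sketch following the upstream docs [1][2]; the NTP source in the file contents is a placeholder, not the Fedora Infra value:

----
variant: openshift
version: 4.8.0
metadata:
  name: 99-worker-chrony
  labels:
    machineconfiguration.openshift.io/role: worker
storage:
  files:
  - path: /etc/chrony.conf
    mode: 0644
    overwrite: true
    contents:
      inline: |
        pool ntp.example.org iburst   # placeholder NTP source
        driftfile /var/lib/chrony/drift
        makestep 1.0 3
        rtcsync
----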
@@ -0,0 +1,70 @@
= SOP Disable `self-provisioners` Role

== Resources

- [1] https://docs.openshift.com/container-platform/4.4/applications/projects/configuring-project-creation.html#disabling-project-self-provisioning_configuring-project-creation


== Disabling the self-provisioners role
By default, when a user authenticates with Openshift via OAuth, they are part of the `system:authenticated:oauth` group, which is bound to the `self-provisioner` role. This provides the ability to create new projects. On the Fedora cluster we do not want users to be able to create their own projects, as we have a system in place where we create a project and control the administrators of that project.

To disable the self-provisioner role, do the following as outlined in the documentation [1].

----
oc describe clusterrolebinding.rbac self-provisioners

Name:         self-provisioners
Labels:       <none>
Annotations:  rbac.authorization.kubernetes.io/autoupdate=true
Role:
  Kind:  ClusterRole
  Name:  self-provisioner
Subjects:
  Kind   Name                        Namespace
  ----   ----                        ---------
  Group  system:authenticated:oauth
----

Remove the subjects that the self-provisioners role applies to.

----
oc patch clusterrolebinding.rbac self-provisioners -p '{"subjects": null}'
----

Verify the change occurred successfully:

----
oc describe clusterrolebinding.rbac self-provisioners
Name:         self-provisioners
Labels:       <none>
Annotations:  rbac.authorization.kubernetes.io/autoupdate: true
Role:
  Kind:  ClusterRole
  Name:  self-provisioner
Subjects:
  Kind   Name   Namespace
  ----   ----   ---------
----

When the cluster is updated to a new version, unless we mark the role binding appropriately, the permissions will be restored after the update is complete.

Verify whether the value is currently set to be restored after an update:

----
oc get clusterrolebinding.rbac self-provisioners -o yaml
----

----
apiVersion: authorization.openshift.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  ...
----

We wish to set `rbac.authorization.kubernetes.io/autoupdate` to `false`. To patch this, do the following.

----
oc patch clusterrolebinding.rbac self-provisioners -p '{ "metadata": { "annotations": { "rbac.authorization.kubernetes.io/autoupdate": "false" } } }'
----
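
The annotation can be verified afterwards:

----
oc get clusterrolebinding.rbac self-provisioners -o yaml | grep autoupdate
----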
modules/sysadmin_guide/pages/sop_etcd_backup.adoc  (new file, 50 lines)
@@ -0,0 +1,50 @@
= Create etcd backup
This SOP should be followed in the following scenarios:

- When the need exists to create an etcd backup.
- When shutting a cluster down gracefully.

== Resources

- [1] https://docs.openshift.com/container-platform/4.8/backup_and_restore/backing-up-etcd.html[Creating an etcd backup]

== Take etcd backup

1. Connect to the `os-control01` node associated with the ENV.

2. Use the `oc` tool to make a debug connection to a control plane node:
+
----
oc debug node/<node_name>
----

3. Chroot to the `/host` directory on the container's filesystem:
+
----
sh-4.2# chroot /host
----

4. Run the cluster-backup.sh script and pass in the location to save the backup to:
+
----
sh-4.4# /usr/local/bin/cluster-backup.sh /home/core/assets/backup
----
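+
The script should leave two files in the target directory, a database snapshot and a tarball of the static pod resources; a quick listing confirms the backup exists and is non-empty:
+
----
ls -lh /home/core/assets/backup/
# expect files along the lines of:
#   snapshot_<timestamp>.db
#   static_kuberesources_<timestamp>.tar.gz
----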

5. Chown the backup files to be owned by user `core` and group `core`:
+
----
chown -R core:core /home/core/assets/backup
----

6. From the admin machine (see the inventory group `ocp-ci-management`), become the Openshift service account; see the inventory hostvars for the host identified in the previous step and note the `ocp_service_account` variable.
+
----
ssh <host>
sudo su - <ocp_service_account>
----

7. Copy the files down to the `os-control01` machine.
+
----
scp -i <ssh_key> core@<node_name>:/home/core/assets/backup/* ocp_backups/
----
modules/sysadmin_guide/pages/sop_fas2discourse_operator.adoc  (new file, 13 lines)
@@ -0,0 +1,13 @@
= fas2discourse Operator
The following SOPs are related to the administration of the fas2discourse operator.

== Resources
- https://pagure.io/cpe/fas2discourse/[Code]
- https://quay.io/repository/fedora/fas2discourse-operator[Image]
- https://pagure.io/fedora-infrastructure/issue/10952[Initial ticket]
- xref:sop_fas2discourse_operator_installation.adoc[Install the fas2discourse operator]
- xref:sop_fas2discourse_operator_testing.adoc[Testing the fas2discourse operator]
- xref:sop_fas2discourse_operator_build.adoc[Building/releasing the fas2discourse operator]
- xref:sop_fas2discourse_operator_interacting.adoc[Interacting with the fas2discourse operator]
- xref:sop_fas2discourse_operator_debugging.adoc[Debugging issues with the fas2discourse operator]
@@ -0,0 +1,23 @@
= Build/release the fas2discourse Operator

== Resources
- [1] Code: https://pagure.io/cpe/fas2discourse
- [2] Quay: https://quay.io/repository/fedora/fas2discourse-operator

== Build and release
To build the operator and tag it with version `0.0.63` as an example:

- First ensure that you are logged into quay.io and have access to the repository at [2].
- Check out the code at [1].
- Change the version of the operator being built by editing the `Makefile` and updating the variable at the top: `VERSION ?= 0.0.63`

----
make build
----

Then push the operator to the quay.io repository with the following:

----
podman push quay.io/fedora/fas2discourse-operator:0.0.63
----
@@ -0,0 +1,108 @@
= Debugging issues with the fas2discourse Operator

== Resources
- [1] Code: https://pagure.io/cpe/fas2discourse/
- [2] Playbook: https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/manual/fas2discourse.yml
- [3] Role: https://pagure.io/fedora-infra/ansible/blob/main/f/roles/fas2discourse

== Workload
The operator runs in the `fas2discourse-operator` namespace on both the staging and production openshift clusters.

There is a single pod running. The first port of call should be to examine the logs of this pod.

By default, the verbosity of the logs is set low. To increase it to debug level, add the following annotation to the `Fas2discourseConfig` object in the `fas2discourse-operator` namespace:

----
apiVersion: fas2discourse.apps.fedoraproject.org/v1alpha1
kind: Fas2discourseConfig
metadata:
  annotations:
    ansible.sdk.operatorframework.io/verbosity: '5'
----
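
The same annotation can also be applied from the CLI without opening an editor (a sketch; substitute the actual name of the `Fas2discourseConfig` object in the namespace):

----
oc annotate fas2discourseconfig <name> -n fas2discourse-operator \
  ansible.sdk.operatorframework.io/verbosity='5' --overwrite
----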

This will enable full output from logging, which may aid in debugging.

The following task list is contained inside the operator. This list is repeated in the reconcile loop, which is currently set to run every `20 minutes`.
The reconcile interval can be adjusted in the `watches.yaml` file.

----
# tasks file for Fas2discourseConfig

- include_tasks: retrieve_openshift_secrets.yml # Retrieves the secrets such as the discourse api key etc and populates variables which feed into the later tasks
- include_tasks: kerberos_auth.yml # Authenticate to fasjson via keytab
- include_tasks: retrieve_discourse_groups.yml # Contact the Discourse API, retrieve the list of groups, and retrieve the list of users in each group
- include_tasks: retrieve_ipa_groups.yml # Contact fasjson and, using the Discourse group list, retrieve the membership of each group in IPA
- include_tasks: sync_group_membership.yml # Using set functions, discover who is not in the Discourse group but is in the IPA group: add them. Who is in the Discourse group but not in the IPA group: remove them.
----

The result of each call in the workflow is output in the log. If any task fails, the entire loop stops and retries.


== Contributors
A simple guide to troubleshooting in CodeReady Containers (crc).

* Make the change
* Tag and create a new image and push it to the registry
+
Open the Makefile and bump up the version
`make`
`podman push quay.io/fedora/fas2discourse-operator:<VERSION>`
+
In case you don't have the permissions to push to the repositories in the fedora namespace,
push to your own namespace in quay.io and pull the image into crc from there.
+
* Start crc and log in
+
`crc start`
`oc login -u kubeadmin https://api.crc.testing:6443`
+
* Deploy the controller to the k8s cluster
`make deploy`
+
* Remove the deployment and deploy again
+
`oc get deployments`
`oc delete deployment <deployment NAME>`
`make deploy`
+
* Check you are in the correct project
+
`oc project fas2discourse-operator`
+
* Apply the Fas2discourseConfig custom resource
+
`oc apply -f config/samples/_v1alpha1_fas2discourseconfig.yaml`
+
* Check the logs:
+
`oc get pods`
`oc logs -f <pod NAME>`


== Local testing or developing
The guide above will work only when running in the cluster.
For local testing, it is necessary to create a secret.
For that you have to create a https://meta.discourse.org/t/create-and-configure-an-api-key/230124[Discourse API key]
(probably in the staging Discourse instance)
and a https://pagure.io/fedora-infra/howtos/blob/main/f/create_keytab.md[keytab file]
for kinit.

=== Create a secret
Use the command `oc create secret generic`. Let's name the secret `fas2discourse-operator-discourse-apikey-secret`.

`FAS2DISCOURSE_API_KEY=<insert your Discourse API key>`
`DISCOURSE_HOST=https://askfedora.staged-by-discourse.com/`
`KEYTAB_NAME=<insert name of the keytab file>`
`KEYTAB_PATH=<insert path to the keytab file>`

For example:
`oc create secret generic fas2discourse-operator-discourse-apikey-secret -n fas2discourse-operator --from-literal fas2discourse-discourse-apikey=$FAS2DISCOURSE_API_KEY --from-literal discourse-host="$DISCOURSE_HOST"`

`oc create secret generic fas2discourse-operator-keytab-secret --from-file=$KEYTAB_NAME="$KEYTAB_PATH"`

To confirm the secret was created, run:
`oc get secrets`

Continue with applying the Fas2discourseConfig custom resource.
|
|
@@ -0,0 +1,34 @@
|
|||
= Installation of the fas2discourse Operator
|
||||
|
||||
== Resources
|
||||
- [1] Code: https://pagure.io/cpe/fas2discourse/
|
||||
- [2] Playbook: https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/manual/fas2discourse.yml
|
||||
- [3] Role: https://pagure.io/fedora-infra/ansible/blob/main/f/roles/fas2discourse
|
||||
|
||||
== Installation on Fedora Infra
|
||||
|
||||
There is a playbook [2] and a role [3]. To install the operator in staging and production, run the playbook [2]. Users in the `sysadmin-openshift` group have permission to run this playbook.
|
||||
|
||||
|
||||
== Installation on a CRC cluster
|
||||
There is a Makefile bundled with the code [1] of this operator.
|
||||
|
||||
To install the operator, follow these basic steps:
|
||||
|
||||
- From a terminal, be logged into the cluster with cluster-admin privileges.
|
||||
- Run `make deploy`
|
||||
|
||||
To activate the operator we need to create a `fas2discourseconfig` custom resource. An example exists in `config/samples/_v1alpha1_fas2discourseconfig.yaml`.
|
||||
|
||||
Create it with the following:
|
||||
|
||||
----
|
||||
oc apply -f config/samples/_v1alpha1_fas2discourseconfig.yaml
|
||||
----
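
To confirm the operator has picked the resource up, it can be listed afterwards (this assumes the CRD exposes the `fas2discourseconfig` resource name):

----
oc get fas2discourseconfig -n fas2discourse-operator
----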
|
||||
|
||||
|
||||
|
||||
|
||||
== Configuration
|
||||
|
||||
- No other configuration is required for this operator.
|
|
@@ -0,0 +1,48 @@
|
|||
= Interacting with the fas2discourse Operator
|
||||
|
||||
== Resources
|
||||
- [1] Code: https://pagure.io/cpe/fas2discourse/
|
||||
- [2] Playbook: https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/manual/fas2discourse.yml
|
||||
- [3] Role: https://pagure.io/fedora-infra/ansible/blob/main/f/roles/fas2discourse
|
||||
|
||||
== Overview of the fas2discourse Operator
|
||||
The role of this operator is to synchronise group membership between IPA and Discourse. It does not synchronise all groups and all members, but only groups which exist in Discourse.
|
||||
|
||||
To start the synchronisation of a group, you must first request that a Discourse admin create it in Discourse. The fas2discourse operator will then begin to synchronise users to that group based on their membership in this group in IPA.
|
||||
|
||||
== Configuration of the fas2discourse operator
|
||||
All configuration for the fas2discourse operator is contained in the Fedora Infra private ansible repo.
|
||||
|
||||
The default vars contain the variables used in the playbook which deploys the operator:
|
||||
|
||||
----
|
||||
fas2discourse_hostname: "fas2discourse.hostna.me"
|
||||
fas2discourse_namespace: "fas2discourse-operator"
|
||||
fas2discourse_project_description: "The fas2discourse-operator is responsible for synchronising group membership for users between Discourse and IPA."
|
||||
fas2discourse_keytab_file: "OVERRIDEME WITH A FILE LOOKUP"
|
||||
fas2discourse_discourse_apikey: "OVERRIDEME WITH A DISCOURSE APIKEY"
|
||||
----
|
||||
|
||||
The Operator has the following vars which it uses internally. These vars are populated by querying secrets in Openshift:
|
||||
|
||||
----
|
||||
# defaults file for Fas2discourseConfig
|
||||
fas2discourse_keytab_path: "/etc/fas2discourse"
|
||||
fas2discourse_principal: "fas2discourse/fas2discourse.hostna.me@FEDORAPROJECT.ORG"
|
||||
f2d_namespace: "fas2discourse-operator"
|
||||
f2d_secret: "fas2discourse-operator-k8s-secret"
|
||||
f2d_discourse_secret: "fas2discourse-operator-discourse-apikey-secret"
|
||||
fasjson_host: "OVERRIDEME"
|
||||
discourse_host: "OVERRIDEME"
|
||||
discourse_api: "OVERRIDEME"
|
||||
discourse_ignored_groups:
|
||||
- "admins"
|
||||
- "staff"
|
||||
- "moderators"
|
||||
- "trust_level_0"
|
||||
- "trust_level_1"
|
||||
- "trust_level_2"
|
||||
- "trust_level_3"
|
||||
- "trust_level_4"
|
||||
----
|
||||
|
|
@@ -0,0 +1,14 @@
|
|||
= Test the fas2discourse Operator
|
||||
|
||||
== Resources
|
||||
- [1] Code: https://pagure.io/cpe/fas2discourse/
|
||||
- [2] Molecule: https://molecule.readthedocs.io/en/latest/
|
||||
|
||||
== Installation
|
||||
There is a molecule directory bundled with the code [1] of this operator. The tests are currently designed to run only against a CodeReady Containers (crc) cluster.
|
||||
|
||||
To run the operator molecule tests:
|
||||
|
||||
- Ensure that the molecule utility is installed: `dnf install python3-molecule`
|
||||
- From a terminal, be logged into the crc cluster with cluster-admin privileges.
|
||||
- Run `molecule test`
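
A quick pre-flight check before running the suite (generic commands, not part of the test setup itself):

----
molecule --version        # confirm molecule is installed
oc whoami                 # confirm you are logged in to the crc cluster
molecule test
----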
|
|
@@ -0,0 +1,30 @@
|
|||
= Graceful Shutdown of an Openshift 4 Cluster
|
||||
This SOP should be followed in the following scenarios:
|
||||
|
||||
- A graceful full shutdown of the Openshift 4 cluster is required.
|
||||
|
||||
== Steps
|
||||
|
||||
Prerequisite steps:
|
||||
- Follow the SOP for cordoning and draining the nodes.
|
||||
- Follow the SOP for creating an `etcd` backup.
|
||||
|
||||
|
||||
1. Connect to the `os-control01` host associated with this ENV. Become root: `sudo su -`.
|
||||
|
||||
2. Get a list of the nodes
|
||||
+
|
||||
----
|
||||
nodes=$(oc get nodes -o name | sed -E "s/node\///")
|
||||
----
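+
As a sanity check (not part of the SOP itself), confirm you are pointed at the intended cluster and review the list before powering anything off:
+
----
oc whoami --show-server
echo "$nodes"
----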
|
||||
|
||||
3. Shut down the nodes from the administration box associated with the cluster `ENV`, e.g. production/staging.
|
||||
+
|
||||
----
|
||||
for node in ${nodes[@]}; do ssh -i /root/ocp4/ocp-<ENV>/ssh/id_rsa core@$node sudo shutdown -h now; done
|
||||
----
|
||||
|
||||
|
||||
== Resources
|
||||
|
||||
- [1] https://docs.openshift.com/container-platform/4.5/backup_and_restore/graceful-cluster-shutdown.html[Graceful Cluster Shutdown]
|
|
@@ -0,0 +1,88 @@
|
|||
= Graceful Startup of an Openshift 4 Cluster
|
||||
This SOP should be followed in the following scenarios:
|
||||
|
||||
- Graceful start up of an Openshift 4 cluster.
|
||||
|
||||
== Steps
|
||||
Prerequisite steps:
|
||||
|
||||
|
||||
=== Start the VM Control Plane instances
|
||||
Ensure that the control plane instances start first.
|
||||
|
||||
----
|
||||
# Virsh command to start the VMs
|
||||
----
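
A minimal sketch of what that command might look like, assuming the control plane guests are named `ocp01`, `ocp02` and `ocp03` on the virthost (names are illustrative, not taken from this SOP):

----
# On the virthost: start each control plane VM, then confirm they are running
for vm in ocp01 ocp02 ocp03; do
  sudo virsh start "$vm"
done
sudo virsh list --state-running
----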
|
||||
|
||||
|
||||
=== Start the physical nodes
|
||||
To connect to `idrac`, you must be connected to the Red Hat VPN. Next find the management IP associated with each node.
|
||||
|
||||
On the `batcave01` instance, in the DNS configuration, the following bare metal machines make up the production/staging OCP4 worker nodes.
|
||||
|
||||
----
|
||||
oshift-dell01 IN A 10.3.160.180 # worker01 prod
|
||||
oshift-dell02 IN A 10.3.160.181 # worker02 prod
|
||||
oshift-dell03 IN A 10.3.160.182 # worker03 prod
|
||||
oshift-dell04 IN A 10.3.160.183 # worker01 staging
|
||||
oshift-dell05 IN A 10.3.160.184 # worker02 staging
|
||||
oshift-dell06 IN A 10.3.160.185 # worker03 staging
|
||||
----
|
||||
|
||||
Log in to the `idrac` interface that corresponds to each worker, one at a time. Ensure the node boots from the hard drive, then power it on.
|
||||
|
||||
=== Uncordon the nodes once they have started, if appropriate
|
||||
|
||||
----
|
||||
oc get nodes
|
||||
NAME STATUS ROLES AGE VERSION
|
||||
dumpty-n1.ci.centos.org Ready,SchedulingDisabled worker 77d v1.18.3+6c42de8
|
||||
dumpty-n2.ci.centos.org Ready,SchedulingDisabled worker 77d v1.18.3+6c42de8
|
||||
dumpty-n3.ci.centos.org Ready,SchedulingDisabled worker 77d v1.18.3+6c42de8
|
||||
dumpty-n4.ci.centos.org Ready,SchedulingDisabled worker 77d v1.18.3+6c42de8
|
||||
dumpty-n5.ci.centos.org Ready,SchedulingDisabled worker 77d v1.18.3+6c42de8
|
||||
kempty-n10.ci.centos.org Ready,SchedulingDisabled worker 106d v1.18.3+6c42de8
|
||||
kempty-n11.ci.centos.org Ready,SchedulingDisabled worker 106d v1.18.3+6c42de8
|
||||
kempty-n12.ci.centos.org Ready,SchedulingDisabled worker 106d v1.18.3+6c42de8
|
||||
kempty-n6.ci.centos.org Ready,SchedulingDisabled master 106d v1.18.3+6c42de8
|
||||
kempty-n7.ci.centos.org Ready,SchedulingDisabled master 106d v1.18.3+6c42de8
|
||||
kempty-n8.ci.centos.org Ready,SchedulingDisabled master 106d v1.18.3+6c42de8
|
||||
kempty-n9.ci.centos.org Ready,SchedulingDisabled worker 106d v1.18.3+6c42de8
|
||||
|
||||
nodes=$(oc get nodes -o name | sed -E "s/node\///")
|
||||
|
||||
for node in ${nodes[@]}; do oc adm uncordon $node; done
|
||||
node/dumpty-n1.ci.centos.org uncordoned
|
||||
node/dumpty-n2.ci.centos.org uncordoned
|
||||
node/dumpty-n3.ci.centos.org uncordoned
|
||||
node/dumpty-n4.ci.centos.org uncordoned
|
||||
node/dumpty-n5.ci.centos.org uncordoned
|
||||
node/kempty-n10.ci.centos.org uncordoned
|
||||
node/kempty-n11.ci.centos.org uncordoned
|
||||
node/kempty-n12.ci.centos.org uncordoned
|
||||
node/kempty-n6.ci.centos.org uncordoned
|
||||
node/kempty-n7.ci.centos.org uncordoned
|
||||
node/kempty-n8.ci.centos.org uncordoned
|
||||
node/kempty-n9.ci.centos.org uncordoned
|
||||
|
||||
oc get nodes
|
||||
NAME STATUS ROLES AGE VERSION
|
||||
dumpty-n1.ci.centos.org Ready worker 77d v1.18.3+6c42de8
|
||||
dumpty-n2.ci.centos.org Ready worker 77d v1.18.3+6c42de8
|
||||
dumpty-n3.ci.centos.org Ready worker 77d v1.18.3+6c42de8
|
||||
dumpty-n4.ci.centos.org Ready worker 77d v1.18.3+6c42de8
|
||||
dumpty-n5.ci.centos.org Ready worker 77d v1.18.3+6c42de8
|
||||
kempty-n10.ci.centos.org Ready worker 106d v1.18.3+6c42de8
|
||||
kempty-n11.ci.centos.org Ready worker 106d v1.18.3+6c42de8
|
||||
kempty-n12.ci.centos.org Ready worker 106d v1.18.3+6c42de8
|
||||
kempty-n6.ci.centos.org Ready master 106d v1.18.3+6c42de8
|
||||
kempty-n7.ci.centos.org Ready master 106d v1.18.3+6c42de8
|
||||
kempty-n8.ci.centos.org Ready master 106d v1.18.3+6c42de8
|
||||
kempty-n9.ci.centos.org Ready worker 106d v1.18.3+6c42de8
|
||||
----
|
||||
|
||||
|
||||
== Resources
|
||||
|
||||
- [1] https://docs.openshift.com/container-platform/4.5/backup_and_restore/graceful-cluster-restart.html[Graceful Cluster Startup]
|
||||
- [2] https://docs.openshift.com/container-platform/4.5/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html#dr-restoring-cluster-state[Cluster disaster recovery]
|
215
modules/sysadmin_guide/pages/sop_installation.adoc
Normal file
|
@@ -0,0 +1,215 @@
|
|||
= SOP Installation/Configuration of OCP4 on Fedora Infra
|
||||
|
||||
== Resources
|
||||
|
||||
- [1]: https://docs.openshift.com/container-platform/4.8/installing/installing_bare_metal/[Official OCP4 Installation Documentation]
|
||||
|
||||
== Install
|
||||
To install OCP4 on Fedora Infra, one must be a member of the following groups:
|
||||
|
||||
- `sysadmin-openshift`
|
||||
- `sysadmin-noc`
|
||||
|
||||
|
||||
=== Prerequisites
|
||||
Visit the https://console.redhat.com/openshift/install/metal/user-provisioned[OpenShift Console] and download the following OpenShift tools:
|
||||
|
||||
* A Red Hat Access account is required
|
||||
* OC client tools https://access.redhat.com/downloads/content/290/ver=4.8/rhel---8/4.8.10/x86_64/product-software[Here]
|
||||
* OC installation tool https://access.redhat.com/downloads/content/290/ver=4.8/rhel---8/4.8.10/x86_64/product-software[Here]
|
||||
* Ensure the downloaded tools are available on the `PATH`
|
||||
* A valid OCP4 subscription is required to complete the installation configuration; by default you have a 60-day trial.
|
||||
* Take a copy of your pull secret file; you will need to put it in the `install-config.yaml` file in the next step.
|
||||
|
||||
|
||||
=== Generate install-config.yaml file
|
||||
We must create an `install-config.yaml` file. Use the following example for inspiration, or alternatively refer to the documentation [1] for more detailed information and explanations.
|
||||
|
||||
----
|
||||
apiVersion: v1
|
||||
baseDomain: stg.fedoraproject.org
|
||||
compute:
|
||||
- hyperthreading: Enabled
|
||||
name: worker
|
||||
replicas: 0
|
||||
controlPlane:
|
||||
hyperthreading: Enabled
|
||||
name: master
|
||||
replicas: 3
|
||||
metadata:
|
||||
name: 'ocp'
|
||||
networking:
|
||||
clusterNetwork:
|
||||
- cidr: 10.128.0.0/14
|
||||
hostPrefix: 23
|
||||
networkType: OpenShiftSDN
|
||||
serviceNetwork:
|
||||
- 172.30.0.0/16
|
||||
platform:
|
||||
none: {}
|
||||
fips: false
|
||||
pullSecret: 'PUT PULL SECRET HERE'
|
||||
sshKey: 'PUT SSH PUBLIC KEY HERE kubeadmin@core'
|
||||
----
|
||||
|
||||
* Log in to the `os-control01` host corresponding to the environment
|
||||
* Make a directory to hold the installation files: `mkdir ocp4-<ENV>`
|
||||
* Enter this newly created directory: `cd ocp4-<ENV>`
|
||||
* Generate a fresh SSH keypair: `ssh-keygen -f ./ocp4-<ENV>-ssh`
|
||||
* Create an `ssh` directory and place this keypair into it.
|
||||
* Put the contents of the public key in the `sshKey` value in the `install-config.yaml` file
|
||||
* Put the contents of your Pull Secret in the `pullSecret` value in the `install-config.yaml`
|
||||
* Take a backup of the `install-config.yaml` to `install-config.yaml.bak`; running the next steps consumes this file, and having a backup allows you to recover from mistakes quickly.
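
Taken together, the preparation steps above look roughly like this (a sketch; `<ENV>` is a placeholder, the passphrase-less key is an assumption, and the key is renamed to match the `ssh/id_rsa` layout shown later in this SOP):

----
mkdir ocp4-<ENV> && cd ocp4-<ENV>
ssh-keygen -f ./ocp4-<ENV>-ssh -N ""                      # generate a fresh keypair
mkdir ssh
mv ocp4-<ENV>-ssh ssh/id_rsa && mv ocp4-<ENV>-ssh.pub ssh/id_rsa.pub
# edit install-config.yaml: paste the public key into sshKey and the pull secret into pullSecret
cp install-config.yaml install-config.yaml.bak            # keep a backup, the installer consumes the original
----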
|
||||
|
||||
|
||||
=== Create the Installation Files
|
||||
Using the `openshift-install` tool we can generate the installation files. Make sure that the `install-config.yaml` file is in the `/path/to/ocp4-<ENV>` location before attempting the next steps.
|
||||
|
||||
==== Create the Manifest Files
|
||||
The manifest files are human readable; at this stage you can add any customisations required before the installation begins.
|
||||
|
||||
* Create the manifests: `openshift-install create manifests --dir=/path/to/ocp4-<ENV>`
|
||||
* All configuration for RHCOS must be done via MachineConfigs. If there is known configuration which must be performed, such as NTP, you can copy the MachineConfigs into the `/path/to/ocp4-<ENV>/openshift` directory now.
|
||||
* At this point, edit `/path/to/ocp4-<ENV>/manifests/cluster-scheduler-02-config.yml` and change the `mastersSchedulable` value to `false`.
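
A quick way to confirm the scheduler manifest was edited correctly (a generic `grep`, not an SOP step):

----
grep mastersSchedulable /path/to/ocp4-<ENV>/manifests/cluster-scheduler-02-config.yml
#   mastersSchedulable: false
----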
|
||||
|
||||
|
||||
==== Create the Ignition Files
|
||||
The ignition files are generated from the manifests and MachineConfig files and form the final installation files for the three roles: `bootstrap`, `master` and `worker`. In Fedora we prefer not to use the term `master` here, so we have renamed this role to `controlplane`.
|
||||
|
||||
* Create the ignition files: `openshift-install create ignition-configs --dir=/path/to/ocp4-<ENV>`
|
||||
* At this point you should have the following three files: `bootstrap.ign`, `master.ign` and `worker.ign`.
|
||||
* Rename the `master.ign` to `controlplane.ign`.
|
||||
* A directory has been created, `auth`. This contains two files: `kubeadmin-password` and `kubeconfig`. These allow `cluster-admin` access to the cluster.
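
As a sketch, the sequence above (commands taken from this section; the `ls` output is indicative only):

----
openshift-install create ignition-configs --dir=/path/to/ocp4-<ENV>
mv /path/to/ocp4-<ENV>/master.ign /path/to/ocp4-<ENV>/controlplane.ign
ls /path/to/ocp4-<ENV>
# auth  bootstrap.ign  controlplane.ign  worker.ign  ...
----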
|
||||
|
||||
|
||||
=== Copy the Ignition files to the `batcave01` server
|
||||
On the `batcave01` at the following location: `/srv/web/infra/bigfiles/openshiftboot/`:
|
||||
|
||||
* Create a directory to match the environment: `mkdir /srv/web/infra/bigfiles/openshiftboot/ocp4-<ENV>`
|
||||
* Copy the ignition files, the ssh files and the auth files generated in previous steps, to this newly created directory. Users with `sysadmin-openshift` should have the necessary permissions to write to this location.
|
||||
* When this is complete it should look like the following:
|
||||
----
|
||||
├── <ENV>
|
||||
│ ├── auth
|
||||
│ │ ├── kubeadmin-password
|
||||
│ │ └── kubeconfig
|
||||
│ ├── bootstrap.ign
|
||||
│ ├── controlplane.ign
|
||||
│ ├── ssh
|
||||
│ │ ├── id_rsa
|
||||
│ │ └── id_rsa.pub
|
||||
│ └── worker.ign
|
||||
----
|
||||
|
||||
|
||||
=== Update the ansible inventory
|
||||
The ansible inventory, host vars and group vars should be updated with the new hosts' information.
|
||||
|
||||
For inspiration see the following https://pagure.io/fedora-infra/ansible/pull-request/765[PR] where we added the ocp4 production changes.
|
||||
|
||||
|
||||
=== Update the DNS/DHCP configuration
|
||||
The DNS and DHCP configuration must also be updated. This https://pagure.io/fedora-infra/ansible/pull-request/765[PR] contains the necessary DHCP changes for prod and can be done in ansible.
|
||||
|
||||
However the DNS changes may only be performed by `sysadmin-main`. For this reason any DNS changes must go via a patch snippet which is emailed to the `infrastructure@lists.fedoraproject.org` mailing list for review and approval. This process may take several days.
|
||||
|
||||
|
||||
=== Generate the TLS Certs for the new environment
|
||||
This is beyond the scope of this SOP; the best option is to create a ticket for Fedora Infra to request that these certs are created and made available for use. The following certs should be available:
|
||||
|
||||
- `*.apps.<ENV>.fedoraproject.org`
|
||||
- `api.<ENV>.fedoraproject.org`
|
||||
- `api-int.<ENV>.fedoraproject.org`
|
||||
|
||||
|
||||
=== Run the Playbooks
|
||||
There are a number of playbooks which must be run. Once all the previous steps have been completed, we can run these playbooks from the `batcave01` instance.
|
||||
|
||||
- `sudo rbac-playbook groups/noc.yml -t 'tftp_server,dhcp_server'`
|
||||
- `sudo rbac-playbook groups/proxies.yml -t 'haproxy,httpd,iptables'`
|
||||
|
||||
|
||||
==== Baremetal / VMs
|
||||
Depending on whether some of the nodes are VMs or baremetal, different tags should be supplied to the following playbook. If the entire cluster is baremetal you can skip the `kvm_deploy` tag entirely.
|
||||
|
||||
If VMs are used for some of the roles, make sure to leave the tag in.
|
||||
|
||||
- `sudo rbac-playbook manual/ocp4-place-ignitionfiles.yml -t "ignition,repo,kvm_deploy"`
|
||||
|
||||
|
||||
==== Baremetal
|
||||
At this point we can switch on the baremetal nodes and begin the PXE/UEFI boot process. Via DHCP/DNS, the baremetal nodes should have the configuration necessary to reach out to the `noc01.iad2.fedoraproject.org` server and retrieve the UEFI boot configuration via PXE.
|
||||
|
||||
Once booted up, you should visit the management console for this node, and manually choose the UEFI configuration appropriate for its role.
|
||||
|
||||
The node will begin booting, and during the boot process it will reach out to the `os-control01` instance specific to the `<ENV>` to retrieve the ignition file appropriate to its role.
|
||||
|
||||
The system will then proceed autonomously; it will install and potentially reboot multiple times as updates are retrieved and applied.
|
||||
|
||||
Eventually you will be presented with an SSH login prompt, which should show the correct hostname, e.g. `ocp01`, matching what is in the DNS configuration.
|
||||
|
||||
|
||||
=== Bootstrapping completed
|
||||
When the control plane is up, we should see all controlplane instances available in the appropriate haproxy dashboard, e.g. https://admin.fedoraproject.org/haproxy/proxy01=ocp-masters-backend-kapi[haproxy].
|
||||
|
||||
At this time we should take the `bootstrap` instance out of the haproxy load balancer.
|
||||
|
||||
- Make the necessary changes to ansible at: `ansible/roles/haproxy/templates/haproxy.cfg`
|
||||
- Once merged, run the following playbook once more: `sudo rbac-playbook groups/proxies.yml -t 'haproxy'`
|
||||
|
||||
|
||||
=== Begin installation of the worker nodes
|
||||
Follow the same processes listed in the Baremetal section above to switch on the worker nodes and begin installation.
|
||||
|
||||
|
||||
=== Configure the `os-control01` to authenticate with the new OCP4 cluster
|
||||
Copy the `kubeconfig` to `~root/.kube/config` on the `os-control01` instance.
|
||||
This will allow the `root` user to automatically be authenticated to the new OCP4 cluster with `cluster-admin` privileges.
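
A minimal sketch of that configuration, using the paths from earlier in this SOP (the `system:admin` identity is the usual result for an installer-generated kubeconfig, but verify it yourself):

----
mkdir -p /root/.kube
cp /path/to/ocp4-<ENV>/auth/kubeconfig /root/.kube/config
oc whoami          # typically reports system:admin
oc get nodes       # confirms cluster-admin access is working
----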
|
||||
|
||||
|
||||
=== Accept Node CSR Certs
|
||||
To accept the worker/compute nodes into the cluster we need to accept their CSR certs.
|
||||
|
||||
List the CSR certs. The ones we're interested in will show as pending:
|
||||
|
||||
----
|
||||
oc get csr
|
||||
----
|
||||
|
||||
To accept all the OCP4 node CSRs in a one-liner, do the following:
|
||||
|
||||
----
|
||||
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
|
||||
----
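
Node CSRs are normally approved in two rounds: once the client CSRs are approved, a second set of serving CSRs appears shortly afterwards. Repeat the one-liner until no `Pending` entries remain:

----
oc get csr | grep Pending
----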
|
||||
|
||||
This should look something like this once completed:
|
||||
|
||||
----
|
||||
[root@os-control01 ocp4][STG]= oc get nodes
|
||||
NAME STATUS ROLES AGE VERSION
|
||||
ocp01.ocp.stg.iad2.fedoraproject.org Ready master 34d v1.21.1+9807387
|
||||
ocp02.ocp.stg.iad2.fedoraproject.org Ready master 34d v1.21.1+9807387
|
||||
ocp03.ocp.stg.iad2.fedoraproject.org Ready master 34d v1.21.1+9807387
|
||||
worker01.ocp.stg.iad2.fedoraproject.org Ready worker 21d v1.21.1+9807387
|
||||
worker02.ocp.stg.iad2.fedoraproject.org Ready worker 20d v1.21.1+9807387
|
||||
worker03.ocp.stg.iad2.fedoraproject.org Ready worker 20d v1.21.1+9807387
|
||||
worker04.ocp.stg.iad2.fedoraproject.org Ready worker 34d v1.21.1+9807387
|
||||
worker05.ocp.stg.iad2.fedoraproject.org Ready worker 34d v1.21.1+9807387
|
||||
----
|
||||
|
||||
At this point the cluster is basically up and running.
|
||||
|
||||
|
||||
== Follow-on SOPs
|
||||
Several other SOPs should be followed to perform the post-installation configuration on the cluster.
|
||||
|
||||
- xref:sop_configure_baremetal_pxe_uefi_boot.adoc[SOP Configure Baremetal PXE-UEFI Boot]
|
||||
- xref:sop_create_machineconfigs.adoc[SOP Create MachineConfigs to Configure RHCOS]
|
||||
- xref:sop_retrieve_ocp4_cacert.adoc[SOP Retrieve OCP4 CACERT]
|
||||
- xref:sop_configure_image_registry_operator.adoc[SOP Configure the Image Registry Operator]
|
||||
- xref:sop_disable_provisioners_role.adoc[SOP Disable the Provisioners Role]
|
||||
- xref:sop_configure_oauth_ipa.adoc[SOP Configure oauth Authentication via IPA/Noggin]
|
||||
- xref:sop_configure_local_storage_operator.adoc[SOP Configure the Local Storage Operator]
|
||||
- xref:sop_configure_openshift_container_storage.adoc[SOP Configure the Openshift Container Storage Operator]
|
||||
- xref:sop_configure_userworkload_monitoring_stack.adoc[SOP Configure the Userworkload Monitoring Stack]
|
||||
|
22
modules/sysadmin_guide/pages/sop_retrieve_ocp4_cacert.adoc
Normal file
|
@@ -0,0 +1,22 @@
|
|||
= SOP Retrieve OCP4 Cluster CACERT
|
||||
|
||||
== Resources
|
||||
|
||||
- [1] https://pagure.io/fedora-infra/ansible/blob/main/f/roles/dhcp_server[Ansible Role DHCP Server]
|
||||
|
||||
== Retrieve CACERT
|
||||
In Fedora Infra, we have Apache terminating TLS for the cluster. Connections to the api and the machineconfig server are handled by haproxy. To prevent TLS errors we must configure haproxy with the OCP4 Cluster CA Cert.
|
||||
|
||||
This can be retrieved once the cluster control plane has been installed, from the `os-control01` node like so:
|
||||
|
||||
----
|
||||
oc get configmap kube-root-ca.crt -o yaml -n openshift-ingress
|
||||
----
|
||||
|
||||
Extract this CACERT in full, and commit it to ansible at: `https://pagure.io/fedora-infra/ansible/blob/main/f/roles/haproxy/files/ocp.<ENV>-iad2.pem`
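
One way to extract just the certificate body into a file ready for ansible (this assumes the ConfigMap data key is `ca.crt`, the usual key for `kube-root-ca.crt`):

----
oc get configmap kube-root-ca.crt -n openshift-ingress \
  -o jsonpath='{.data.ca\.crt}' > ocp.<ENV>-iad2.pem
----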
|
||||
|
||||
To deploy this cert, one must be a member of the `sysadmin-noc` group. Run the following playbook:
|
||||
|
||||
----
|
||||
sudo rbac-playbook groups/proxies.yml -t 'haproxy'
|
||||
----
|
40
modules/sysadmin_guide/pages/sop_upgrade.adoc
Normal file
|
@@ -0,0 +1,40 @@
|
|||
= Upgrade OCP4 Cluster
|
||||
Please see the official documentation for more information [1][3]; this SOP can be used as a rough guide.
|
||||
|
||||
== Resources
|
||||
|
||||
- [1] https://docs.openshift.com/container-platform/4.8/updating/updating-cluster-between-minor.html[Upgrading OCP4 Cluster Between Minor Versions]
|
||||
- [2] xref:sop_etcd_backup.adoc[SOP Create etcd backup]
|
||||
- [3] https://docs.openshift.com/container-platform/4.8/operators/admin/olm-upgrading-operators.html
|
||||
- [4] https://docs.openshift.com/container-platform/4.8/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html#dr-restoring-cluster-state[Restore etcd backup]
|
||||
- [5] https://docs.openshift.com/container-platform/4.8/operators/admin/olm-upgrading-operators.html#olm-upgrading-operators[Upgrading Operators Prior to Cluster Update]
|
||||
- [6] https://access.redhat.com/downloads/content/290/ver=4.8/rhel---8/4.8.18/x86_64/packages[Openshift Clients RPM Download]
|
||||
|
||||
== Prerequisites
|
||||
|
||||
- In case an upgrade fails, it is wise to first take an `etcd` backup. To do so, follow the SOP [2].
|
||||
- Ensure that all installed Operators are at the latest versions for their channel [5].
|
||||
- Ensure that the latest `oc` client rpm is available at `/srv/web/infra/bigfiles/openshiftboot/oc-client/` on the `batcave01` server. Retrieve the RPM from [6], choosing the `Openshift Clients Binary` rpm, and rename it to `oc-client.rpm`.
|
||||
- Ensure that the `sudo rbac-playbook manual/ocp4-sysadmin-openshift.yml -t "upgrade-rpm"` playbook is run to install this updated oc client rpm.
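
Before starting, a quick pre-flight check from `os-control01` can save time (generic `oc` commands, not steps from this SOP):

----
oc get clusterversion        # current version and update channel
oc get clusteroperators      # all operators should show AVAILABLE=True, PROGRESSING=False, DEGRADED=False
oc get nodes                 # all nodes should be Ready
----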
|
||||
|
||||
== Upgrade OCP
|
||||
At the time of writing the version installed on the cluster is `4.8.11` and the `upgrade channel` is set to `stable-4.8`. It is easiest to update the cluster via the web console. Go to:
|
||||
|
||||
- Administration
|
||||
- Cluster Settings
|
||||
- To upgrade between `z` or `patch` versions (x.y.z), when one is available, click the update button.
|
||||
- When moving between `y` or `minor` versions, you must first switch the `upgrade channel`, to `fast-4.9` for example. You should also be on the very latest `z`/`patch` version before upgrading.
|
||||
- When the upgrade has finished, switch the `upgrade channel` back to stable (a CLI alternative is sketched after this list).
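
Roughly equivalent CLI commands, for reference (channel and version values are examples; confirm against the official documentation [1] before use):

----
oc adm upgrade                                                                       # show current version and available updates
oc patch clusterversion version --type merge -p '{"spec":{"channel":"fast-4.9"}}'    # switch the upgrade channel
oc adm upgrade --to-latest=true                                                      # upgrade to the latest version in the channel
----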
|
||||
|
||||
|
||||
== Upgrade failures
|
||||
In the worst-case scenario we may have to restore etcd from the backups taken at the start [4], or reinstall a node entirely.
|
||||
|
||||
=== Troubleshooting
|
||||
There are many possible ways an upgrade can fail midway through.
|
||||
|
||||
- Check the monitoring alerts currently firing; this can often hint at the problem.
|
||||
- Often individual nodes fail to take the new MachineConfig changes; these show up when examining the `MachineConfigPool` status (see the sketch after this list).
|
||||
- Might require a manual reboot of that particular node
|
||||
- Might require killing pods on that particular node
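
A few generic commands that help locate a node stuck on a MachineConfig update (not specific to this SOP):

----
oc get machineconfigpools        # look for pools with UPDATING=True or DEGRADED=True
oc get nodes                     # stuck nodes often show SchedulingDisabled or NotReady
oc describe node <node name>     # events usually explain why the new config was not applied
----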
|
||||
|
76
modules/sysadmin_guide/pages/sop_velero.adoc
Normal file
|
@@ -0,0 +1,76 @@
|
|||
= SOP Velero
|
||||
This SOP should be used in the following scenarios:
|
||||
|
||||
- Performing a data migration between OpenShift clusters.
|
||||
- Performing a data backup to S3.
|
||||
- Note: Velero doesn't support restoring into a cluster with a lower Kubernetes version than where the backup was taken.
|
||||
|
||||
== Resources
|
||||
- [1] https://velero.io/docs/main/migration-case/[Migrating between OpenShift clusters using Velero]
|
||||
|
||||
|
||||
== Steps
|
||||
1. Install the Velero CLI client.
|
||||
+
|
||||
eg:
|
||||
+
|
||||
----
|
||||
wget https://github.com/vmware-tanzu/velero/releases/download/v1.8.1/velero-v1.8.1-linux-amd64.tar.gz
|
||||
tar -zxf velero-v1.8.1-linux-amd64.tar.gz
|
||||
ln -s velero-v1.8.1-linux-amd64/velero ~/bin/velero
|
||||
----
|
||||
|
||||
|
||||
2. Configure Velero to access S3
|
||||
+
|
||||
Create a file `credentials-velero` which contains the AWS access key and secret access key with permissions to access an S3 bucket.
|
||||
+
|
||||
----
|
||||
[default]
|
||||
aws_access_key_id=XXX
|
||||
aws_secret_access_key=XXX
|
||||
----
|
||||
|
||||
|
||||
3. Next install Velero in the cluster
|
||||
+
|
||||
Ensure you are authenticated to the OpenShift cluster via the CLI.
|
||||
+
|
||||
Using something like the following:
|
||||
+
|
||||
----
|
||||
REGION="us-east-1"
|
||||
S3BUCKET="fedora-openshift-migration"
|
||||
|
||||
velero install \
|
||||
--provider aws \
|
||||
--plugins velero/velero-plugin-for-aws:v1.4.0 \
|
||||
--bucket $S3BUCKET \
|
||||
--backup-location-config region=$REGION \
|
||||
--snapshot-location-config region=$REGION \
|
||||
--use-volume-snapshots=true \
|
||||
--image velero/velero:v1.4.0 \
|
||||
--secret-file ./credentials-velero \
|
||||
--use-restic
|
||||
----
|
||||
|
||||
4. Perform a backup
|
||||
+
|
||||
eg:
|
||||
+
|
||||
----
|
||||
velero backup create backupName --include-cluster-resources=true --ordered-resources 'persistentvolumes=pvName' --include-namespaces=namespaceName
|
||||
----
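+
To check on a backup afterwards, the standard Velero subcommands can be used:
+
----
velero backup describe backupName
velero backup logs backupName
----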
|
||||
|
||||
|
||||
5. Restore a backup
|
||||
+
|
||||
While authenticated to the second cluster you are restoring to, or to the original cluster you are recovering, you can restore a backup like so:
|
||||
+
|
||||
----
|
||||
velero backup get
|
||||
velero restore create --from-backup backupName
|
||||
----
|
||||
|
||||
For more information see the `Velero` documentation at [1].
|
||||
|