Metrics-for-apps: Added SOPs

- cordoning nodes
- graceful shutdown
- graceful startup

Signed-off-by: David Kirwan <dkirwan@redhat.com>
David Kirwan 2021-09-24 15:07:45 +09:00
parent 23c84096dd
commit 6c17d91dbb
4 changed files with 182 additions and 5 deletions


@@ -0,0 +1,56 @@
== Cordoning Nodes and Draining Pods
This SOP should be followed in the following scenarios:
- If maintenance is scheduled to be carried out on an Openshift node.
=== Steps
1. Connect to the `os-control01` host associated with this ENV, then become root: `sudo su -`.
2. Mark the node(s) as unschedulable. The loop below cordons every node in the cluster; to cordon a single node, run `oc adm cordon <node1>` on its own:
----
# Collect all node names, stripping the "node/" prefix
nodes=$(oc get nodes -o name | sed -E "s/node\///")
echo $nodes

# Cordon each node in turn
for node in ${nodes[@]}; do oc adm cordon $node; done
node/<node> cordoned
----
3. Check that the node status is `NotReady,SchedulingDisabled`
----
oc get node <node1>
NAME STATUS ROLES AGE VERSION
<node1> NotReady,SchedulingDisabled worker 1d v1.18.3
----
Note: It might not switch to `NotReady` immediately, as there may be many pods still running.
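To see which pods are still running on a cordoned node, a field selector query can be used (the node name below is a placeholder):
----
# List every pod currently scheduled on the node
oc get pods --all-namespaces -o wide --field-selector spec.nodeName=<node1>
----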
4. Evacuate the Pods from **worker nodes** using the following method
This will drain node `<node1>`, delete any local (emptyDir) data, ignore daemonsets, and give pods a grace period (15 seconds in the example below) to terminate gracefully before being evicted.
----
oc adm drain <node1> --delete-emptydir-data=true --ignore-daemonsets=true --grace-period=15
----
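When every worker is being drained, for example ahead of a full cluster shutdown, the same loop pattern used for cordoning can be reused. A sketch, assuming the standard `node-role.kubernetes.io/worker` label identifies the workers:
----
# Select only the worker nodes (control plane nodes are not drained here)
workers=$(oc get nodes -l node-role.kubernetes.io/worker -o name | sed -E "s/node\///")

# Drain each worker with the same options as above
for node in ${workers[@]}; do oc adm drain $node --delete-emptydir-data=true --ignore-daemonsets=true --grace-period=15; done
----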
5. Perform the scheduled maintenance on the node
Do whatever is required during the scheduled maintenance window.
6. Once the node is ready to be added back into the cluster
We must uncordon the node. This allows it to be marked schedulable once more.
----
# Collect all node names, stripping the "node/" prefix
nodes=$(oc get nodes -o name | sed -E "s/node\///")
echo $nodes

# Uncordon each node so it can be scheduled again
for node in ${nodes[@]}; do oc adm uncordon $node; done
----
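Afterwards, confirm the nodes report `Ready` with no `SchedulingDisabled` flag:
----
oc get nodes
----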
=== Resources
- [1] https://docs.openshift.com/container-platform/4.8/nodes/nodes/nodes-nodes-working.html[Nodes - working with nodes]


@@ -0,0 +1,30 @@
== Graceful Shutdown of an Openshift 4 Cluster
This SOP should be followed in the following scenarios:
- Graceful full shut down of the Openshift 4 cluster is required.
=== Steps
Prerequisite steps:
- Follow the SOP for cordoning and draining the nodes.
- Follow the SOP for creating an `etcd` backup (a sketch of the upstream approach is shown after this list).
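For orientation only, and not a substitute for the `etcd` backup SOP, the approach documented upstream runs the cluster backup script on one of the control plane nodes, roughly as follows (the master node name and backup path are examples):
----
# Run the OpenShift-provided backup script on a control plane node via oc debug
oc debug node/<master1> -- chroot /host /usr/local/bin/cluster-backup.sh /home/core/assets/backup
----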
1. Connect to the `os-control01` host associated with this ENV. Become root `sudo su -`.
2. Get a list of the nodes
----
nodes=$(oc get nodes -o name | sed -E "s/node\///")
----
3. Shut down the nodes from the administration box associated with the cluster `ENV`, e.g. production or staging.
----
# SSH to each node as the core user and power it off
for node in ${nodes[@]}; do ssh -i /root/ocp4/ocp-<ENV>/ssh/id_rsa core@$node sudo shutdown -h now; done
----
=== Resources
- [1] https://docs.openshift.com/container-platform/4.5/backup_and_restore/graceful-cluster-shutdown.html[Graceful Cluster Shutdown]


@@ -0,0 +1,88 @@
== Graceful Startup of an Openshift 4 Cluster
This SOP should be followed in the following scenarios:
- Graceful start up of an Openshift 4 cluster.
=== Steps
Prerequisite steps:
==== Start the VM Control Plane instances
Ensure that the control plane instances start first.
----
# Virsh command to start the VMs
----
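The exact command depends on the hypervisor hosting the control plane. Assuming the control plane VMs are libvirt guests, a sketch (the domain names below are placeholders):
----
# Show all defined VMs and their current state
virsh list --all

# Start each control plane VM; substitute the real domain names
for vm in ocp-master-01 ocp-master-02 ocp-master-03; do virsh start "$vm"; done
----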
==== Start the physical nodes
To connect to `idrac`, you must be connected to the Red Hat VPN. Next, find the management IP associated with each node.
On the `batcave01` instance, the DNS configuration lists the following bare metal machines, which make up the production and staging OCP4 worker nodes.
----
oshift-dell01 IN A 10.3.160.180 # worker01 prod
oshift-dell02 IN A 10.3.160.181 # worker02 prod
oshift-dell03 IN A 10.3.160.182 # worker03 prod
oshift-dell04 IN A 10.3.160.183 # worker01 staging
oshift-dell05 IN A 10.3.160.184 # worker02 staging
oshift-dell06 IN A 10.3.160.185 # worker03 staging
----
Log in to the `idrac` interface that corresponds to each worker, one at a time. Ensure the node is set to boot from its hard drive, then power it on.
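If a scriptable alternative to the `idrac` web UI is preferred and IPMI-over-LAN is enabled on the management interface, power control can also be done with `ipmitool`; a sketch (credentials are placeholders, the IP is worker01 prod from the list above):
----
# Power on a worker via its management interface
ipmitool -I lanplus -H 10.3.160.180 -U <idrac-user> -P <idrac-password> chassis power on

# Check the current power state
ipmitool -I lanplus -H 10.3.160.180 -U <idrac-user> -P <idrac-password> chassis power status
----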
==== Once the nodes have started, uncordon them if appropriate
----
oc get nodes
NAME STATUS ROLES AGE VERSION
dumpty-n1.ci.centos.org Ready,SchedulingDisabled worker 77d v1.18.3+6c42de8
dumpty-n2.ci.centos.org Ready,SchedulingDisabled worker 77d v1.18.3+6c42de8
dumpty-n3.ci.centos.org Ready,SchedulingDisabled worker 77d v1.18.3+6c42de8
dumpty-n4.ci.centos.org Ready,SchedulingDisabled worker 77d v1.18.3+6c42de8
dumpty-n5.ci.centos.org Ready,SchedulingDisabled worker 77d v1.18.3+6c42de8
kempty-n10.ci.centos.org Ready,SchedulingDisabled worker 106d v1.18.3+6c42de8
kempty-n11.ci.centos.org Ready,SchedulingDisabled worker 106d v1.18.3+6c42de8
kempty-n12.ci.centos.org Ready,SchedulingDisabled worker 106d v1.18.3+6c42de8
kempty-n6.ci.centos.org Ready,SchedulingDisabled master 106d v1.18.3+6c42de8
kempty-n7.ci.centos.org Ready,SchedulingDisabled master 106d v1.18.3+6c42de8
kempty-n8.ci.centos.org Ready,SchedulingDisabled master 106d v1.18.3+6c42de8
kempty-n9.ci.centos.org Ready,SchedulingDisabled worker 106d v1.18.3+6c42de8
# Strip the "node/" prefix and uncordon each node
nodes=$(oc get nodes -o name | sed -E "s/node\///")
for node in ${nodes[@]}; do oc adm uncordon $node; done
node/dumpty-n1.ci.centos.org uncordoned
node/dumpty-n2.ci.centos.org uncordoned
node/dumpty-n3.ci.centos.org uncordoned
node/dumpty-n4.ci.centos.org uncordoned
node/dumpty-n5.ci.centos.org uncordoned
node/kempty-n10.ci.centos.org uncordoned
node/kempty-n11.ci.centos.org uncordoned
node/kempty-n12.ci.centos.org uncordoned
node/kempty-n6.ci.centos.org uncordoned
node/kempty-n7.ci.centos.org uncordoned
node/kempty-n8.ci.centos.org uncordoned
node/kempty-n9.ci.centos.org uncordoned
oc get nodes
NAME STATUS ROLES AGE VERSION
dumpty-n1.ci.centos.org Ready worker 77d v1.18.3+6c42de8
dumpty-n2.ci.centos.org Ready worker 77d v1.18.3+6c42de8
dumpty-n3.ci.centos.org Ready worker 77d v1.18.3+6c42de8
dumpty-n4.ci.centos.org Ready worker 77d v1.18.3+6c42de8
dumpty-n5.ci.centos.org Ready worker 77d v1.18.3+6c42de8
kempty-n10.ci.centos.org Ready worker 106d v1.18.3+6c42de8
kempty-n11.ci.centos.org Ready worker 106d v1.18.3+6c42de8
kempty-n12.ci.centos.org Ready worker 106d v1.18.3+6c42de8
kempty-n6.ci.centos.org Ready master 106d v1.18.3+6c42de8
kempty-n7.ci.centos.org Ready master 106d v1.18.3+6c42de8
kempty-n8.ci.centos.org Ready master 106d v1.18.3+6c42de8
kempty-n9.ci.centos.org Ready worker 106d v1.18.3+6c42de8
----
=== Resources
- [1] https://docs.openshift.com/container-platform/4.5/backup_and_restore/graceful-cluster-restart.html[Graceful Cluster Startup]
- [2] https://docs.openshift.com/container-platform/4.5/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html#dr-restoring-cluster-state[Cluster disaster recovery]


@@ -1,12 +1,15 @@
== SOPs
- xref:sop_configure_baremetal_pxe_uefi_boot.adoc[SOP Configure Baremetal PXE-UEFI Boot]
- xref:sop_configure_image_registry_operator.adoc[SOP Configure the Image Registry Operator]
- xref:sop_configure_local_storage_operator.adoc[SOP Configure the Local Storage Operator]
- xref:sop_configure_oauth_ipa.adoc[SOP Configure oauth Authentication via IPA/Noggin]
- xref:sop_configure_openshift_container_storage.adoc[SOP Configure the Openshift Container Storage Operator]
- xref:sop_configure_userworkload_monitoring_stack.adoc[SOP Configure the Userworkload Monitoring Stack]
- xref:sop_cordoning_nodes_and_draining_pods.adoc[SOP Cordoning and Draining Nodes]
- xref:sop_create_machineconfigs.adoc[SOP Create MachineConfigs to Configure RHCOS]
- xref:sop_disable_provisioners_role.adoc[SOP Disable the Provisioners Role]
- xref:sop_graceful_shutdown_ocp_cluster.adoc[SOP Graceful Cluster Shutdown]
- xref:sop_graceful_startup_ocp_cluster.adoc[SOP Graceful Cluster Startup]
- xref:sop_installation.adoc[SOP Openshift 4 Installation on Fedora Infra]
- xref:sop_retrieve_ocp4_cacert.adoc[SOP Retrieve OCP4 CACERT]