From 6c17d91dbbeec9e35c463c747d4d685732243fd8 Mon Sep 17 00:00:00 2001
From: David Kirwan
Date: Fri, 24 Sep 2021 15:07:45 +0900
Subject: [PATCH] Metrics-for-apps: Added SOPs

- cordoning nodes
- graceful shutdown
- graceful startup

Signed-off-by: David Kirwan
---
 ...sop_cordoning_nodes_and_draining_pods.adoc | 56 ++++++++++++
 .../sop_graceful_shutdown_ocp_cluster.adoc    | 30 +++++++
 .../sop_graceful_startup_ocp_cluster.adoc     | 88 +++++++++++++++++++
 modules/ocp4/pages/sops.adoc                  | 13 +--
 4 files changed, 182 insertions(+), 5 deletions(-)
 create mode 100644 modules/ocp4/pages/sop_cordoning_nodes_and_draining_pods.adoc
 create mode 100644 modules/ocp4/pages/sop_graceful_shutdown_ocp_cluster.adoc
 create mode 100644 modules/ocp4/pages/sop_graceful_startup_ocp_cluster.adoc

diff --git a/modules/ocp4/pages/sop_cordoning_nodes_and_draining_pods.adoc b/modules/ocp4/pages/sop_cordoning_nodes_and_draining_pods.adoc
new file mode 100644
index 0000000..004657a
--- /dev/null
+++ b/modules/ocp4/pages/sop_cordoning_nodes_and_draining_pods.adoc
@@ -0,0 +1,56 @@
+== Cordoning Nodes and Draining Pods
+This SOP should be followed in the following scenarios:
+
+- If maintenance is scheduled to be carried out on an Openshift node.
+
+
+=== Steps
+
+1. Connect to the `os-control01` host associated with this ENV. Become root: `sudo su -`.
+
+2. Mark the node as unschedulable:
+
+----
+nodes=$(oc get nodes -o name | sed -E "s/node\///")
+echo $nodes
+
+for node in ${nodes[@]}; do oc adm cordon $node; done
+node/<node> cordoned
+----
+
+3. Check that the node status is `NotReady,SchedulingDisabled`:
+
+----
+oc get node <node>
+NAME     STATUS                        ROLES    AGE   VERSION
+<node>   NotReady,SchedulingDisabled   worker   1d    v1.18.3
+----
+
+Note: it might not switch to `NotReady` immediately, as there may be many pods still running.
+
+
+4. Evacuate the pods from the **worker node** as follows. This drains node `<node>`, deletes any local data, ignores daemonsets, and gives pods a grace period of 15 seconds to terminate gracefully.
+
+----
+oc adm drain <node> --delete-emptydir-data=true --ignore-daemonsets=true --grace-period=15
+----
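+
+If all nodes are being drained, for example ahead of a full cluster shutdown, the drain can be run in a loop over the `nodes` list gathered in step 2, following the same pattern as the cordon loop. A sketch, assuming that variable is still set:
+
+----
+for node in ${nodes[@]}; do oc adm drain $node --delete-emptydir-data=true --ignore-daemonsets=true --grace-period=15; done
+----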
+
+
+5. Perform the scheduled maintenance on the node.
+Do whatever is required in the scheduled maintenance window.
+
+
+6. Once the node is ready to be added back into the cluster, uncordon it. This allows the node to be marked schedulable once more.
+
+----
+nodes=$(oc get nodes -o name | sed -E "s/node\///")
+echo $nodes
+
+for node in ${nodes[@]}; do oc adm uncordon $node; done
+----
+
+
+=== Resources
+
+- [1] https://docs.openshift.com/container-platform/4.8/nodes/nodes/nodes-nodes-working.html[Nodes - working with nodes]
diff --git a/modules/ocp4/pages/sop_graceful_shutdown_ocp_cluster.adoc b/modules/ocp4/pages/sop_graceful_shutdown_ocp_cluster.adoc
new file mode 100644
index 0000000..7de41f5
--- /dev/null
+++ b/modules/ocp4/pages/sop_graceful_shutdown_ocp_cluster.adoc
@@ -0,0 +1,30 @@
+== Graceful Shutdown of an Openshift 4 Cluster
+This SOP should be followed in the following scenarios:
+
+- A graceful full shutdown of the Openshift 4 cluster is required.
+
+=== Steps
+
+Prerequisite steps:
+
+- Follow the SOP for cordoning and draining the nodes.
+- Follow the SOP for creating an `etcd` backup.
+
+
+1. Connect to the `os-control01` host associated with this ENV. Become root: `sudo su -`.
+
+2. Get a list of the nodes:
+
+----
+nodes=$(oc get nodes -o name | sed -E "s/node\///")
+----
+
+3. Shut down the nodes from the administration box associated with the cluster `ENV`, e.g. production/staging:
+
+----
+for node in ${nodes[@]}; do ssh -i /root/ocp4/ocp-<ENV>/ssh/id_rsa core@$node sudo shutdown -h now; done
+----
+
+
+=== Resources
+
+- [1] https://docs.openshift.com/container-platform/4.5/backup_and_restore/graceful-cluster-shutdown.html[Graceful Cluster Shutdown]
diff --git a/modules/ocp4/pages/sop_graceful_startup_ocp_cluster.adoc b/modules/ocp4/pages/sop_graceful_startup_ocp_cluster.adoc
new file mode 100644
index 0000000..4fe76ae
--- /dev/null
+++ b/modules/ocp4/pages/sop_graceful_startup_ocp_cluster.adoc
@@ -0,0 +1,88 @@
+== Graceful Startup of an Openshift 4 Cluster
+This SOP should be followed in the following scenarios:
+
+- A graceful start up of an Openshift 4 cluster is required.
+
+=== Steps
+
+==== Start the VM Control Plane instances
+Ensure that the control plane instances start first.
+
+----
+# Start each control plane VM on its virthost, e.g. (VM names are site-specific):
+virsh start <control-plane-vm>
+----
+
+
+==== Start the physical nodes
+To connect to `idrac`, you must be connected to the Red Hat VPN. Next, find the management IP associated with each node.
+
+On the `batcave01` instance, in the DNS configuration, the following bare metal machines make up the production/staging OCP4 worker nodes.
+
+----
+oshift-dell01 IN A 10.3.160.180 # worker01 prod
+oshift-dell02 IN A 10.3.160.181 # worker02 prod
+oshift-dell03 IN A 10.3.160.182 # worker03 prod
+oshift-dell04 IN A 10.3.160.183 # worker01 staging
+oshift-dell05 IN A 10.3.160.184 # worker02 staging
+oshift-dell06 IN A 10.3.160.185 # worker03 staging
+----
+
+Log in to the `idrac` interface that corresponds with each worker, one at a time. Ensure the node is set to boot from the hard drive, then power it on.
+
+==== Uncordon the nodes
+Once the nodes have been started, they must be uncordoned if appropriate:
+
+----
+oc get nodes
+NAME                       STATUS                     ROLES    AGE    VERSION
+dumpty-n1.ci.centos.org    Ready,SchedulingDisabled   worker   77d    v1.18.3+6c42de8
+dumpty-n2.ci.centos.org    Ready,SchedulingDisabled   worker   77d    v1.18.3+6c42de8
+dumpty-n3.ci.centos.org    Ready,SchedulingDisabled   worker   77d    v1.18.3+6c42de8
+dumpty-n4.ci.centos.org    Ready,SchedulingDisabled   worker   77d    v1.18.3+6c42de8
+dumpty-n5.ci.centos.org    Ready,SchedulingDisabled   worker   77d    v1.18.3+6c42de8
+kempty-n10.ci.centos.org   Ready,SchedulingDisabled   worker   106d   v1.18.3+6c42de8
+kempty-n11.ci.centos.org   Ready,SchedulingDisabled   worker   106d   v1.18.3+6c42de8
+kempty-n12.ci.centos.org   Ready,SchedulingDisabled   worker   106d   v1.18.3+6c42de8
+kempty-n6.ci.centos.org    Ready,SchedulingDisabled   master   106d   v1.18.3+6c42de8
+kempty-n7.ci.centos.org    Ready,SchedulingDisabled   master   106d   v1.18.3+6c42de8
+kempty-n8.ci.centos.org    Ready,SchedulingDisabled   master   106d   v1.18.3+6c42de8
+kempty-n9.ci.centos.org    Ready,SchedulingDisabled   worker   106d   v1.18.3+6c42de8
+
+nodes=$(oc get nodes -o name | sed -E "s/node\///")
+
+for node in ${nodes[@]}; do oc adm uncordon $node; done
+node/dumpty-n1.ci.centos.org uncordoned
+node/dumpty-n2.ci.centos.org uncordoned
+node/dumpty-n3.ci.centos.org uncordoned
+node/dumpty-n4.ci.centos.org uncordoned
+node/dumpty-n5.ci.centos.org uncordoned
+node/kempty-n10.ci.centos.org uncordoned
+node/kempty-n11.ci.centos.org uncordoned
+node/kempty-n12.ci.centos.org uncordoned
+node/kempty-n6.ci.centos.org uncordoned
+node/kempty-n7.ci.centos.org uncordoned
+node/kempty-n8.ci.centos.org uncordoned
+node/kempty-n9.ci.centos.org uncordoned
+
+oc get nodes
+NAME                       STATUS   ROLES    AGE    VERSION
+dumpty-n1.ci.centos.org    Ready    worker   77d    v1.18.3+6c42de8
+dumpty-n2.ci.centos.org    Ready    worker   77d    v1.18.3+6c42de8
+dumpty-n3.ci.centos.org    Ready    worker   77d    v1.18.3+6c42de8
+dumpty-n4.ci.centos.org    Ready    worker   77d    v1.18.3+6c42de8
+dumpty-n5.ci.centos.org    Ready    worker   77d    v1.18.3+6c42de8
+kempty-n10.ci.centos.org   Ready    worker   106d   v1.18.3+6c42de8
+kempty-n11.ci.centos.org   Ready    worker   106d   v1.18.3+6c42de8
+kempty-n12.ci.centos.org   Ready    worker   106d   v1.18.3+6c42de8
+kempty-n6.ci.centos.org    Ready    master   106d   v1.18.3+6c42de8
+kempty-n7.ci.centos.org    Ready    master   106d   v1.18.3+6c42de8
+kempty-n8.ci.centos.org    Ready    master   106d   v1.18.3+6c42de8
+kempty-n9.ci.centos.org    Ready    worker   106d   v1.18.3+6c42de8
+----
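+
+If any node remains in `NotReady` after being started, the upstream graceful restart guide [1] recommends checking for pending certificate signing requests and approving them so the node can rejoin the cluster. A sketch, where `<csr_name>` is a placeholder:
+
+----
+oc get csr
+oc adm certificate approve <csr_name>
+----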
+
+
+=== Resources
+
+- [1] https://docs.openshift.com/container-platform/4.5/backup_and_restore/graceful-cluster-restart.html[Graceful Cluster Startup]
+- [2] https://docs.openshift.com/container-platform/4.5/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html#dr-restoring-cluster-state[Cluster disaster recovery]
diff --git a/modules/ocp4/pages/sops.adoc b/modules/ocp4/pages/sops.adoc
index 0e74933..7620939 100644
--- a/modules/ocp4/pages/sops.adoc
+++ b/modules/ocp4/pages/sops.adoc
@@ -1,12 +1,15 @@
 == SOPs
 
-- xref:sop_installation.adoc[SOP Openshift 4 Installation on Fedora Infra]
-- xref:sop_create_machineconfigs.adoc[SOP Create MachineConfigs to Configure RHCOS]
 - xref:sop_configure_baremetal_pxe_uefi_boot.adoc[SOP Configure Baremetal PXE-UEFI Boot]
-- xref:sop_retrieve_ocp4_cacert.adoc[SOP Retrieve OCP4 CACERT]
 - xref:sop_configure_image_registry_operator.adoc[SOP Configure the Image Registry Operator]
-- xref:sop_configure_oauth_ipa.adoc[SOP Configure oauth Authentication via IPA/Noggin]
-- xref:sop_disable_provisioners_role.adoc[SOP Disable the Provisioners Role]
 - xref:sop_configure_local_storage_operator.adoc[SOP Configure the Local Storage Operator]
+- xref:sop_configure_oauth_ipa.adoc[SOP Configure oauth Authentication via IPA/Noggin]
 - xref:sop_configure_openshift_container_storage.adoc[SOP Configure the Openshift Container Storage Operator]
 - xref:sop_configure_userworkload_monitoring_stack.adoc[SOP Configure the Userworkload Monitoring Stack]
+- xref:sop_cordoning_nodes_and_draining_pods.adoc[SOP Cordoning and Draining Nodes]
+- xref:sop_create_machineconfigs.adoc[SOP Create MachineConfigs to Configure RHCOS]
+- xref:sop_disable_provisioners_role.adoc[SOP Disable the Provisioners Role]
+- xref:sop_graceful_shutdown_ocp_cluster.adoc[SOP Graceful Cluster Shutdown]
+- xref:sop_graceful_startup_ocp_cluster.adoc[SOP Graceful Cluster Startup]
+- xref:sop_installation.adoc[SOP Openshift 4 Installation on Fedora Infra]
+- xref:sop_retrieve_ocp4_cacert.adoc[SOP Retrieve OCP4 CACERT]