Metrics-for-apps: Added SOPs

- cordoning nodes
- graceful shutdown
- graceful startup

Signed-off-by: David Kirwan <dkirwan@redhat.com>
David Kirwan 2021-09-24 15:07:45 +09:00
parent 23c84096dd
commit 6c17d91dbb
4 changed files with 182 additions and 5 deletions


@@ -0,0 +1,56 @@
== Cordoning Nodes and Draining Pods
This SOP should be followed in the following scenarios:
- If maintenance is scheduled to be carried out on an Openshift node.
=== Steps
1. Connect to the `os-control01` host associated with this ENV, then become root: `sudo su -`.
2. Mark the node(s) as unschedulable. The loop below cordons every node in the cluster; to cordon a single node, run `oc adm cordon <node1>` on its own:
----
# Collect all node names, stripping the "node/" prefix
nodes=$(oc get nodes -o name | sed -E "s/node\///")
echo $nodes

# Cordon each node in turn
for node in ${nodes[@]}; do oc adm cordon $node; done
node/<node> cordoned
----
3. Check that the node status is `NotReady,SchedulingDisabled`
----
oc get node <node1>
NAME STATUS ROLES AGE VERSION
<node1> NotReady,SchedulingDisabled worker 1d v1.18.3
----
Note: It might not switch to `NotReady` immediately, as there may be many pods still running.
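To see which pods are still running on a cordoned node, a field selector query can be used (the node name below is a placeholder):
----
# List every pod currently scheduled on the node
oc get pods --all-namespaces -o wide --field-selector spec.nodeName=<node1>
----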
4. Evacuate the Pods from **worker nodes** using the following method
This will drain node `<node1>`, delete any local (emptyDir) data, ignore daemonsets, and give pods a grace period (15 seconds in the example below) to terminate gracefully before being evicted.
----
oc adm drain <node1> --delete-emptydir-data=true --ignore-daemonsets=true --grace-period=15
----
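When every worker is being drained, for example ahead of a full cluster shutdown, the same loop pattern used for cordoning can be reused. A sketch, assuming the standard `node-role.kubernetes.io/worker` label identifies the workers:
----
# Select only the worker nodes (control plane nodes are not drained here)
workers=$(oc get nodes -l node-role.kubernetes.io/worker -o name | sed -E "s/node\///")

# Drain each worker with the same options as above
for node in ${workers[@]}; do oc adm drain $node --delete-emptydir-data=true --ignore-daemonsets=true --grace-period=15; done
----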
5. Perform the scheduled maintenance on the node
Do whatever is required during the scheduled maintenance window.
6. Once the node is ready to be added back into the cluster
We must uncordon the node. This allows it to be marked schedulable once more.
----
# Collect all node names, stripping the "node/" prefix
nodes=$(oc get nodes -o name | sed -E "s/node\///")
echo $nodes

# Uncordon each node so it can be scheduled again
for node in ${nodes[@]}; do oc adm uncordon $node; done
----
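Afterwards, confirm the nodes report `Ready` with no `SchedulingDisabled` flag:
----
oc get nodes
----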
=== Resources
- [1] https://docs.openshift.com/container-platform/4.8/nodes/nodes/nodes-nodes-working.html[Nodes - working with nodes]


@@ -0,0 +1,30 @@
== Graceful Shutdown of an Openshift 4 Cluster
This SOP should be followed in the following scenarios:
- Graceful full shut down of the Openshift 4 cluster is required.
=== Steps
Prerequisite steps:
- Follow the SOP for cordoning and draining the nodes.
- Follow the SOP for creating an `etcd` backup (a sketch of the upstream approach is shown after this list).
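For orientation only, and not a substitute for the `etcd` backup SOP, the approach documented upstream runs the cluster backup script on one of the control plane nodes, roughly as follows (the master node name and backup path are examples):
----
# Run the OpenShift-provided backup script on a control plane node via oc debug
oc debug node/<master1> -- chroot /host /usr/local/bin/cluster-backup.sh /home/core/assets/backup
----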
1. Connect to the `os-control01` host associated with this ENV. Become root `sudo su -`.
2. Get a list of the nodes
----
nodes=$(oc get nodes -o name | sed -E "s/node\///")
----
3. Shut down the nodes from the administration box associated with the cluster `ENV`, e.g. production or staging.
----
# SSH to each node as the core user and power it off
for node in ${nodes[@]}; do ssh -i /root/ocp4/ocp-<ENV>/ssh/id_rsa core@$node sudo shutdown -h now; done
----
=== Resources
- [1] https://docs.openshift.com/container-platform/4.5/backup_and_restore/graceful-cluster-shutdown.html[Graceful Cluster Shutdown]


@@ -0,0 +1,88 @@
== Graceful Startup of an Openshift 4 Cluster
This SOP should be followed in the following scenarios:
- Graceful start up of an Openshift 4 cluster.
=== Steps
Prerequisite steps:
==== Start the VM Control Plane instances
Ensure that the control plane instances start first.
----
# Virsh command to start the VMs
----
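The exact command depends on the hypervisor hosting the control plane. Assuming the control plane VMs are libvirt guests, a sketch (the domain names below are placeholders):
----
# Show all defined VMs and their current state
virsh list --all

# Start each control plane VM; substitute the real domain names
for vm in ocp-master-01 ocp-master-02 ocp-master-03; do virsh start "$vm"; done
----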
==== Start the physical nodes
To connect to `idrac`, you must be connected to the Red Hat VPN. Next, find the management IP associated with each node.
On the `batcave01` instance, the DNS configuration lists the following bare metal machines, which make up the production and staging OCP4 worker nodes.
----
oshift-dell01 IN A 10.3.160.180 # worker01 prod
oshift-dell02 IN A 10.3.160.181 # worker02 prod
oshift-dell03 IN A 10.3.160.182 # worker03 prod
oshift-dell04 IN A 10.3.160.183 # worker01 staging
oshift-dell05 IN A 10.3.160.184 # worker02 staging
oshift-dell06 IN A 10.3.160.185 # worker03 staging
----
Log in to the `idrac` interface that corresponds to each worker, one at a time. Ensure the node is set to boot from its hard drive, then power it on.
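If a scriptable alternative to the `idrac` web UI is preferred and IPMI-over-LAN is enabled on the management interface, power control can also be done with `ipmitool`; a sketch (credentials are placeholders, the IP is worker01 prod from the list above):
----
# Power on a worker via its management interface
ipmitool -I lanplus -H 10.3.160.180 -U <idrac-user> -P <idrac-password> chassis power on

# Check the current power state
ipmitool -I lanplus -H 10.3.160.180 -U <idrac-user> -P <idrac-password> chassis power status
----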
==== Once the nodes have started, uncordon them if appropriate
----
oc get nodes
NAME STATUS ROLES AGE VERSION
dumpty-n1.ci.centos.org Ready,SchedulingDisabled worker 77d v1.18.3+6c42de8
dumpty-n2.ci.centos.org Ready,SchedulingDisabled worker 77d v1.18.3+6c42de8
dumpty-n3.ci.centos.org Ready,SchedulingDisabled worker 77d v1.18.3+6c42de8
dumpty-n4.ci.centos.org Ready,SchedulingDisabled worker 77d v1.18.3+6c42de8
dumpty-n5.ci.centos.org Ready,SchedulingDisabled worker 77d v1.18.3+6c42de8
kempty-n10.ci.centos.org Ready,SchedulingDisabled worker 106d v1.18.3+6c42de8
kempty-n11.ci.centos.org Ready,SchedulingDisabled worker 106d v1.18.3+6c42de8
kempty-n12.ci.centos.org Ready,SchedulingDisabled worker 106d v1.18.3+6c42de8
kempty-n6.ci.centos.org Ready,SchedulingDisabled master 106d v1.18.3+6c42de8
kempty-n7.ci.centos.org Ready,SchedulingDisabled master 106d v1.18.3+6c42de8
kempty-n8.ci.centos.org Ready,SchedulingDisabled master 106d v1.18.3+6c42de8
kempty-n9.ci.centos.org Ready,SchedulingDisabled worker 106d v1.18.3+6c42de8
# Strip the "node/" prefix and uncordon each node
nodes=$(oc get nodes -o name | sed -E "s/node\///")
for node in ${nodes[@]}; do oc adm uncordon $node; done
node/dumpty-n1.ci.centos.org uncordoned
node/dumpty-n2.ci.centos.org uncordoned
node/dumpty-n3.ci.centos.org uncordoned
node/dumpty-n4.ci.centos.org uncordoned
node/dumpty-n5.ci.centos.org uncordoned
node/kempty-n10.ci.centos.org uncordoned
node/kempty-n11.ci.centos.org uncordoned
node/kempty-n12.ci.centos.org uncordoned
node/kempty-n6.ci.centos.org uncordoned
node/kempty-n7.ci.centos.org uncordoned
node/kempty-n8.ci.centos.org uncordoned
node/kempty-n9.ci.centos.org uncordoned
oc get nodes
NAME STATUS ROLES AGE VERSION
dumpty-n1.ci.centos.org Ready worker 77d v1.18.3+6c42de8
dumpty-n2.ci.centos.org Ready worker 77d v1.18.3+6c42de8
dumpty-n3.ci.centos.org Ready worker 77d v1.18.3+6c42de8
dumpty-n4.ci.centos.org Ready worker 77d v1.18.3+6c42de8
dumpty-n5.ci.centos.org Ready worker 77d v1.18.3+6c42de8
kempty-n10.ci.centos.org Ready worker 106d v1.18.3+6c42de8
kempty-n11.ci.centos.org Ready worker 106d v1.18.3+6c42de8
kempty-n12.ci.centos.org Ready worker 106d v1.18.3+6c42de8
kempty-n6.ci.centos.org Ready master 106d v1.18.3+6c42de8
kempty-n7.ci.centos.org Ready master 106d v1.18.3+6c42de8
kempty-n8.ci.centos.org Ready master 106d v1.18.3+6c42de8
kempty-n9.ci.centos.org Ready worker 106d v1.18.3+6c42de8
----
=== Resources
- [1] https://docs.openshift.com/container-platform/4.5/backup_and_restore/graceful-cluster-restart.html[Graceful Cluster Startup]
- [2] https://docs.openshift.com/container-platform/4.5/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html#dr-restoring-cluster-state[Cluster disaster recovery]


@@ -1,12 +1,15 @@
== SOPs
- xref:sop_configure_baremetal_pxe_uefi_boot.adoc[SOP Configure Baremetal PXE-UEFI Boot]
- xref:sop_configure_image_registry_operator.adoc[SOP Configure the Image Registry Operator]
- xref:sop_configure_local_storage_operator.adoc[SOP Configure the Local Storage Operator]
- xref:sop_configure_oauth_ipa.adoc[SOP Configure oauth Authentication via IPA/Noggin]
- xref:sop_configure_openshift_container_storage.adoc[SOP Configure the Openshift Container Storage Operator]
- xref:sop_configure_userworkload_monitoring_stack.adoc[SOP Configure the Userworkload Monitoring Stack]
- xref:sop_cordoning_nodes_and_draining_pods.adoc[SOP Cordoning and Draining Nodes]
- xref:sop_create_machineconfigs.adoc[SOP Create MachineConfigs to Configure RHCOS]
- xref:sop_disable_provisioners_role.adoc[SOP Disable the Provisioners Role]
- xref:sop_graceful_shutdown_ocp_cluster.adoc[SOP Graceful Cluster Shutdown]
- xref:sop_graceful_startup_ocp_cluster.adoc[SOP Graceful Cluster Startup]
- xref:sop_installation.adoc[SOP Openshift 4 Installation on Fedora Infra]
- xref:sop_retrieve_ocp4_cacert.adoc[SOP Retrieve OCP4 CACERT]