diff --git a/modules/ocp4/pages/sop_add_node.adoc b/modules/ocp4/pages/sop_add_node.adoc
new file mode 100644
index 0000000..663caea
--- /dev/null
+++ b/modules/ocp4/pages/sop_add_node.adoc
@@ -0,0 +1,134 @@
+== SOP Add an OCP4 Node to an Existing Cluster
+This SOP should be used in the following scenario:
+
+- A Red Hat OpenShift Container Platform 4.x cluster was installed more than 24 hours ago (1+ days) and additional worker nodes are required to increase the capacity of the cluster.
+
+
+=== Resources
+- [1] https://access.redhat.com/solutions/4246261[How to add Openshift 4 RHCOS worker nodes in UPI <24 hours]
+- [2] https://access.redhat.com/solutions/4799921[How to add Openshift 4 RHCOS worker nodes to UPI >24 hours]
+- [3] https://docs.openshift.com/container-platform/4.8/post_installation_configuration/node-tasks.html[Adding RHCOS worker nodes]
+
+
+=== Steps
+1. Add the new nodes to the Ansible inventory file in the appropriate group, e.g.:
+
+----
+[ocp_workers]
+worker01.ocp.iad2.fedoraproject.org
+worker02.ocp.iad2.fedoraproject.org
+worker03.ocp.iad2.fedoraproject.org
+
+
+[ocp_workers_stg]
+worker01.ocp.stg.iad2.fedoraproject.org
+worker02.ocp.stg.iad2.fedoraproject.org
+worker03.ocp.stg.iad2.fedoraproject.org
+worker04.ocp.stg.iad2.fedoraproject.org
+worker05.ocp.stg.iad2.fedoraproject.org
+----
+
+2. Add the hostvars for each new host being added; see the following examples for `VM` vs `baremetal` hosts:
+
+----
+# control plane VM
+inventory/host_vars/ocp01.ocp.iad2.fedoraproject.org
+
+# compute baremetal
+inventory/host_vars/worker01.ocp.iad2.fedoraproject.org
+----
+
+3. If the nodes are `compute` or `worker` nodes, they must also be added to the following group_vars: `proxies` for production, `proxies_stg` for staging.
+
+----
+inventory/group_vars/proxies:ocp_nodes:
+inventory/group_vars/proxies_stg:ocp_nodes_stg:
+----
+
+4. Changes must be made to the `roles/dhcp_server/files/dhcpd.conf.noc01.iad2.fedoraproject.org` DHCP configuration file to ensure that the node receives an IP address based on its MAC address and is told to reach out to the `next-server`, where it can find the UEFI boot configuration.
+
+----
+host worker01-ocp {                        # UPDATE THIS
+    hardware ethernet 68:05:CA:CE:A3:C9;   # UPDATE THIS
+    fixed-address 10.3.163.123;            # UPDATE THIS
+    filename "uefi/grubx64.efi";
+    next-server 10.3.163.10;
+    option routers 10.3.163.254;
+    option subnet-mask 255.255.255.0;
+}
+----
+
+5. Changes must be made to DNS. To do this, send a patch to the Fedora Infra mailing list for review.
+
+See the following examples for the `worker01.ocp` nodes in production and staging:
+
+----
+master/163.3.10.in-addr.arpa:123 IN PTR worker01.ocp.iad2.fedoraproject.org.
+master/166.3.10.in-addr.arpa:118 IN PTR worker01.ocp.stg.iad2.fedoraproject.org.
+master/iad2.fedoraproject.org:worker01.ocp IN A 10.3.163.123
+master/stg.iad2.fedoraproject.org:worker01.ocp IN A 10.3.166.118
+----
+
+6. Run the following playbooks to push out the DHCP/TFTP changes and to update the haproxy config so it monitors the new nodes and adds them to the load balancer:
+
+----
+sudo rbac-playbook groups/noc.yml -t "tftp_server,dhcp_server"
+sudo rbac-playbook groups/proxies.yml -t 'haproxy,httpd'
+----
+
+
+7. DHCP instructs the node to reach out to the `next-server` when it is handed an IP address. The `next-server` runs a tftp server which provides the kernel, the initramfs, and the UEFI boot configuration, `uefi/grub.cfg`.
+This grub.cfg contains the following entries relating to the OCP4 nodes:
+
+----
+menuentry 'RHCOS 4.8 worker staging' {
+    linuxefi images/RHCOS/4.8/x86_64/rhcos-4.8.2-x86_64-live-kernel-x86_64 ip=dhcp nameserver=10.3.163.33 coreos.inst.install_dev=/dev/sda coreos.live.rootfs_url=http://10.3.166.50/rhcos/rhcos-4.8.2-x86_64-live-rootfs.x86_64.img coreos.inst.ignition_url=http://10.3.166.50/rhcos/worker.ign
+    initrdefi images/RHCOS/4.8/x86_64/rhcos-4.8.2-x86_64-live-initramfs.x86_64.img
+}
+menuentry 'RHCOS 4.8 worker production' {
+    linuxefi images/RHCOS/4.8/x86_64/rhcos-4.8.2-x86_64-live-kernel-x86_64 ip=dhcp nameserver=10.3.163.33 coreos.inst.install_dev=/dev/sda coreos.live.rootfs_url=http://10.3.163.65/rhcos/rhcos-4.8.2-x86_64-live-rootfs.x86_64.img coreos.inst.ignition_url=http://10.3.163.65/rhcos/worker.ign
+    initrdefi images/RHCOS/4.8/x86_64/rhcos-4.8.2-x86_64-live-initramfs.x86_64.img
+}
+----
+
+When a node boots and reads this UEFI boot configuration, the appropriate menu entry must be selected manually:
+
+- To add a node to the staging cluster choose: `RHCOS 4.8 worker staging`
+- To add a node to the production cluster choose: `RHCOS 4.8 worker production`
+
+8. Connect to the `os-control01` node that corresponds to the ENV the new node is being added to.
+
+Verify that you are authenticated to the OpenShift cluster as the `system:admin` user:
+
+----
+oc whoami
+system:admin
+----
+
+9. The UEFI boot menu configuration contains links to the web server running on the `os-control01` host specific to the ENV. This server should not be left running; it should run only when we wish to install a new node or reinstall an existing one. Start it manually using systemctl:
+
+----
+systemctl start httpd.service
+----
+
+10. Boot up the node and select the appropriate menu entry to install it into the correct cluster.
+From the console, wait until the node displays an SSH login prompt showing the node's name. It may reboot several times during the process.
+
+11. As the new nodes are provisioned, they will attempt to join the cluster. Their certificate signing requests (CSRs) must first be approved.
+From the `os-control01` node run the following:
+
+----
+# List the CSRs. Any with a Pending status are worker/compute nodes attempting to join the cluster and must be approved.
+oc get csr
+
+# Approve all pending node CSRs in one line
+oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
+----
+
+This process usually needs to be repeated twice for each new node: each node first issues a client CSR and then, once that is approved, a serving CSR.
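+
+Once the CSRs have been approved, it can take a few minutes for a node to register and report `Ready`. As a quick sanity check (a minimal sketch; it assumes you are still authenticated as `system:admin`):
+
+----
+# List the nodes; the new workers should appear and eventually show STATUS Ready
+oc get nodes
+
+# Or watch for status changes while the node joins
+oc get nodes -w
+----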
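+
+Finally, since the web server on `os-control01` should not be left running (see step 9), stop it once the new node has joined the cluster:
+
+----
+systemctl stop httpd.service
+----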
+
+To see more information about adding new worker/compute nodes to a user-provisioned infrastructure (UPI) based OCP4 cluster, see the detailed steps at [1], [2].
+
diff --git a/modules/ocp4/pages/sops.adoc b/modules/ocp4/pages/sops.adoc
index 292d40a..e068c57 100644
--- a/modules/ocp4/pages/sops.adoc
+++ b/modules/ocp4/pages/sops.adoc
@@ -16,3 +16,5 @@
 - xref:sop_upgrade.adoc[SOP Upgrade OCP4 Cluster]
 - xref:sop_etcd_backup.adoc[SOP Create etcd backup]
 - xref:sop_configure_openshift_virtualization_operator.adoc[SOP Configure the Openshift Virtualization Operator]
+- xref:sop_add_node.adoc[SOP Add an OCP4 Node to an Existing Cluster]
+