= SOP Add an OCP4 Node to an Existing Cluster

This SOP should be used in the following scenario:

- A Red Hat OpenShift Container Platform 4.x cluster was installed more than 24 hours ago (1+ days) and additional worker nodes are required to increase the cluster's capacity.

== Resources

- [1] https://access.redhat.com/solutions/4246261[How to add OpenShift 4 RHCOS worker nodes in UPI within the first 24 hours]
- [2] https://access.redhat.com/solutions/4799921[How to add OpenShift 4 RHCOS worker nodes to UPI after the first 24 hours]
- [3] https://docs.openshift.com/container-platform/4.8/post_installation_configuration/node-tasks.html[Adding RHCOS worker nodes]

== Steps

1. Add the new nodes to the Ansible inventory file in the appropriate group.
+
For example:
+
----
[ocp_workers]
worker01.ocp.rdu3.fedoraproject.org
worker02.ocp.rdu3.fedoraproject.org
worker03.ocp.rdu3.fedoraproject.org

[ocp_workers_stg]
worker01.ocp.stg.rdu3.fedoraproject.org
worker02.ocp.stg.rdu3.fedoraproject.org
worker03.ocp.stg.rdu3.fedoraproject.org
worker04.ocp.stg.rdu3.fedoraproject.org
worker05.ocp.stg.rdu3.fedoraproject.org
----
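+
To confirm the new hosts ended up in the expected group before running any playbooks, the inventory can be queried directly. A minimal sketch, assuming the inventory file lives at `inventory/inventory` in the ansible checkout (adjust the path as needed):
+
----
# show the members of the worker groups as Ansible resolves them
ansible-inventory -i inventory/inventory --graph ocp_workers
ansible-inventory -i inventory/inventory --graph ocp_workers_stg
----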
2. Add the new host_vars for each new host being added; see the following examples for `VM` vs `baremetal` hosts.
+
----
# control plane VM
inventory/host_vars/ocp01.ocp.rdu3.fedoraproject.org

# compute baremetal
inventory/host_vars/worker01.ocp.rdu3.fedoraproject.org
----
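+
A practical starting point is to copy the host_vars of an existing worker of the same type and adjust the host-specific values. The new hostname below is hypothetical; the MAC address and IP in the copied file must match what is configured for DHCP and DNS in the later steps:
+
----
# copy an existing baremetal worker's host_vars as a template (new hostname is hypothetical)
cp inventory/host_vars/worker01.ocp.rdu3.fedoraproject.org \
   inventory/host_vars/worker04.ocp.rdu3.fedoraproject.org
# then edit the new file and update the host-specific values (IP address, MAC, hardware details)
----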
3. If the nodes are `compute` or `worker` nodes, they must also be added to the following group_vars: `proxies` for production and `proxies_stg` for staging. The hostnames go into the `ocp_nodes` and `ocp_nodes_stg` variables respectively:
+
----
inventory/group_vars/proxies:ocp_nodes:
inventory/group_vars/proxies_stg:ocp_nodes_stg:
----
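+
A minimal sketch of what the entry in `inventory/group_vars/proxies` looks like, assuming the variable is a plain YAML list of node hostnames (the surrounding contents of the file are omitted):
+
----
ocp_nodes:
  - worker01.ocp.rdu3.fedoraproject.org
  - worker02.ocp.rdu3.fedoraproject.org
  - worker03.ocp.rdu3.fedoraproject.org
----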
4. Changes must be made to the `roles/dhcp_server/files/dhcpd.conf.noc01.rdu3.fedoraproject.org` DHCP configuration to ensure that the node receives a fixed IP address based on its MAC address and is pointed at the `next-server`, where it can find the UEFI boot configuration.
+
----
host worker01-ocp {                        # UPDATE THIS
  hardware ethernet 68:05:CA:CE:A3:C9;     # UPDATE THIS
  fixed-address 10.16.163.123;             # UPDATE THIS
  filename "uefi/grubx64.efi";
  next-server 10.16.163.10;
  option routers 10.16.163.254;
  option subnet-mask 255.255.255.0;
}
----
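+
Before rolling the change out, the modified file can be syntax-checked locally with `dhcpd`'s test mode; a sketch, assuming the dhcp server package is installed wherever you run it:
+
----
# parse the configuration only; reports syntax errors and exits
dhcpd -t -cf roles/dhcp_server/files/dhcpd.conf.noc01.rdu3.fedoraproject.org
----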
5. Changes must be made to DNS. To do this you must be a member of `sysadmin-main`; if you are not, send a patch to the Fedora Infra mailing list for review and a `sysadmin-main` member will merge it.
+
See the following examples for the `worker01.ocp` nodes in production and staging.
+
----
master/163.3.10.in-addr.arpa:123 IN PTR worker01.ocp.rdu3.fedoraproject.org.
master/166.3.10.in-addr.arpa:118 IN PTR worker01.ocp.stg.rdu3.fedoraproject.org.
master/rdu3.fedoraproject.org:worker01.ocp IN A 10.16.163.123
master/stg.rdu3.fedoraproject.org:worker01.ocp IN A 10.16.166.118
----
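+
Once the DNS change has been merged and deployed, the new records can be sanity-checked with `dig` from a host that uses the infrastructure resolvers:
+
----
# forward and reverse lookups for the new production worker
dig +short worker01.ocp.rdu3.fedoraproject.org
dig +short -x 10.16.163.123
----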
6. Run the playbooks to roll out the DHCP/TFTP changes and to update the haproxy configuration so that the new nodes are monitored and added to the load balancer.
+
----
sudo rbac-playbook groups/noc.yml -t "tftp_server,dhcp_server"
sudo rbac-playbook groups/proxies.yml -t 'haproxy,httpd'
----
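+
To confirm the proxies picked up the change, the rendered haproxy configuration on one of the proxy hosts can be checked for the new nodes; a sketch, assuming the standard configuration path and that backends are referenced by hostname:
+
----
# on a proxy host: the new workers should now appear in the haproxy backends
grep worker /etc/haproxy/haproxy.cfg
----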
7. DHCP instructs the node to reach out to the `next-server` when it is handed an IP address. The `next-server` runs a TFTP server which serves the kernel, the initramfs and the UEFI boot configuration `uefi/grub.cfg`. This grub.cfg contains the following entries relating to the OCP4 nodes:
+
----
menuentry 'RHCOS 4.8 worker staging' {
  linuxefi images/RHCOS/4.8/x86_64/rhcos-4.8.2-x86_64-live-kernel-x86_64 ip=dhcp nameserver=10.16.163.33 coreos.inst.install_dev=/dev/sda coreos.live.rootfs_url=http://10.16.166.50/rhcos/rhcos-4.8.2-x86_64-live-rootfs.x86_64.img coreos.inst.ignition_url=http://10.16.166.50/rhcos/worker.ign
  initrdefi images/RHCOS/4.8/x86_64/rhcos-4.8.2-x86_64-live-initramfs.x86_64.img
}

menuentry 'RHCOS 4.8 worker production' {
  linuxefi images/RHCOS/4.8/x86_64/rhcos-4.8.2-x86_64-live-kernel-x86_64 ip=dhcp nameserver=10.16.163.33 coreos.inst.install_dev=/dev/sda coreos.live.rootfs_url=http://10.16.163.65/rhcos/rhcos-4.8.2-x86_64-live-rootfs.x86_64.img coreos.inst.ignition_url=http://10.16.163.65/rhcos/worker.ign
  initrdefi images/RHCOS/4.8/x86_64/rhcos-4.8.2-x86_64-live-initramfs.x86_64.img
}
----
+
When a node boots and reads this UEFI boot configuration, the appropriate menu option must be selected manually:
+
- To add a node to the staging cluster choose: `RHCOS 4.8 worker staging`
- To add a node to the production cluster choose: `RHCOS 4.8 worker production`
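+
To verify that this boot configuration is actually being served from the `next-server`, it can be fetched over TFTP; a sketch, assuming a curl build with TFTP support and network access to the boot network:
+
----
# fetch the UEFI boot configuration from the tftp server defined in DHCP
curl -s tftp://10.16.163.10/uefi/grub.cfg
----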
8. Connect to the `os-control01` node that corresponds to the ENV the new node is being added to.
+
Verify that you are correctly authenticated to the OpenShift cluster as the `system:admin` user.
+
----
oc whoami
system:admin
----
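+
If `oc whoami` does not report `system:admin`, point `oc` at the cluster's admin kubeconfig first. The path below is a hypothetical placeholder; use the location where the install-time kubeconfig is kept on `os-control01`:
+
----
# hypothetical path to the admin kubeconfig generated at install time
export KUBECONFIG=/path/to/ocp4/auth/kubeconfig
oc whoami
----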
9. The UEFI boot menu configuration contains links to the web server running on the `os-control01` host for that ENV. This server should only run while reinstalling an existing node or installing a new one. Start it manually with systemctl:
+
----
systemctl start httpd.service
----
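+
Before booting the new node, it is worth confirming that the ignition and rootfs files referenced by the grub.cfg entries are actually being served; once the installation is finished the server can be stopped again:
+
----
# confirm the web server answers and the worker ignition file is present
curl -I http://localhost/rhcos/worker.ign

# after the node has been installed, stop the server again
systemctl stop httpd.service
----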
10. Boot up the node and select the appropriate menu entry to install the node into the correct cluster.
+
Wait until the node displays an SSH login prompt with the node's name. It may reboot several times during the process.
11. As the new nodes are provisioned they will attempt to join the cluster; their certificate signing requests (CSRs) must first be approved.
+
From the `os-control01` node run the following:
+
----
# List the CSRs. Pending entries are the worker/compute nodes attempting to
# join the cluster; they must be approved.
oc get csr

# Approve all pending node CSRs in one go
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
----
+
This process usually needs to be repeated twice for each new node: once for the client CSR and once for the serving CSR.
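+
Once the CSRs have been approved, the new nodes should appear in the node list and report `Ready` after a few minutes:
+
----
# the new workers should eventually reach the Ready state
oc get nodes -o wide
----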

To see more information about adding new worker/compute nodes to a user-provisioned infrastructure (UPI) based OCP4 cluster, see the detailed steps at [1] and [2].