= SOP Add an OCP4 Node to an Existing Cluster
This SOP should be used in the following scenario:
- A Red Hat OpenShift Container Platform 4.x cluster was installed some time ago (more than 24 hours ago) and additional worker nodes are required to increase the cluster's capacity.
== Resources
- [1] https://access.redhat.com/solutions/4246261[How to add OpenShift 4 RHCOS worker nodes in UPI within the first 24 hours]
- [2] https://access.redhat.com/solutions/4799921[How to add OpenShift 4 RHCOS worker nodes to UPI after the first 24 hours]
- [3] https://docs.openshift.com/container-platform/4.8/post_installation_configuration/node-tasks.html[Adding RHCOS worker nodes]
== Steps
1. Add the new nodes to the Ansible inventory file in the appropriate group.
+
e.g.:
+
----
[ocp_workers]
worker01.ocp.iad2.fedoraproject.org
worker02.ocp.iad2.fedoraproject.org
worker03.ocp.iad2.fedoraproject.org

[ocp_workers_stg]
worker01.ocp.stg.iad2.fedoraproject.org
worker02.ocp.stg.iad2.fedoraproject.org
worker03.ocp.stg.iad2.fedoraproject.org
worker04.ocp.stg.iad2.fedoraproject.org
worker05.ocp.stg.iad2.fedoraproject.org
----
2. Add the new host_vars file for each host being added; see the following example paths for `VM` vs `baremetal` hosts, and the copy-and-edit sketch after the listing.
+
----
# control plane VM
inventory/host_vars/ocp01.ocp.iad2.fedoraproject.org
# compute baremetal
inventory/host_vars/worker01.ocp.iad2.fedoraproject.org
----
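+
A quick way to create the new file is to copy an existing node's host_vars and adjust it. The target filename below is illustrative; the MAC address, IP address and any other node-specific variables inside must be updated for the new node.
+
----
# Illustrative only: copy an existing worker's host_vars as a starting point,
# then edit the MAC address, IP address and any other node-specific variables.
cp inventory/host_vars/worker01.ocp.iad2.fedoraproject.org \
   inventory/host_vars/worker04.ocp.iad2.fedoraproject.org
----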
3. If the nodes are `compute`/`worker` nodes, they must also be added to the `ocp_nodes` variable in the following group_vars files: `proxies` for production and `proxies_stg` for staging (an illustrative snippet follows the listing below).
+
----
inventory/group_vars/proxies:ocp_nodes:
inventory/group_vars/proxies_stg:ocp_nodes_stg:
----
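+
The exact structure of the variable depends on how the proxies role consumes it, but as a rough sketch the new node is simply appended to the existing list in `inventory/group_vars/proxies` (the added hostname here is hypothetical):
+
----
ocp_nodes:
  - worker01.ocp.iad2.fedoraproject.org
  - worker02.ocp.iad2.fedoraproject.org
  - worker03.ocp.iad2.fedoraproject.org
  # hypothetical new node being added
  - worker04.ocp.iad2.fedoraproject.org
----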
4. Changes must be made to the `roles/dhcp_server/files/dhcpd.conf.noc01.iad2.fedoraproject.org` file for DHCP, to ensure that the node receives an IP address based on its MAC address and is told to reach out to the `next-server`, where it can find the UEFI boot configuration. A post-deployment syntax check is sketched after the example.
+
----
host worker01-ocp {                        # UPDATE THIS
    hardware ethernet 68:05:CA:CE:A3:C9;   # UPDATE THIS
    fixed-address 10.3.163.123;            # UPDATE THIS
    filename "uefi/grubx64.efi";
    next-server 10.3.163.10;
    option routers 10.3.163.254;
    option subnet-mask 255.255.255.0;
}
----
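+
Once the playbook run in step 6 has deployed the change, the rendered configuration can be sanity-checked on the DHCP host (`noc01`) before the new node boots, assuming the standard ISC dhcpd config location:
+
----
# Test the dhcpd configuration for syntax errors without restarting the service
dhcpd -t -cf /etc/dhcp/dhcpd.conf
----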
5. Changes must be made to DNS. This requires membership in `sysadmin-main`; if you are not a member, send a patch to the Fedora Infrastructure mailing list for review and a `sysadmin-main` member will merge it.
+
See the following examples for the `worker01.ocp` node in production and staging; a quick way to verify the records afterwards is shown after the listing.
+
----
master/163.3.10.in-addr.arpa:123 IN PTR worker01.ocp.iad2.fedoraproject.org.
master/166.3.10.in-addr.arpa:118 IN PTR worker01.ocp.stg.iad2.fedoraproject.org.
master/iad2.fedoraproject.org:worker01.ocp IN A 10.3.163.123
master/stg.iad2.fedoraproject.org:worker01.ocp IN A 10.3.166.118
----
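+
Once the change has been merged and the name servers updated, the new records can be verified with a quick lookup, using the production `worker01.ocp` example above:
+
----
# Forward and reverse lookups for the new node (values from the example records above)
dig +short worker01.ocp.iad2.fedoraproject.org
# expected: 10.3.163.123
dig +short -x 10.3.163.123
# expected: worker01.ocp.iad2.fedoraproject.org.
----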
6. Run the playbooks to deploy the updated DHCP/TFTP configuration and to update the haproxy config so that the new nodes are monitored and added to the load balancer.
+
----
sudo rbac-playbook groups/noc.yml -t "tftp_server,dhcp_server"
sudo rbac-playbook groups/proxies.yml -t 'haproxy,httpd'
----
7. DHCP instructs the node to reach out to the `next-server` when it is handed an IP address. The `next-server` runs a tftp server which provides the kernel, the initramfs and the UEFI boot configuration, `uefi/grub.cfg`. This grub.cfg contains the following entries relating to the OCP4 nodes:
+
----
menuentry 'RHCOS 4.8 worker staging' {
    linuxefi images/RHCOS/4.8/x86_64/rhcos-4.8.2-x86_64-live-kernel-x86_64 ip=dhcp nameserver=10.3.163.33 coreos.inst.install_dev=/dev/sda coreos.live.rootfs_url=http://10.3.166.50/rhcos/rhcos-4.8.2-x86_64-live-rootfs.x86_64.img coreos.inst.ignition_url=http://10.3.166.50/rhcos/worker.ign
    initrdefi images/RHCOS/4.8/x86_64/rhcos-4.8.2-x86_64-live-initramfs.x86_64.img
}
menuentry 'RHCOS 4.8 worker production' {
    linuxefi images/RHCOS/4.8/x86_64/rhcos-4.8.2-x86_64-live-kernel-x86_64 ip=dhcp nameserver=10.3.163.33 coreos.inst.install_dev=/dev/sda coreos.live.rootfs_url=http://10.3.163.65/rhcos/rhcos-4.8.2-x86_64-live-rootfs.x86_64.img coreos.inst.ignition_url=http://10.3.163.65/rhcos/worker.ign
    initrdefi images/RHCOS/4.8/x86_64/rhcos-4.8.2-x86_64-live-initramfs.x86_64.img
}
----
+
When a node boots and reads this UEFI boot configuration, the appropriate menu option must be selected manually:
+
- To add a node to the staging cluster choose: `RHCOS 4.8 worker staging`
- To add a node to the production cluster choose: `RHCOS 4.8 worker production`
8. Connect to the `os-control01` node that corresponds to the ENV to which the new node is being added.
+
Verify that you are authenticated correctly to the OpenShift cluster as the `system:admin` user.
+
----
oc whoami
system:admin
----
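+
If `oc whoami` reports a different user, one way to get `system:admin` is to point `KUBECONFIG` at the cluster-admin kubeconfig generated at install time. The path below is a placeholder; use wherever the install artifacts are kept on `os-control01`:
+
----
# Placeholder path: substitute the real location of the install-time auth directory
export KUBECONFIG=/path/to/ocp4/auth/kubeconfig
oc whoami
----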
9. The UEFI boot menu configuration contains links to a web server running on the `os-control01` host for the corresponding ENV. This server should only be running when we wish to reinstall an existing node or install a new one. Start it manually using systemctl (a quick check that the install files are being served follows the listing):
+
----
systemctl start httpd.service
----
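+
Optionally, confirm that the files referenced in the grub.cfg above are now being served before booting the node. The staging URLs are shown here; use the production IP (`10.3.163.65`) for prod:
+
----
# Check that the ignition file and live rootfs are reachable (HTTP 200 expected)
curl -I http://10.3.166.50/rhcos/worker.ign
curl -I http://10.3.166.50/rhcos/rhcos-4.8.2-x86_64-live-rootfs.x86_64.img
----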
10. Boot up the node and select the appropriate menu entry to install the node into the correct cluster.
Wait until the node displays an SSH login prompt showing the node's name; it may reboot several times during the process.
11. As the new nodes are provisioned, they will attempt to join the cluster. They must first be accepted.
From the `os-control01` node run the following:
+
----
# List the CSRs. Any shown as Pending are worker/compute nodes attempting to join the cluster and must be approved.
oc get csr
# Approve all pending node CSRs in one command
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
----
+
This process usually needs to be repeated twice for each new node, as each node submits a second CSR after its first one is approved.
For more information about adding new worker/compute nodes to a user-provisioned infrastructure (UPI) based OCP4 cluster, see the detailed steps at [1] and [2].
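+
Once all CSRs have been approved, a final check from `os-control01` confirms that the new nodes have joined the cluster and report `Ready`:
+
----
# The new workers should appear in the list and eventually move to the Ready state
oc get nodes
----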