ocp4: reordering header levels

David Kirwan 2022-03-03 11:44:45 +00:00
parent fad0c1df70
commit dc54bec3d5
19 changed files with 82 additions and 82 deletions

View file

@ -1,16 +1,16 @@
== SOP Add an OCP4 Node to an Existing Cluster
= SOP Add an OCP4 Node to an Existing Cluster
This SOP should be used in the following scenario:
- A Red Hat OpenShift Container Platform 4.x cluster was installed some time ago (1+ days ago) and additional worker nodes are required to increase the capacity of the cluster.
=== Resources
== Resources
- [1] https://access.redhat.com/solutions/4246261[How to add OpenShift 4 RHCOS worker nodes in UPI within the first 24 hours]
- [2] https://access.redhat.com/solutions/4799921[How to add OpenShift 4 RHCOS worker nodes to UPI after the first 24 hours]
- [3] https://docs.openshift.com/container-platform/4.8/post_installation_configuration/node-tasks.html[Adding RHCOS worker nodes]
=== Steps
== Steps
1. Add the new nodes to the Ansible inventory file in the appropriate group.
eg:
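As an illustration only (the group and host names below are hypothetical placeholders, not the real inventory layout):
----
[ocp_workers_stg]
worker04.ocp.stg.iad2.fedoraproject.org
worker05.ocp.stg.iad2.fedoraproject.org
----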

View file

@ -1,4 +1,4 @@
== SOP Add new capacity to the OCP4 ODF Storage Cluster
= SOP Add new capacity to the OCP4 ODF Storage Cluster
This SOP should be used in the following scenario:
- Red Hat OpenShift Container Platform 4.x cluster has been installed
@ -6,13 +6,13 @@ This SOP should be used in the following scenario:
- These additional worker nodes have storage resources which we wish to add to the Openshift Datafoundation Storage Cluster
- We are adding enough storage to meet the minimum of 3 replicas. eg: 3 nodes, or enough storage devices that the number is divisible by 3.
=== Resources
== Resources
- [1] https://access.redhat.com/solutions/4246261[How to add OpenShift 4 RHCOS worker nodes in UPI within the first 24 hours]
- [2] https://access.redhat.com/solutions/4799921[How to add OpenShift 4 RHCOS worker nodes to UPI after the first 24 hours]
- [3] https://docs.openshift.com/container-platform/4.8/post_installation_configuration/node-tasks.html[Adding RHCOS worker nodes]
- [4] https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.9[Openshift Data Foundation Product Notes]
=== Steps
== Steps
1. Once a new node has been added to the Openshift cluster, we can manage the extra local storage devices on this node from within Openshift itself, provided that they do not contain partitions/filesystems. In the case of a node being repurposed, please first ensure that all storage devices except `/dev/sda` are partition and filesystem free before starting.
2. From within the Openshift web console, or via the CLI, search for all `LocalVolumeDiscovery` objects.
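A rough sketch of these checks (assuming the Local Storage Operator lives in the usual `openshift-local-storage` namespace):
----
# On the new node: confirm the extra devices carry no partitions or filesystems
lsblk -f

# From os-control01: list the existing discovery objects and their results
oc get localvolumediscovery -n openshift-local-storage
oc get localvolumediscoveryresults -n openshift-local-storage
----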

View file

@ -1,4 +1,4 @@
== Configure Baremetal PXE-UEFI Boot
= Configure Baremetal PXE-UEFI Boot
A high level overview of how a baremetal node in the Fedora Infra gets booted via UEFI is as follows.
- Server powered on
@ -9,12 +9,12 @@ A high level overview of how a baremetal node in the Fedora Infra gets booted vi
- tftpboot serves kernel and initramfs to the server
- Server boots with kernel and initramfs, and retrieves the ignition file from `os-control01`
=== Resources
== Resources
- [1] https://pagure.io/fedora-infra/ansible/blob/main/f/roles/dhcp_server[Ansible Role DHCP Server]
- [2] https://pagure.io/fedora-infra/ansible/blob/main/f/roles/tftp_server[Ansible Role tftpboot server]
=== UEFI Configuration
== UEFI Configuration
The configuration for UEFI booting is contained in the `grub.cfg` config which is not currently under source control. It is located on the `batcave01` at: `/srv/web/infra/bigfiles/tftpboot2/uefi/grub.cfg`.
The following is a sample configuration to install a baremetal OCP4 worker in the Staging cluster.
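Purely as an illustration (the paths, URLs and kernel arguments below are placeholders, not the Fedora Infra values), a UEFI PXE entry for an RHCOS worker has roughly this shape:
----
menuentry 'RHCOS 4.8 worker staging (placeholder example)' {
    # placeholder kernel/initramfs paths served by tftpboot
    linuxefi rhcos/rhcos-live-kernel-x86_64 ip=dhcp coreos.inst.install_dev=/dev/sda coreos.live.rootfs_url=http://<webserver>/rhcos-live-rootfs.x86_64.img coreos.inst.ignition_url=http://<os-control01>/worker.ign
    initrdefi rhcos/rhcos-live-initramfs.x86_64.img
}
----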
@ -28,7 +28,7 @@ menuentry 'RHCOS 4.8 worker staging' {
Any new changes must be made here. Writing to this file requires one to be a member of the `sysadmin-main` group, so it is best to instead create a ticket in the Fedora Infra issue tracker with a patch request. See the following https://pagure.io/fedora-infrastructure/issue/10213[PR] for inspiration.
=== Pushing new changes out to the tftpboot server
== Pushing new changes out to the tftpboot server
To push out changes made to the `grub.cfg` the following playbook should be run, which requires `sysadmin-noc` group permissions:
----

View file

@ -1,10 +1,10 @@
== SOP Configure the Image Registry Operator
= SOP Configure the Image Registry Operator
=== Resources
== Resources
- [1] https://docs.openshift.com/container-platform/4.8/registry/configuring_registry_storage/configuring-registry-storage-baremetal.html#configuring-registry-storage-baremetal[Configuring Registry Storage Baremetal]
=== Enable the image registry operator
== Enable the image registry operator
For detailed instructions please refer to the official documentation for the particular version of Openshift [1].
From the `os-control01` node we can enable the Image Registry Operator and set it to a `Managed` state via the CLI, like so:
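A minimal sketch of the CLI approach described in [1]; verify the exact patch against the docs for the cluster's version:
----
# Set the Image Registry Operator's management state to Managed
oc patch configs.imageregistry.operator.openshift.io cluster \
    --type merge --patch '{"spec":{"managementState":"Managed"}}'
# (Registry storage, e.g. a PVC, must also be configured as covered in [1].)
----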

View file

@ -1,11 +1,11 @@
== Configure the Local Storage Operator
= Configure the Local Storage Operator
=== Resources
== Resources
- [1] https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.7/html/deploying_openshift_container_storage_using_bare_metal_infrastructure/deploy-using-local-storage-devices-bm[Deploying OpenShift Container Storage using local storage devices]
- [2] https://github.com/centosci/ocp4-docs/blob/master/sops/localstorage/installation.md[CentOS CI local storage installation SOP]
=== Installation
== Installation
For installation instructions visit the official docs at: [1]. The CentOS CI SOP at [2] also has more context but it is now slightly dated.
- From the webconsole, click on the `Operators` option, then `OperatorHub`
@ -17,7 +17,7 @@ For installation instructions visit the official docs at: [1]. The CentOS CI SOP
- Update approval set to automatic
- Click install
=== Configuration
== Configuration
A prerequisite for this step is to have all volumes on the nodes already formatted and available. This can be done via a machineconfig/ignition file at installation time, or alternatively by SSHing onto the boxes and manually creating/formatting the volumes.
- Create a `LocalVolumeDiscovery` and configure it to target the disks on all nodes, as sketched below
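A minimal sketch of such an object, applied from `os-control01` (the namespace, name and hostnames are assumptions/placeholders based on the upstream docs [1]):
----
cat <<EOF | oc apply -f -
apiVersion: local.storage.openshift.io/v1alpha1
kind: LocalVolumeDiscovery
metadata:
  name: auto-discover-devices
  namespace: openshift-local-storage
spec:
  nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
              - <worker-node-1>
              - <worker-node-2>
EOF
----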

View file

@ -1,12 +1,12 @@
== SOP Configure oauth Authentication via IPA/Noggin
= SOP Configure oauth Authentication via IPA/Noggin
=== Resources
== Resources
- [1] https://pagure.io/fedora-infra/ansible/blob/main/f/files/communishift/objects[Example Config from Communishift]
=== OIDC Setup
== OIDC Setup
The first step is to request that a secret be created for this environment; please open a ticket with Fedora Infra. Once the secret has been made available, we can add it to an Openshift Secret in the cluster like so:
----
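# Hypothetical sketch: the secret name and namespace below are placeholders; see the
# Communishift objects in [1] for the real layout. The OIDC client secret provided by
# Fedora Infra is stored where the OAuth configuration can reference it:
oc create secret generic fedora-idp-client-secret \
    --from-literal=clientSecret=<SECRET VALUE HERE> \
    -n openshift-config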

View file

@ -1,12 +1,12 @@
== Configure the Openshift Container Storage Operator
= Configure the Openshift Container Storage Operator
=== Resources
== Resources
- [1] https://docs.openshift.com/container-platform/4.8/storage/persistent_storage/persistent-storage-ocs.html[Official Docs]
- [2] https://github.com/red-hat-storage/ocs-operator[Github]
=== Installation
== Installation
Important: before following this SOP, please ensure that you have already followed the SOP to install the Local Storage Operator, as it is a requirement for the OCS operator.
For full detailed instructions please refer to the official docs at: [1]. For general instructions see below:
@ -22,7 +22,7 @@ For full detailed instructions please refer to the official docs at: [1]. For ge
- Click install
=== Configuration
== Configuration
When the operator has finished installing we can continue; please ensure that a minimum of 3 nodes are available.
- A `StorageCluster` is required to complete this installation; click Create StorageCluster. A CLI check is sketched below.
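Assuming the default `openshift-storage` namespace, progress can be checked like so:
----
# Watch the StorageCluster come up and the OCS/Ceph pods reach Running
oc get storagecluster -n openshift-storage
oc get pods -n openshift-storage
----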

View file

@ -1,11 +1,11 @@
== Installation of the Openshift Virtualisation Operator
= Installation of the Openshift Virtualisation Operator
=== Resources
== Resources
- [1] https://alt.fedoraproject.org/cloud/[Fedora Images]
- [2] https://github.com/kubevirt/kubevirt/blob/main/docs/container-register-disks.md[Kubevirt Importing Containers of VMI Images]
=== Installation
== Installation
From the web console, choose the `Operators` menu, and choose `OperatorHub`.
Search for `Openshift Virtualization`
@ -17,7 +17,7 @@ When the installation of the Operator is completed, create a `HyperConverged` ob
Next create a `HostPathProvisioner` object; the default options should be fine, click next through the menus.
=== Verification
== Verification
To verify that the installation of the Operator is successful, we can attempt to create a VM.
From [1], download the Fedora 34 `Cloud Base image for Openstack` image in `qcow2` format locally.
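One possible way to get the image into the cluster is with `virtctl` (a sketch; it assumes the `virtctl` client and the CDI upload proxy are available, and the DataVolume name is a placeholder):
----
# Upload the downloaded qcow2 image into a DataVolume, then build a VM from it
virtctl image-upload dv fedora34-cloud \
    --size=10Gi \
    --image-path=./Fedora-Cloud-Base-34.x86_64.qcow2 \
    --insecure
----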

View file

@ -1,12 +1,12 @@
== Enable User Workload Monitoring Stack
= Enable User Workload Monitoring Stack
=== Resources
== Resources
- [1] https://docs.openshift.com/container-platform/4.8/monitoring/enabling-monitoring-for-user-defined-projects.html[Official Docs]
- [2] https://docs.openshift.com/container-platform/4.8/monitoring/enabling-monitoring-for-user-defined-projects.html#granting-users-permission-to-monitor-user-defined-projects_enabling-monitoring-for-user-defined-projects[Providing Access to the UWMS features]
- [3] https://docs.openshift.com/container-platform/4.8/monitoring/enabling-monitoring-for-user-defined-projects.html#granting-user-permissions-using-the-web-console_enabling-monitoring-for-user-defined-projects[Providing Access to the UWMS dashboard]
- [4] https://docs.openshift.com/container-platform/4.8/monitoring/configuring-the-monitoring-stack.html#configuring-persistent-storage[Configure Monitoring Stack]
=== Configuration
== Configuration
To enable the stack, edit the `cluster-monitoring` ConfigMap like so:
----
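# Sketch based on the upstream docs [1]; note the documented ConfigMap is named
# cluster-monitoring-config and lives in the openshift-monitoring namespace.
oc -n openshift-monitoring edit configmap cluster-monitoring-config
# ...and ensure data.config.yaml contains:
#   enableUserWorkload: true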

View file

@ -1,10 +1,10 @@
== Cordoning Nodes and Draining Pods
= Cordoning Nodes and Draining Pods
This SOP should be followed in the following scenarios:
- If maintenance is scheduled to be carried out on an Openshift node.
=== Steps
== Steps
1. Connect to the `os-control01` host associated with this ENV. Become root: `sudo su -`.
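For reference, cordoning and draining a single node generally looks like the following (the node name is a placeholder):
----
# Mark the node unschedulable, then evict its pods
oc adm cordon <node>
oc adm drain <node> --ignore-daemonsets --delete-emptydir-data

# After maintenance, allow scheduling again
oc adm uncordon <node>
----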
@ -51,6 +51,6 @@ for node in ${nodes[@]}; do oc adm uncordon $node; done
----
=== Resources
== Resources
- [1] https://docs.openshift.com/container-platform/4.8/nodes/nodes/nodes-nodes-working.html[Nodes - working with nodes]

View file

@ -1,11 +1,11 @@
== Create MachineConfigs to Configure RHCOS
= Create MachineConfigs to Configure RHCOS
=== Resources
== Resources
- [1] https://coreos.github.io/butane/getting-started/[Butane Getting Started]
- [2] https://docs.openshift.com/container-platform/4.8/post_installation_configuration/machine-configuration-tasks.html#installation-special-config-chrony_post-install-machine-configuration-tasks[OCP4 Post Installation Configuration]
=== Butane
== Butane
"Butane (formerly the Fedora CoreOS Config Transpiler) is a tool that consumes a Butane Config and produces an Ignition Config, which is a JSON document that can be given to a Fedora CoreOS machine when it first boots." [1]
Butane is available as a container image; we can pull the latest version locally like so:
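A short sketch of pulling and running Butane via podman (the config filenames are placeholders; see [1] for the authoritative instructions):
----
podman pull quay.io/coreos/butane:release

# Transpile a Butane config into an Ignition/MachineConfig file
podman run --rm --interactive quay.io/coreos/butane:release \
    --pretty --strict < my-config.bu > my-config.ign
----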

View file

@ -1,11 +1,11 @@
== SOP Disable `self-provisioners` Role
= SOP Disable `self-provisioners` Role
=== Resources
== Resources
- [1] https://docs.openshift.com/container-platform/4.4/applications/projects/configuring-project-creation.html#disabling-project-self-provisioning_configuring-project-creation[Disabling project self-provisioning]
=== Disabling self-provisioners role
== Disabling self-provisioners role
By default, when a user authenticates with Openshift via Oauth, they are part of the `self-provisioners` group. This group provides the ability to create new projects. On CentOS CI we do not want users to be able to create their own projects, as we have a system in place where we create a project and control the administrators of that project.
To disable the self-provisioner role, do the following as outlined in the documentation [1].
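A sketch of the commands from the referenced documentation [1]:
----
# Remove the self-provisioners role from authenticated users
oc patch clusterrolebinding.rbac self-provisioners -p '{"subjects": null}'

# Stop the cluster from automatically restoring the default subjects
oc patch clusterrolebinding.rbac self-provisioners -p '{"metadata": {"annotations": {"rbac.authorization.kubernetes.io/autoupdate": "false"}}}'
----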

View file

@ -1,14 +1,14 @@
== Create etcd backup
= Create etcd backup
This SOP should be followed in the following scenarios:
- When the need exists to create an etcd backup.
- When shutting a cluster down gracefully.
=== Resources
== Resources
- [1] https://docs.openshift.com/container-platform/4.8/backup_and_restore/backing-up-etcd.html[Creating an etcd backup]
=== Take etcd backup
== Take etcd backup
1. Connect to the `os-control01` node associated with the ENV.
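The remaining steps broadly follow [1]; a condensed sketch (the node name is a placeholder):
----
# Open a debug shell on one of the control plane nodes
oc debug node/<controlplane-node>
chroot /host

# Run the backup script; the snapshot and resources are written to the target dir
sudo /usr/local/bin/cluster-backup.sh /home/core/assets/backup
----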

View file

@ -1,9 +1,9 @@
== Graceful Shutdown of an Openshift 4 Cluster
= Graceful Shutdown of an Openshift 4 Cluster
This SOP should be followed in the following scenarios:
- Graceful full shut down of the Openshift 4 cluster is required.
=== Steps
== Steps
Prerequisite steps:
- Follow the SOP for cordoning and draining the nodes.
@ -25,6 +25,6 @@ for node in ${nodes[@]}; do ssh -i /root/ocp4/ocp-<ENV>/ssh/id_rsa core@$node su
----
==== Resources
=== Resources
- [1] https://docs.openshift.com/container-platform/4.5/backup_and_restore/graceful-cluster-shutdown.html[Graceful Cluster Shutdown]

View file

@ -1,13 +1,13 @@
== Graceful Startup of an Openshift 4 Cluster
= Graceful Startup of an Openshift 4 Cluster
This SOP should be followed in the following scenarios:
- Graceful start up of an Openshift 4 cluster.
=== Steps
== Steps
Prerequisite steps:
==== Start the VM Control Plane instances
=== Start the VM Control Plane instances
Ensure that the control plane instances start first.
----
@ -15,7 +15,7 @@ Ensure that the control plane instances start first.
----
==== Start the physical nodes
=== Start the physical nodes
To connect to `idrac`, you must be connected to the Red Hat VPN. Next, find the management IP associated with each node.
On the `batcave01` instance, in the DNS configuration, the following bare metal machines make up the production/staging OCP4 worker nodes.
@ -31,7 +31,7 @@ oshift-dell06 IN A 10.3.160.185 # worker03 staging
Log in to the `idrac` interface that corresponds to each worker, one at a time. Ensure the node is set to boot from the hard drive, then power it on.
==== Once the nodes have been started they must be uncordoned if appropriate
=== Once the nodes have been started they must be uncordoned if appropriate
----
oc get nodes
@ -82,7 +82,7 @@ kempty-n9.ci.centos.org Ready worker 106d v1.18.3+6c42de8
----
=== Resources
== Resources
- [1] https://docs.openshift.com/container-platform/4.5/backup_and_restore/graceful-cluster-restart.html[Graceful Cluster Startup]
- [2] https://docs.openshift.com/container-platform/4.5/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html#dr-restoring-cluster-state[Cluster disaster recovery]

View file

@ -1,17 +1,17 @@
== SOP Installation/Configuration of OCP4 on Fedora Infra
= SOP Installation/Configuration of OCP4 on Fedora Infra
=== Resources
== Resources
- [1] https://docs.openshift.com/container-platform/4.8/installing/installing_bare_metal/[Official OCP4 Installation Documentation]
=== Install
== Install
To install OCP4 on Fedora Infra, one must be a member of the following groups:
- `sysadmin-openshift`
- `sysadmin-noc`
==== Prerequisites
=== Prerequisites
Visit the https://console.redhat.com/openshift/install/metal/user-provisioned[OpenShift Console] and download the following OpenShift tools:
* A Red Hat Access account is required
@ -22,7 +22,7 @@ Visit the https://console.redhat.com/openshift/install/metal/user-provisioned[Op
* Take a copy of your pull secret file; you will need to put this in the `install-config.yaml` file in the next step.
==== Generate install-config.yaml file
=== Generate install-config.yaml file
We must create an `install-config.yaml` file; use the following example for inspiration, or alternatively refer to the documentation [1] for more detailed information/explanations.
----
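# Abbreviated, hypothetical sketch of a bare-metal UPI install-config.yaml;
# every value below is a placeholder, not the real Fedora Infra configuration.
apiVersion: v1
baseDomain: example.fedoraproject.org
compute:
- name: worker
  replicas: 0
controlPlane:
  name: master
  replicas: 3
metadata:
  name: ocp
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  none: {}
pullSecret: 'PUT PULL SECRET HERE'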
@ -62,10 +62,10 @@ sshKey: 'PUT SSH PUBLIC KEY HERE kubeadmin@core'
* Take a backup of the `install-config.yaml` to `install-config.yaml.bak`, as running the next steps consumes this file, having a backup allows you to recover from mistakes quickly.
==== Create the Installation Files
=== Create the Installation Files
Using the `openshift-install` tool we can generate the installation files. Make sure that the `install-config.yaml` file is in the `/path/to/ocp4-<ENV>` location before attempting the next steps.
===== Create the Manifest Files
==== Create the Manifest Files
The manifest files are human readable; at this stage you can add any customisations required before the installation begins.
* Create the manifests: `openshift-install create manifests --dir=/path/to/ocp4-<ENV>`
@ -73,7 +73,7 @@ The manifest files are human readable, at this stage you can put any customisati
* The following step should be performed at this point: edit `/path/to/ocp4-<ENV>/manifests/cluster-scheduler-02-config.yml` and change the `mastersSchedulable` value to `false`.
===== Create the Ignition Files
==== Create the Ignition Files
The ignition files are generated from the manifests and MachineConfig files to produce the final installation files for the three roles: `bootstrap`, `master`, `worker`. In Fedora we prefer not to use the term `master` here, so we have renamed this role to `controlplane`.
* Create the ignition files: `openshift-install create ignition-configs --dir=/path/to/ocp4-<ENV>`
@ -82,7 +82,7 @@ The ignition files have been generated from the manifests and MachineConfig file
* A directory has been created, `auth`. This contains two files: `kubeadmin-password` and `kubeconfig`. These allow `cluster-admin` access to the cluster.
==== Copy the Ignition files to the `batcave01` server
=== Copy the Ignition files to the `batcave01` server
On the `batcave01` at the following location: `/srv/web/infra/bigfiles/openshiftboot/`:
* Create a directory to match the environment: `mkdir /srv/web/infra/bigfiles/openshiftboot/ocp4-<ENV>`
@ -102,19 +102,19 @@ On the `batcave01` at the following location: `/srv/web/infra/bigfiles/openshift
----
==== Update the ansible inventory
=== Update the ansible inventory
The Ansible inventory/hostvars/group vars should be updated with the new hosts' information.
For inspiration see the following https://pagure.io/fedora-infra/ansible/pull-request/765[PR] where we added the ocp4 production changes.
==== Update the DNS/DHCP configuration
=== Update the DNS/DHCP configuration
The DNS and DHCP configuration must also be updated. This https://pagure.io/fedora-infra/ansible/pull-request/765[PR] contains the necessary DHCP changes for prod and can be done in Ansible.
However, the DNS changes may only be performed by `sysadmin-main`. For this reason, any DNS changes must go via a patch snippet which is emailed to the `infrastructure@lists.fedoraproject.org` mailing list for review and approval. This process may take several days.
==== Generate the TLS Certs for the new environment
=== Generate the TLS Certs for the new environment
This is beyond the scope of this SOP; the best option is to create a ticket for Fedora Infra to request that these certs are created and made available for use. The following certs should be available:
- `*.apps.<ENV>.fedoraproject.org`
@ -122,14 +122,14 @@ This is beyond the scope of this SOP, the best option is to create a ticket for
- `api-int.<ENV>.fedoraproject.org`
==== Run the Playbooks
=== Run the Playbooks
A number of playbooks must be run. Once all the previous steps have been completed, we can run these playbooks from the `batcave01` instance.
- `sudo rbac-playbook groups/noc.yml -t 'tftp_server,dhcp_server'`
- `sudo rbac-playbook groups/proxies.yml -t 'haproxy,httpd,iptables'`
===== Baremetal / VMs
==== Baremetal / VMs
Depending on whether some of the nodes are VMs or baremetal, different tags should be supplied to the following playbook. If the entire cluster is baremetal you can skip the `kvm_deploy` tag entirely.
If VMs are used for some of the roles, make sure to leave it in.
@ -137,7 +137,7 @@ If there are VMs used for some of the roles, make sure to leave it in.
- `sudo rbac-playbook manual/ocp4-place-ignitionfiles.yml -t "ignition,repo,kvm_deploy"`
===== Baremetal
==== Baremetal
At this point we can switch on the baremetal nodes and begin the PXE/UEFI boot process. Via DHCP/DNS, the baremetal nodes should have the configuration necessary to reach the `noc01.iad2.fedoraproject.org` server and retrieve the UEFI boot configuration via PXE.
Once booted up, you should visit the management console for this node, and manually choose the UEFI configuration appropriate for its role.
@ -149,7 +149,7 @@ The system will then become autonomous, it will install and potentially reboot m
Eventually you will be presented with an SSH login prompt, which should show the correct hostname (eg: `ocp01`) matching what is in the DNS configuration.
==== Bootstrapping completed
=== Bootstrapping completed
When the control plane is up, we should see all controlplane instances available in the appropriate haproxy dashboard. eg: https://admin.fedoraproject.org/haproxy/proxy01=ocp-masters-backend-kapi[haproxy].
At this time we should take the `bootstrap` instance out of the haproxy load balancer.
@ -158,16 +158,16 @@ At this time we should take the `bootstrap` instance out of the haproxy load bal
- Once merged, run the following playbook once more: `sudo rbac-playbook groups/proxies.yml -t 'haproxy'`
==== Begin installation of the worker nodes
=== Begin installation of the worker nodes
Follow the same processes listed in the Baremetal section above to switch on the worker nodes and begin installation.
==== Configure the `os-control01` to authenticate with the new OCP4 cluster
=== Configure the `os-control01` to authenticate with the new OCP4 cluster
Copy the `kubeconfig` to `~root/.kube/config` on the `os-control01` instance.
This will allow the `root` user to automatically be authenticated to the new OCP4 cluster with `cluster-admin` privileges.
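A minimal sketch, with paths assumed from the earlier steps:
----
# From the machine holding the installation files
scp /path/to/ocp4-<ENV>/auth/kubeconfig root@os-control01:/root/.kube/config

# On os-control01, confirm cluster-admin access
oc whoami
oc get nodes
----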
==== Accept Node CSR Certs
=== Accept Node CSR Certs
To accept the worker/compute nodes into the cluster we need to accept their CSR certs.
List the CSR certs. The ones we're interested in will show as pending:
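A sketch of listing and approving the pending CSRs (the bulk-approve one-liner is a common convenience; use it with care):
----
oc get csr

# Approve an individual CSR
oc adm certificate approve <csr-name>

# Or approve everything currently pending
oc get csr -o name | xargs oc adm certificate approve
----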
@ -200,7 +200,7 @@ worker05.ocp.stg.iad2.fedoraproject.org Ready worker 34d v1.21.1+980738
At this point the cluster is basically up and running.
=== Follow on SOPs
== Follow on SOPs
Several other SOPs should be followed to perform the post-installation configuration on the cluster.
- xref:sop_configure_baremetal_pxe_uefi_boot.adoc[SOP Configure Baremetal PXE-UEFI Boot]

View file

@ -1,10 +1,10 @@
== SOP Retrieve OCP4 Cluster CACERT
= SOP Retrieve OCP4 Cluster CACERT
=== Resources
== Resources
- [1] https://pagure.io/fedora-infra/ansible/blob/main/f/roles/dhcp_server[Ansible Role DHCP Server]
=== Retrieve CACERT
== Retrieve CACERT
In Fedora Infra, we have Apache terminating TLS for the cluster. Connections to the API and the machineconfig server are handled by haproxy. To prevent TLS errors we must configure haproxy with the OCP4 Cluster CA Cert.
Once the cluster control plane has been installed, this can be retrieved from the `os-control01` node like so:
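One possible way is to pull the CA out of the installer-generated kubeconfig (a sketch, assuming the kubeconfig is configured for `root` as described in the installation SOP):
----
# Extract the cluster CA certificate from the kubeconfig
oc config view --raw \
    -o jsonpath='{.clusters[0].cluster.certificate-authority-data}' | base64 -d > ocp4-ca.crt
----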

View file

@ -1,7 +1,7 @@
== Upgrade OCP4 Cluster
= Upgrade OCP4 Cluster
Please see the official documentation for more information [1][3]; this SOP can be used as a rough guide.
=== Resources
== Resources
- [1] https://docs.openshift.com/container-platform/4.8/updating/updating-cluster-between-minor.html[Upgrading OCP4 Cluster Between Minor Versions]
- [2] xref:sop_etcd_backup.adoc[SOP Create etcd backup]
@ -10,14 +10,14 @@ Please see the official documentation for more information [1][3], this SOP can
- [5] https://docs.openshift.com/container-platform/4.8/operators/admin/olm-upgrading-operators.html#olm-upgrading-operators[Upgrading Operators Prior to Cluster Update]
- [6] https://access.redhat.com/downloads/content/290/ver=4.8/rhel---8/4.8.18/x86_64/packages[Openshift Clients RPM Download]
=== Prerequisites
== Prerequisites
- In case an upgrade fails, it is wise to first take an `etcd` backup. To do so, follow the SOP [2].
- Ensure that all installed Operators are at the latest versions for their channel [5].
- Ensure that the latest `oc` client rpm is available at `/srv/web/infra/bigfiles/openshiftboot/oc-client/` on the `batcave01` server. Retrieve the RPM from [6], choosing the `Openshift Clients Binary` rpm, and rename it to `oc-client.rpm`.
- Ensure that the `sudo rbac-playbook manual/ocp4-sysadmin-openshift.yml -t "upgrade-rpm"` playbook is run to install this updated oc client rpm.
=== Upgrade OCP
== Upgrade OCP
At the time of writing, the version installed on the cluster is `4.8.11` and the `upgrade channel` is set to `stable-4.8`. It is easiest to update the cluster via the web console (a CLI alternative is sketched after the following steps). Go to:
- Administration
@ -27,10 +27,10 @@ At the time of writing the version installed on the cluster is `4.8.11` and the
- When the upgrade has finished, switch back to the `upgrade channel` for stable.
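The same can be done from the CLI; a sketch (the target version is a placeholder):
----
# Show the current version, channel and available updates
oc adm upgrade

# Start the upgrade to a specific version (or use --to-latest=true)
oc adm upgrade --to=4.8.<z>
----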
=== Upgrade failures
== Upgrade failures
In the worst-case scenario we may have to restore etcd from the backups taken at the start [4], or reinstall a node entirely.
==== Troubleshooting
=== Troubleshooting
There are many possible ways an upgrade can fail midway through.
- Check the monitoring alerts currently firing; these can often hint towards the problem

View file

@ -1,4 +1,4 @@
== SOPs
= SOPs
- xref:sop_configure_baremetal_pxe_uefi_boot.adoc[SOP Configure Baremetal PXE-UEFI Boot]
- xref:sop_configure_image_registry_operator.adoc[SOP Configure the Image Registry Operator]