= SOP Installation/Configuration of OCP4 on Fedora Infra
== Resources
- [1]: https://docs.openshift.com/container-platform/4.8/installing/installing_bare_metal/[Official OCP4 Installation Documentation]
== Install
To install OCP4 on Fedora Infra, one must be a member of the following groups:
- `sysadmin-openshift`
- `sysadmin-noc`
=== Prerequisites
Visit the https://console.redhat.com/openshift/install/metal/user-provisioned[OpenShift Console] and download the following OpenShift tools:
* A Red Hat Access account is required
* `oc` client tools https://access.redhat.com/downloads/content/290/ver=4.8/rhel---8/4.8.10/x86_64/product-software[here]
* `openshift-install` installation tool https://access.redhat.com/downloads/content/290/ver=4.8/rhel---8/4.8.10/x86_64/product-software[here]
* Ensure the downloaded tools are available on the `PATH` (unpacking them is sketched after this list)
* A valid OCP4 subscription is required to complete the installation configuration; by default you have a 60 day trial.
* Take a copy of your pull secret file; you will need to put it in the `install-config.yaml` file in the next step.
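The client and installer archives can be unpacked anywhere on the `PATH`; a minimal sketch, assuming illustrative archive names (adjust to the files you actually downloaded):
----
# Unpack the client (oc, kubectl) and the installer onto the PATH
# (archive names are illustrative)
tar -xzf openshift-client-linux.tar.gz -C /usr/local/bin oc kubectl
tar -xzf openshift-install-linux.tar.gz -C /usr/local/bin openshift-install

# Verify both tools are found
oc version --client
openshift-install version
----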
=== Generate install-config.yaml file
We must create an `install-config.yaml` file. Use the following example for inspiration, or refer to the documentation[1] for more detailed information and explanations.
----
apiVersion: v1
baseDomain: stg.fedoraproject.org
compute:
- hyperthreading: Enabled
  name: worker
  replicas: 0
controlPlane:
  hyperthreading: Enabled
  name: master
  replicas: 3
metadata:
  name: 'ocp'
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  none: {}
fips: false
pullSecret: 'PUT PULL SECRET HERE'
sshKey: 'PUT SSH PUBLIC KEY HERE kubeadmin@core'
----
* Login to the `os-control01` corresponding with the environment
* Make a directory to hold the installation files: `mkdir ocp4-<ENV>`
* Enter this newly created directory: `cd ocp4-<ENV>`
* Generate a fresh SSH keypair: `ssh-keygen -f ./ocp4-<ENV>-ssh`
* Create an `ssh` directory and place this keypair in it.
* Put the contents of the public key in the `sshKey` value in the `install-config.yaml` file
* Put the contents of your Pull Secret in the `pullSecret` value in the `install-config.yaml`
* Take a backup of the `install-config.yaml` to `install-config.yaml.bak`; running the next steps consumes this file, and having a backup allows you to recover from mistakes quickly. The whole sequence is sketched after this list.
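A consolidated sketch of the steps above (the `<ENV>` placeholder and paths are illustrative):
----
mkdir ocp4-<ENV> && cd ocp4-<ENV>

# Generate a fresh SSH keypair and keep it in an ssh/ subdirectory
mkdir ssh
ssh-keygen -f ./ssh/ocp4-<ENV>-ssh

# Edit install-config.yaml: paste the public key into sshKey and your
# pull secret into pullSecret, then keep a backup copy
cp install-config.yaml install-config.yaml.bak
----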
=== Create the Installation Files
Using the `openshift-install` tool we can generate the installation files. Make sure that the `install-config.yaml` file is in the `/path/to/ocp4-<ENV>` location before attempting the next steps.
==== Create the Manifest Files
The manifest files are human readable; at this stage you can add any customisations required before the installation begins.
* Create the manifests: `openshift-install create manifests --dir=/path/to/ocp4-<ENV>`
* All configuration of RHCOS must be done via MachineConfigs. If there is configuration which is known to be required, such as NTP, copy the corresponding MachineConfigs into the `/path/to/ocp4-<ENV>/openshift` directory now.
* At this point, edit `/path/to/ocp4-<ENV>/manifests/cluster-scheduler-02-config.yml` and change the `mastersSchedulable` value to `false`. These steps are sketched below.
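A sketch of the manifest stage, combining the bullets above (paths are illustrative; the `sed` edit assumes `mastersSchedulable` is currently `true`):
----
openshift-install create manifests --dir=/path/to/ocp4-<ENV>

# Copy any pre-prepared MachineConfigs (NTP, etc.) into the openshift/ directory
cp /path/to/my-machineconfigs/*.yaml /path/to/ocp4-<ENV>/openshift/

# Keep the control plane unschedulable
sed -i 's/mastersSchedulable: true/mastersSchedulable: false/' \
    /path/to/ocp4-<ENV>/manifests/cluster-scheduler-02-config.yml
----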
==== Create the Ignition Files
The ignition files are generated from the manifests and MachineConfig files and are the final installation files for the three roles: `bootstrap`, `master` and `worker`. In Fedora we prefer not to use the term `master`, so we rename this role to `controlplane`.
* Create the ignition files: `openshift-install create ignition-configs --dir=/path/to/ocp4-<ENV>`
* At this point you should have the following three files: `bootstrap.ign`, `master.ign` and `worker.ign`.
* Rename the `master.ign` to `controlplane.ign`.
* An `auth` directory has also been created; it contains two files, `kubeadmin-password` and `kubeconfig`, which allow `cluster-admin` access to the cluster. The full sequence is sketched below.
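A sketch of the ignition stage described above (paths are illustrative):
----
openshift-install create ignition-configs --dir=/path/to/ocp4-<ENV>

# Rename the master role to controlplane, per Fedora convention
mv /path/to/ocp4-<ENV>/master.ign /path/to/ocp4-<ENV>/controlplane.ign

ls /path/to/ocp4-<ENV>
# auth/  bootstrap.ign  controlplane.ign  worker.ign  ...
----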
=== Copy the Ignition files to the `batcave01` server
On `batcave01`, at `/srv/web/infra/bigfiles/openshiftboot/`:
* Create a directory to match the environment: `mkdir /srv/web/infra/bigfiles/openshiftboot/ocp4-<ENV>`
* Copy the ignition files, the ssh files and the auth files generated in previous steps to this newly created directory (a copy sketch follows the listing below). Users with `sysadmin-openshift` should have the necessary permissions to write to this location.
* When this is complete, it should look like the following:
----
├── <ENV>
│   ├── auth
│   │   ├── kubeadmin-password
│   │   └── kubeconfig
│   ├── bootstrap.ign
│   ├── controlplane.ign
│   ├── ssh
│   │   ├── id_rsa
│   │   └── id_rsa.pub
│   └── worker.ign
----
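A minimal copy sketch, run from the installation directory on `os-control01` (the `batcave01` FQDN and the exact copy mechanism are illustrative; anything that reproduces the layout above will do):
----
# Assumes the target directory was already created on batcave01
scp -r auth ssh bootstrap.ign controlplane.ign worker.ign \
    batcave01.iad2.fedoraproject.org:/srv/web/infra/bigfiles/openshiftboot/ocp4-<ENV>/
----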
=== Update the ansible inventory
The Ansible inventory, host vars and group vars should be updated with the new hosts' information.
For inspiration, see the following https://pagure.io/fedora-infra/ansible/pull-request/765[PR] where we added the ocp4 production changes.
=== Update the DNS/DHCP configuration
The DNS and DHCP configuration must also be updated. This https://pagure.io/fedora-infra/ansible/pull-request/765[PR] contains the necessary DHCP changes for prod, which can be made in Ansible.
However, the DNS changes may only be performed by `sysadmin-main`. For this reason, any DNS changes must go via a patch snippet which is emailed to the `infrastructure@lists.fedoraproject.org` mailing list for review and approval. This process may take several days.
=== Generate the TLS Certs for the new environment
This is beyond the scope of this SOP; the best option is to create a ticket for Fedora Infra requesting that these certs be created and made available for use. The following certs should be available:
- `*.apps.<ENV>.fedoraproject.org`
- `api.<ENV>.fedoraproject.org`
- `api-int.<ENV>.fedoraproject.org`
=== Run the Playbooks
A number of playbooks must be run. Once all the previous steps have been completed, run these playbooks from the `batcave01` instance.
- `sudo rbac-playbook groups/noc.yml -t 'tftp_server,dhcp_server'`
- `sudo rbac-playbook groups/proxies.yml -t 'haproxy,httpd,iptables'`
==== Baremetal / VMs
Depending on whether some of the nodes are VMs or baremetal, different tags should be supplied to the following playbook. If the entire cluster is baremetal, you can skip the `kvm_deploy` tag entirely.
If VMs are used for some of the roles, make sure to leave it in.
- `sudo rbac-playbook manual/ocp4-place-ignitionfiles.yml -t "ignition,repo,kvm_deploy"`
==== Baremetal
At this point we can switch on the baremetal nodes and begin the PXE/UEFI boot process. Via DHCP/DNS, the baremetal nodes should have the configuration necessary to reach out to the `noc01.iad2.fedoraproject.org` server and retrieve the UEFI boot configuration via PXE.
Once a node has booted, visit its management console and manually choose the UEFI configuration appropriate to its role.
The node will begin booting, and during the boot process it will reach out to the `os-control01` instance specific to the `<ENV>` to retrieve the ignition file appropriate to its role.
The system then becomes autonomous: it will install and potentially reboot multiple times as updates are retrieved and applied.
Eventually you will be presented with an SSH login prompt showing the correct hostname, e.g. `ocp01`, matching the DNS configuration.
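If you want to follow a node's progress before it joins the cluster, you can SSH to it from `os-control01` with the key generated earlier and watch the kubelet and CRI-O units; a sketch with an illustrative hostname:
----
# Follow a booting node's progress (hostname is illustrative)
ssh -i ./ssh/ocp4-<ENV>-ssh core@ocp01.ocp.<ENV>.iad2.fedoraproject.org \
    "journalctl -b -f -u kubelet.service -u crio.service"
----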
=== Bootstrapping completed
When the control plane is up, we should see all controlplane instances available in the appropriate haproxy dashboard, e.g. https://admin.fedoraproject.org/haproxy/proxy01=ocp-masters-backend-kapi[haproxy].
At this time we should take the `bootstrap` instance out of the haproxy load balancer (a way to confirm that bootstrapping is complete is sketched after the steps below).
- Make the necessary changes to Ansible at `ansible/roles/haproxy/templates/haproxy.cfg`
- Once merged, run the following playbook once more: `sudo rbac-playbook groups/proxies.yml -t 'haproxy'`
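If the installation directory is still available on `os-control01`, the installer itself can confirm that bootstrapping is complete before the bootstrap instance is removed (a minimal sketch, assuming the directory used earlier):
----
openshift-install wait-for bootstrap-complete --dir=/path/to/ocp4-<ENV> --log-level=info
----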
=== Begin installation of the worker nodes
Follow the same processes listed in the Baremetal section above to switch on the worker nodes and begin installation.
=== Configure the `os-control01` to authenticate with the new OCP4 cluster
Copy the `kubeconfig` to `~root/.kube/config` on the `os-control01` instance.
This will allow the `root` user to automatically be authenticated to the new OCP4 cluster with `cluster-admin` privileges.
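A minimal sketch of that step (paths assume the installation directory used earlier):
----
# Install the kubeconfig for root on os-control01
mkdir -p /root/.kube
cp /path/to/ocp4-<ENV>/auth/kubeconfig /root/.kube/config
chmod 0600 /root/.kube/config

# Should report system:admin
oc whoami
----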
=== Accept Node CSR Certs
To accept the worker/compute nodes into the cluster, we need to approve their CSRs.
List the CSRs; the ones we're interested in will show as `Pending`:
----
oc get csr
----
To approve all pending OCP4 node CSRs in one line, do the following:
----
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
----
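New CSRs appear in waves as each node registers, so the approval usually needs to be repeated; a sketch that loops until nothing is left pending:
----
# Repeat the approval while any CSR is still Pending
while oc get csr --no-headers | grep -q Pending; do
    oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
        | xargs -r oc adm certificate approve
    sleep 30
done
----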
Once the nodes have been accepted, `oc get nodes` should look something like this:
----
[root@os-control01 ocp4][STG]= oc get nodes
NAME                                      STATUS   ROLES    AGE   VERSION
ocp01.ocp.stg.iad2.fedoraproject.org      Ready    master   34d   v1.21.1+9807387
ocp02.ocp.stg.iad2.fedoraproject.org      Ready    master   34d   v1.21.1+9807387
ocp03.ocp.stg.iad2.fedoraproject.org      Ready    master   34d   v1.21.1+9807387
worker01.ocp.stg.iad2.fedoraproject.org   Ready    worker   21d   v1.21.1+9807387
worker02.ocp.stg.iad2.fedoraproject.org   Ready    worker   20d   v1.21.1+9807387
worker03.ocp.stg.iad2.fedoraproject.org   Ready    worker   20d   v1.21.1+9807387
worker04.ocp.stg.iad2.fedoraproject.org   Ready    worker   34d   v1.21.1+9807387
worker05.ocp.stg.iad2.fedoraproject.org   Ready    worker   34d   v1.21.1+9807387
----
At this point the cluster is basically up and running.
== Follow on SOPs
Several other SOPs should be followed to perform the post installation configuration on the cluster.
- xref:sop_configure_baremetal_pxe_uefi_boot.adoc[SOP Configure Baremetal PXE-UEFI Boot]
- xref:sop_create_machineconfigs.adoc[SOP Create MachineConfigs to Configure RHCOS]
- xref:sop_retrieve_ocp4_cacert.adoc[SOP Retrieve OCP4 CACERT]
- xref:sop_configure_image_registry_operator.adoc[SOP Configure the Image Registry Operator]
- xref:sop_disable_provisioners_role.adoc[SOP Disable the Provisioners Role]
- xref:sop_configure_oauth_ipa.adoc[SOP Configure oauth Authentication via IPA/Noggin]
- xref:sop_configure_local_storage_operator.adoc[SOP Configure the Local Storage Operator]
- xref:sop_configure_openshift_container_storage.adoc[SOP Configure the Openshift Container Storage Operator]
- xref:sop_configure_userworkload_monitoring_stack.adoc[SOP Configure the Userworkload Monitoring Stack]