From 71d0af5b2d3218a35902d8f02e6651f2502d2f87 Mon Sep 17 00:00:00 2001
From: Samyak Jain
Date: Mon, 27 May 2024 17:31:42 +0530
Subject: [PATCH] Adds SOP for Replacing Failed Hard Drives from machines

Signed-off-by: Samyak Jain
---
 modules/ROOT/nav.adoc                         |  1 +
 .../sysadmin_guide/pages/failedharddrive.adoc | 88 +++++++++++++++++++
 2 files changed, 89 insertions(+)
 create mode 100644 modules/sysadmin_guide/pages/failedharddrive.adoc

diff --git a/modules/ROOT/nav.adoc b/modules/ROOT/nav.adoc
index 50f947f..7027660 100644
--- a/modules/ROOT/nav.adoc
+++ b/modules/ROOT/nav.adoc
@@ -33,6 +33,7 @@
 ** xref:sysadmin_guide:index.adoc[Sysadmin Guide]
 *** xref:sysadmin_guide:orientation.adoc[Orientation for Sysadmin Guide]
 *** xref:sysadmin_guide:index.adoc#_standard_operating_procedures[Standard Operation Procedures]
+*** xref:sysadmin_guide:failedharddrive.adoc[Replacing Failed Hard Drives]
 *** xref:sysadmin_guide:index.adoc#_howtos[HOWTOs]
 * xref:release_guide:index.adoc[Release Engineering]
 ** xref:release_guide:release_process.adoc[Release process]
diff --git a/modules/sysadmin_guide/pages/failedharddrive.adoc b/modules/sysadmin_guide/pages/failedharddrive.adoc
new file mode 100644
index 0000000..33968c4
--- /dev/null
+++ b/modules/sysadmin_guide/pages/failedharddrive.adoc
@@ -0,0 +1,88 @@
+
+= Replacing a Failed Hard Drive
+:page-description: Steps for replacing a failed drive on a Fedora infrastructure server.
+:page-aliases: replacing-failed-drive.adoc
+
+== Overview
+
+This document provides a step-by-step procedure for replacing a failed hard drive on a Fedora infrastructure server. It includes access requirements, necessary tools, and the process for initiating and completing the drive replacement.
+
+== Contact Information
+
+Owner::
+  Fedora Infrastructure Team
+Contact::
+  #fedora-admin, sysadmin-main
+Purpose::
+  Describe how to replace a failed hard drive on a Fedora infrastructure server
+
+== Access Level
+
+To perform this procedure, you may need sysadmin-main access. In the future, access details might be shared with a dedicated assignee or stored in a smaller vault. For now, contact the sysadmin-main team for the information you need.
+
+== Requirements
+
+* Red Hat VPN Access - Needed for SSH access to the machine.
+* Bitwarden Vault Access - Access to the vault is under discussion. For now, consult the sysadmin-main team for the login credentials.
+
+== Process
+
+.First, access the management console:
+. Ensure you are connected to the official Red Hat VPN.
+. Identify the server in question. This SOP uses `bvmhost-x86-01.stg.iad2.fedoraproject.org` as an example.
+. To reach the management console, use the server's `.mgmt` hostname. For the example server, this is `bvmhost-x86-01-stg.mgmt.iad2.fedoraproject.org`.
+. Obtain the management console's IP address by pinging that hostname from `batcave01`:
++
+[source,bash]
+----
+ssh batcave01.iad2.fedoraproject.org
+ping bvmhost-x86-01-stg.mgmt.iad2.fedoraproject.org
+----
+
+. Visit the IP address in a web browser. The management console uses HTTPS with a self-signed certificate, so accept the certificate warning:
++
+[source]
+----
+https://
+----
+
+. Log in using the credentials found in the `admin-stg` entry in Bitwarden.
+
+. Navigate to the overview page to find the serial number/service tag of the machine.
+
+=== Identify the Failed Drive
+
+. Navigate to the storage menu to identify the failed drive. Warnings about failing or failed drives are shown there.
+. Note the failed drive's details (e.g., drive 4).
+. Create a failed drive report by exporting the failed drive's information.
+
+=== Create a Support Ticket
+
+. In the management console, click on the support link in the top right corner.
+. Follow these steps to contact technical support:
+.. Go to the top left search bar and select "Support > Contact Technical Support".
+.. Search for the device using the service tag from the overview page.
+.. Select "HardDrive and RAID Controller" from the drop-down menu.
+.. Choose one of the support options:
+... Call: 24/7
+... Live Chat: 7 am - 9 pm CDT, Monday - Friday
+... Social Connect
+
+. In the live chat, provide the failed drive report. Once support verifies and confirms the failure, they will send an email with the replacement details.
+. If live chat is unsuccessful, call support at 1-866-362-5350 (available 24/7).
+
+=== Follow-Up with the Support Ticket
+
+. Once the support ticket is created, the assignee will receive a form via email.
+. Forward this form to Patrick Cole (pcole@redhat.com) along with the machine's serial number and location.
++
+[NOTE]
+====
+From this point on, Patrick Cole handles the coordination with Dell for the drive replacement. This avoids adding unnecessary intermediaries.
+====
+
+Patrick's coordination with Dell includes arranging access for the technician if needed.
+
+== Conclusion
+
+Following this SOP gives a systematic approach to replacing failed drives, which helps minimize downtime and maintain system integrity. Contact the sysadmin-main team for clarification or additional support.
\ No newline at end of file