From 71d0af5b2d3218a35902d8f02e6651f2502d2f87 Mon Sep 17 00:00:00 2001
From: Samyak Jain <samyak.jn11@gmail.com>
Date: Mon, 27 May 2024 17:31:42 +0530
Subject: [PATCH] Adds SOP for Replacing Failed Hard Drives from machines

Signed-off-by: Samyak Jain <samyak.jn11@gmail.com>
---
 modules/ROOT/nav.adoc                         |  1 +
 .../sysadmin_guide/pages/failedharddrive.adoc | 88 +++++++++++++++++++
 2 files changed, 89 insertions(+)
 create mode 100644 modules/sysadmin_guide/pages/failedharddrive.adoc

diff --git a/modules/ROOT/nav.adoc b/modules/ROOT/nav.adoc
index 50f947f..7027660 100644
--- a/modules/ROOT/nav.adoc
+++ b/modules/ROOT/nav.adoc
@@ -33,6 +33,7 @@
 ** xref:sysadmin_guide:index.adoc[Sysadmin Guide]
 *** xref:sysadmin_guide:orientation.adoc[Orientation for Sysadmin Guide]
 *** xref:sysadmin_guide:index.adoc#_standard_operating_procedures[Standard Operation Procedures]
+*** xref:sysadmin_guide:failedharddrive.adoc[Replacing Failed Hard Drives]
 *** xref:sysadmin_guide:index.adoc#_howtos[HOWTOs]
 * xref:release_guide:index.adoc[Release Engineering]
 ** xref:release_guide:release_process.adoc[Release process]
diff --git a/modules/sysadmin_guide/pages/failedharddrive.adoc b/modules/sysadmin_guide/pages/failedharddrive.adoc
new file mode 100644
index 0000000..33968c4
--- /dev/null
+++ b/modules/sysadmin_guide/pages/failedharddrive.adoc
@@ -0,0 +1,88 @@
+
+= Replacing a Failed Hard Drive
+:page-description: Steps for replacing a failed drive on a Fedora infrastructure server.
+:page-aliases: replacing-failed-drive.adoc
+
+== Overview
+
+This document provides a step-by-step procedure for replacing a failed hard drive on a Fedora infrastructure server. It includes access requirements, necessary tools, and the process for initiating and completing the drive replacement.
+
+== Contact Information
+
+Owner::
+  Fedora Infrastructure Team
+Contact::
+  #fedora-admin, sysadmin-main
+Purpose::
+  Describe how to get a failed hard drive replaced on a Fedora infrastructure server
+
+== Access Level
+
+This procedure may require sysadmin-main access. In the future, access details might be shared with a dedicated assignee or stored in a smaller vault. For now, ask the sysadmin-main team for the information you need.
+
+== Requirements
+
+* Red Hat VPN Access - Needed for SSH access to the machine.
+* Bitwarden Vault Access - Access to the vault is under discussion. For now, consult the sysadmin-main team for the login credentials.
+
+== Process
+
+.Access the management console:
+. Ensure you are connected to the official Red Hat VPN.
+. Identify the server in question. For this SOP, we will use `bvmhost-x86-01.stg.iad2.fedoraproject.org` as an example.
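+. To access the management console, use the host's `.mgmt` hostname. For the example host this is `bvmhost-x86-01-stg.mgmt.iad2.fedoraproject.org` (note that `.stg` becomes `-stg` in front of `.mgmt`).
+. Obtain the management console's IP address by pinging that hostname from `batcave01`: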
++
+[source,bash]
+----
+ssh batcave01.iad2.fedoraproject.org
+ping bvmhost-x86-01-stg.mgmt.iad2.fedoraproject.org
+----
+
+. Visit the IP address in a web browser. The management console uses HTTPS with a self-signed certificate, so accept the certificate warning:
++
+[source]
+----
+https://<IP_ADDRESS>
+----
+
+. Log in using the credentials found in the `admin-stg` entry in Bitwarden.
+
+. Navigate to the overview page and note the serial number (service tag) of the machine.
+
+=== Identify the Failed Drive
+
+. Navigate to the storage menu to identify the failed drive. Warnings about failing or failed drives are shown there.
+. Note the failed drive's details (e.g., drive 4).
+. Create a failed-drive report by exporting the information for the failed drive.
+
+=== Create a Support Ticket
+
+. In the management console, click on the support link in the top right corner.
+. Follow these steps to contact technical support:
+.. Go to the top left search bar and select "Support > Contact Technical Support".
+.. Search for the device using the service tag from the overview page.
+.. Select "HardDrive and RAID Controller" from the drop-down menu.
+.. Choose one of the support options:
+... Call: 24/7
+... Live Chat: 7 am - 9 pm CDT, Monday - Friday
+... Social Connect
+
+. In live chat, provide the failed-drive report. Once support verifies and confirms the failure, they will send an email with the replacement details.
+. If live chat is unsuccessful, call support at 1-866-362-5350 (available 24/7).
+
+=== Follow-Up with the Support Ticket
+
+. Once the support ticket is created, the assignee will receive a form via email.
+. Forward this form to Patrick Cole (pcole@redhat.com) along with the machine's serial number and location.
++
+[NOTE]
+====
+From this point on, Patrick Cole is the single point of contact with Dell for the drive replacement. This avoids adding unnecessary intermediaries.
+====
+
+Patrick will then coordinate the replacement with Dell, including arranging access for the technician if needed.
+
+== Conclusion
+
+Following this SOP ensures a systematic approach to replacing failed drives, minimizing downtime and maintaining system integrity. Always reach out to the sysadmin-main team for any clarifications or additional support.
\ No newline at end of file