Nils Philippsen b4afb2f945 DC move: iad => rdu3, 10.3. => 10.16.
And remove some obsolete things.

Signed-off-by: Nils Philippsen <nils@redhat.com>
2025-07-04 16:32:42 +02:00


= Replacing a Failed Hard Drive
:page-description: Steps for replacing a failed drive on a Fedora infrastructure server.
:page-aliases: replacing-failed-drive.adoc

== Overview
This document provides a step-by-step procedure for replacing a failed hard drive on a Fedora infrastructure server. It includes access requirements, necessary tools, and the process for initiating and completing the drive replacement.

== Contact Information
Owner::
Fedora Infrastructure Team
Contact::
#fedora-admin, sysadmin-main
Purpose::
Describe how to report and replace a failed hard drive on a Fedora infrastructure server

== Access Level
To perform this procedure, you may need sysadmin-main access. In the future, the required access details might be shared with a dedicated assignee or stored in a smaller vault. For now, ask the sysadmin-main team for the information you need.

== Requirements
* Red Hat VPN Access - Needed for SSH access to the machine.
* Bitwarden Vault Access - Access to the vault is under discussion. For now, consult the sysadmin-main team for the login credentials.

== Process
.Access the management console
. Ensure you are connected to the official Red Hat VPN.
. Identify the server in question. For this SOP, we will use `bvmhost-x86-01.stg.rdu3.fedoraproject.org` as an example.
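. To reach the management console, use the server's management hostname. In this example, `.mgmt` is inserted before the datacenter domain and the `.stg` label becomes a `-stg` suffix: `bvmhost-x86-01-stg.mgmt.rdu3.fedoraproject.org`.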
. Obtain the management console's IP address by pinging the management hostname from `batcave01`:
+
[source,bash]
----
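# Log in to batcave01 first, then run the ping from that shell.
ssh batcave01.rdu3.fedoraproject.org
# -c 1 sends a single probe; the reply line shows the resolved IP address.
ping -c 1 bvmhost-x86-01-stg.mgmt.rdu3.fedoraproject.org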
----
. Open the IP address in a web browser over HTTPS. The management console uses a self-signed certificate, so accept the browser's security warning to continue:
+
[source]
----
https://<IP_ADDRESS>
----
. Log in using the credentials in the `admin-stg` entry in Bitwarden.
. Navigate to the overview page and note the machine's serial number (service tag); you will need it when contacting support.

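
The hostname mapping used in the steps above can be sketched as a small shell helper. This is a minimal sketch, not an official tool: the function name `mgmt_host` is hypothetical, and the naming rule it applies (a `.stg` label becomes a `-stg` suffix on the short name, and `.mgmt` is placed in front of the datacenter domain) is inferred from the single `bvmhost-x86-01` example in this SOP. Confirm that the result actually resolves before relying on it.

[source,bash]
----
# Hypothetical helper: derive a management-console hostname from a server FQDN.
# ASSUMPTION: the rule below is inferred from one example in this SOP
# (bvmhost-x86-01.stg.rdu3... -> bvmhost-x86-01-stg.mgmt.rdu3...).
mgmt_host() {
  local fqdn="$1"
  local short="${fqdn%%.*}"   # first label, e.g. bvmhost-x86-01
  local rest="${fqdn#*.}"     # remaining labels, e.g. stg.rdu3.fedoraproject.org
  if [[ "$rest" == stg.* ]]; then
    # Staging hosts: move ".stg" into the short name as a "-stg" suffix.
    short="${short}-stg"
    rest="${rest#stg.}"
  fi
  # Insert ".mgmt" in front of the datacenter domain.
  printf '%s.mgmt.%s\n' "$short" "$rest"
}
----

For example, on `batcave01` you could run `getent hosts "$(mgmt_host bvmhost-x86-01.stg.rdu3.fedoraproject.org)"`, which prints the resolved address once instead of pinging.
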
=== Identify the Failed Drive
. Navigate to the storage menu to identify the failed drive. Warnings about failing or failed drives are shown here.
. Note the failed drive's details (e.g., drive 4).
. Create a failed-drive report by exporting the failed drive's information.

=== Create a Support Ticket
. In the management console, click on the support link in the top right corner.
. Follow these steps to contact technical support:
.. Go to the top left search bar and select "Support > Contact Technical Support".
.. Search for the device using the service tag from the overview page.
.. Select "HardDrive and RAID Controller" from the drop-down menu.
.. Choose one of the support options:
... Call: 24/7
... Live Chat: 7 am - 9 pm CDT, Monday - Friday
... Social Connect
. In the live chat, provide the failed-drive report. Once support verifies and confirms the failure, they will email the replacement details.
. If live chat is unsuccessful, call support at 1-866-362-5350 (available 24/7).

=== Follow-Up with the Support Ticket
. Once the support ticket is created, the assignee will receive a form via email.
. Forward this form to Patrick Cole (pcole@redhat.com) along with the machine's serial number and location.
+
[NOTE]
====
At this point, Patrick Cole will handle the coordination with Dell for the drive replacement. This avoids adding unnecessary intermediaries.
====
If needed, Patrick also arranges access for the Dell technician.

== Conclusion
Following this SOP ensures a systematic approach to replacing failed drives, minimizing downtime and maintaining system integrity. Always reach out to the sysadmin-main team for any clarifications or additional support.