hardware troubleshooting SOP

Signed-off-by: David Kirwan <davidkirwanirl@gmail.com>
David Kirwan 2024-07-12 08:45:31 +01:00
parent 7ee63f7e5a
commit 8a82b94423
No known key found for this signature in database
GPG key ID: A5893AB6474AC37D
3 changed files with 108 additions and 19 deletions


@ -33,7 +33,6 @@
** xref:sysadmin_guide:index.adoc[Sysadmin Guide] ** xref:sysadmin_guide:index.adoc[Sysadmin Guide]
*** xref:sysadmin_guide:orientation.adoc[Orientation for Sysadmin Guide] *** xref:sysadmin_guide:orientation.adoc[Orientation for Sysadmin Guide]
*** xref:sysadmin_guide:index.adoc#_standard_operating_procedures[Standard Operation Procedures] *** xref:sysadmin_guide:index.adoc#_standard_operating_procedures[Standard Operation Procedures]
*** xref:sysadmin_guide:failedharddrive.adoc[Replacing Failed Hard Drives]
*** xref:sysadmin_guide:index.adoc#_howtos[HOWTOs] *** xref:sysadmin_guide:index.adoc#_howtos[HOWTOs]
* xref:release_guide:index.adoc[Release Engineering] * xref:release_guide:index.adoc[Release Engineering]
** xref:release_guide:release_process.adoc[Release process] ** xref:release_guide:release_process.adoc[Release process]


@ -70,21 +70,19 @@ procedures for Fedora Infrastructure applications. For information on
how to write a new standard operating procedure, consult the guide on how to write a new standard operating procedure, consult the guide on
xref:developer_guide:sops.adoc[Developing Standard Operating Procedures]. xref:developer_guide:sops.adoc[Developing Standard Operating Procedures].
* xref:2-factor.adoc[Two factor auth]
* xref:accountdeletion.adoc[Account Deletion SOP] * xref:accountdeletion.adoc[Account Deletion SOP]
* xref:fedmsg-new-message-type.adoc[Adding a new fedmsg message type]
* xref:anitya.adoc[Anitya Infrastructure SOP] * xref:anitya.adoc[Anitya Infrastructure SOP]
* xref:ansible.adoc[Ansible] * xref:ansible.adoc[Ansible]
* xref:apps-fp-o.adoc[apps.fedoraproject.org] * xref:apps-fp-o.adoc[apps.fedoraproject.org]
* xref:aws-access.adoc[Amazon Web Services Access] * xref:aws-access.adoc[Amazon Web Services Access]
* xref:mirrormanager-S3-EC2-netblocks.adoc[Amazon Web Services Mirrors]
* xref:bastion-hosts-info.adoc[Bastion Hosts] * xref:bastion-hosts-info.adoc[Bastion Hosts]
* xref:blockerbugs.adoc[Blockerbugs Infrastructure] * xref:blockerbugs.adoc[Blockerbugs Infrastructure]
* xref:bodhi.adoc[Bodhi Infrastructure - Releng]
* xref:bodhi-deploy.adoc[Bodhi Infrastructure - Deployment] * xref:bodhi-deploy.adoc[Bodhi Infrastructure - Deployment]
* xref:bodhi.adoc[Bodhi Infrastructure - Releng]
* xref:bugzilla2fedmsg.adoc[bugzilla2fedmsg] * xref:bugzilla2fedmsg.adoc[bugzilla2fedmsg]
* xref:collectd.adoc[Collectd] * xref:collectd.adoc[Collectd]
* xref:compose-tracker.adoc[Compose Tracker] * xref:compose-tracker.adoc[Compose Tracker]
* xref:registry.adoc[Container registry]
* xref:contenthosting.adoc[Content Hosting Infrastructure] * xref:contenthosting.adoc[Content Hosting Infrastructure]
* xref:copr.adoc[Copr] * xref:copr.adoc[Copr]
* xref:coreos-cincinnati.adoc[CoreOS Cincinnati] * xref:coreos-cincinnati.adoc[CoreOS Cincinnati]
@ -96,46 +94,46 @@ xref:developer_guide:sops.adoc[Developing Standard Operating Procedures].
* xref:dns.adoc[DNS repository for fedoraproject] * xref:dns.adoc[DNS repository for fedoraproject]
* xref:docs.fedoraproject.org.adoc[Docs] * xref:docs.fedoraproject.org.adoc[Docs]
* xref:externally-hosted-services.adoc[Externally Hosted Services] * xref:externally-hosted-services.adoc[Externally Hosted Services]
* xref:failedharddrive.adoc[Replacing Failed Hard Drives]
* xref:fas-openid.adoc[FAS-OpenID] * xref:fas-openid.adoc[FAS-OpenID]
* xref:fedmsg-certs.adoc[fedmsg (Fedora Messaging) Certs, Keys, and CA] * xref:fedmsg-certs.adoc[fedmsg (Fedora Messaging) Certs, Keys, and CA]
* xref:fedmsg-gateway.adoc[fedmsg-gateway] * xref:fedmsg-gateway.adoc[fedmsg-gateway]
* xref:fedmsg-introduction.adoc[fedmsg introduction and basics] * xref:fedmsg-introduction.adoc[fedmsg introduction and basics]
* xref:fedmsg-new-message-type.adoc[Adding a new fedmsg message type]
* xref:fedmsg-relay.adoc[fedmsg-relay] * xref:fedmsg-relay.adoc[fedmsg-relay]
* xref:fedmsg-websocket.adoc[WebSocket]
* xref:fedocal.adoc[Fedocal] * xref:fedocal.adoc[Fedocal]
* xref:fedora-releases.adoc[Fedora Release Infrastructure] * xref:fedora-releases.adoc[Fedora Release Infrastructure]
* xref:fedorawebsites.adoc[Websites Release]
* xref:gather-easyfix.adoc[Fedora gather easyfix] * xref:gather-easyfix.adoc[Fedora gather easyfix]
* xref:status-fedora.adoc[Fedora Status Service]
* xref:gdpr_delete.adoc[GDPR Delete] * xref:gdpr_delete.adoc[GDPR Delete]
* xref:gdpr_sar.adoc[GDPR SAR] * xref:gdpr_sar.adoc[GDPR SAR]
* xref:geoip-city-wsgi.adoc[geoip-city-wsgi] * xref:geoip-city-wsgi.adoc[geoip-city-wsgi]
* xref:github.adoc[Using github for Infra Projects] * xref:github.adoc[Using github for Infra Projects]
* xref:github2fedmsg.adoc[github2fedmsg] * xref:github2fedmsg.adoc[github2fedmsg]
* xref:greenwave.adoc[Greenwave] * xref:greenwave.adoc[Greenwave]
* xref:guest_migrate.adoc[Migrate Guest VMs]
* xref:guestdisk.adoc[Guest Disk Resize] * xref:guestdisk.adoc[Guest Disk Resize]
* xref:guestedit.adoc[Guest Editing] * xref:guestedit.adoc[Guest Editing]
* xref:guest_migrate.adoc[Migrate Guest VMs]
* xref:haproxy.adoc[Haproxy Infrastructure] * xref:haproxy.adoc[Haproxy Infrastructure]
* xref:hotfix.adoc[HOTFIXES] * xref:hotfix.adoc[HOTFIXES]
* xref:tickets.adoc[How to handle new tickets in fedora-infrastructure] * xref:hotness.adoc[The New Hotness]
* xref:infra-git-repo.adoc[Infrastructure Git Repos] * xref:infra-git-repo.adoc[Infrastructure Git Repos]
* xref:infra-hostrename.adoc[Infrastructure Host Rename] * xref:infra-hostrename.adoc[Infrastructure Host Rename]
* xref:infra_handover.adoc[Initiative Handover]
* xref:infra-raidmismatch.adoc[Infrastructure Raid Mismatch Count] * xref:infra-raidmismatch.adoc[Infrastructure Raid Mismatch Count]
* xref:infra-repo.adoc[Infrastructure DNF Repo] * xref:infra-repo.adoc[Infrastructure DNF Repo]
* xref:infra-retiremachine.adoc[Infrastructure retire machine] * xref:infra-retiremachine.adoc[Infrastructure retire machine]
* xref:infra_handover.adoc[Initiative Handover]
* xref:ipa.adoc[IPA infrastructure] * xref:ipa.adoc[IPA infrastructure]
* xref:ipsilon.adoc[Ipsilon Infrastructure] * xref:ipsilon.adoc[Ipsilon Infrastructure]
* xref:iscsi.adoc[iSCSI] * xref:iscsi.adoc[iSCSI]
* xref:kerneltest-harness.adoc[Kerneltest-harness] * xref:kerneltest-harness.adoc[Kerneltest-harness]
* xref:kickstarts.adoc[Kickstart Infrastructure] * xref:kickstarts.adoc[Kickstart Infrastructure]
* xref:koji-archive.adoc[Koji Archive] * xref:koji-archive.adoc[Koji Archive]
* xref:virt-image.adoc[Kpartx Notes] * xref:koji-builder-setup.adoc[Setup Koji Builder]
* xref:koji.adoc[Koji Infrastructure] * xref:koji.adoc[Koji Infrastructure]
* xref:koschei.adoc[Koschei] * xref:koschei.adoc[Koschei]
* xref:layered-image-buildsys.adoc[Layered Image Build System] * xref:layered-image-buildsys.adoc[Layered Image Build System]
* xref:virt-notes.adoc[Libvirt Notes]
* xref:syslog.adoc[Log Infrastructure]
* xref:publictest-dev-stg-production.adoc[Machine Classes]
* xref:mailman.adoc[Mailman Infrastructure] * xref:mailman.adoc[Mailman Infrastructure]
* xref:massupgrade.adoc[Mass Upgrade Infrastructure] * xref:massupgrade.adoc[Mass Upgrade Infrastructure]
* xref:mastermirror.adoc[Master Mirror Infrastructure] * xref:mastermirror.adoc[Master Mirror Infrastructure]
@ -144,10 +142,12 @@ xref:developer_guide:sops.adoc[Developing Standard Operating Procedures].
* xref:message-tagging-service.adoc[Message Tagging Service] * xref:message-tagging-service.adoc[Message Tagging Service]
* xref:mini_initiatives.adoc[Mini initiative Process] * xref:mini_initiatives.adoc[Mini initiative Process]
* xref:mirrorhiding.adoc[Mirror Hiding Infrastructure] * xref:mirrorhiding.adoc[Mirror Hiding Infrastructure]
* xref:mirrormanager-S3-EC2-netblocks.adoc[Amazon Web Services Mirrors]
* xref:mirrormanager.adoc[MirrorManager Infrastructure] * xref:mirrormanager.adoc[MirrorManager Infrastructure]
* xref:mote.adoc[mote] * xref:mote.adoc[mote]
* xref:nagios.adoc[Nagios] * xref:nagios.adoc[Nagios]
* xref:netapp.adoc[Netapp Infrastructure] * xref:netapp.adoc[Netapp Infrastructure]
* xref:new-virtual-hosts.adoc[Virtual Host Addition]
* xref:nonhumanaccounts.adoc[Non-human Accounts Infrastructure] * xref:nonhumanaccounts.adoc[Non-human Accounts Infrastructure]
* xref:ocp4:sops.adoc[Openshift SOPs] * xref:ocp4:sops.adoc[Openshift SOPs]
* xref:odcs.adoc[On Demand Compose Service] * xref:odcs.adoc[On Demand Compose Service]
@ -159,27 +159,29 @@ xref:developer_guide:sops.adoc[Developing Standard Operating Procedures].
* xref:pdc.adoc[PDC] * xref:pdc.adoc[PDC]
* xref:pesign-upgrade.adoc[Pesign upgrades/reboots] * xref:pesign-upgrade.adoc[Pesign upgrades/reboots]
* xref:planetsubgroup.adoc[Planet Subgroup Infrastructure] * xref:planetsubgroup.adoc[Planet Subgroup Infrastructure]
* xref:publictest-dev-stg-production.adoc[Machine Classes]
* xref:rabbitmq.adoc[RabbitMQ] * xref:rabbitmq.adoc[RabbitMQ]
* xref:rdiff-backup.adoc[rdiff-backup] * xref:rdiff-backup.adoc[rdiff-backup]
* xref:registry.adoc[Container registry]
* xref:requestforresources.adoc[Request for resources] * xref:requestforresources.adoc[Request for resources]
* xref:resultsdb.adoc[ResultsDB] * xref:resultsdb.adoc[ResultsDB]
* xref:retrace.adoc[Retrace] * xref:retrace.adoc[Retrace]
* xref:scmadmin.adoc[SCM Admin] * xref:scmadmin.adoc[SCM Admin]
* xref:selinux.adoc[SELinux Infrastructure] * xref:selinux.adoc[SELinux Infrastructure]
* xref:koji-builder-setup.adoc[Setup Koji Builder]
* xref:sigul-upgrade.adoc[Sigul servers upgrades/reboots] * xref:sigul-upgrade.adoc[Sigul servers upgrades/reboots]
* xref:sop_hardware_troubleshooting_power.adoc[Hardware Troubleshoot Power Issue SOP]
* xref:sshaccess.adoc[SSH Access Infrastructure] * xref:sshaccess.adoc[SSH Access Infrastructure]
* xref:sshknownhosts.adoc[SSH known hosts Infrastructure] * xref:sshknownhosts.adoc[SSH known hosts Infrastructure]
* xref:ssl-certificates.adoc[SSL Certificates] * xref:ssl-certificates.adoc[SSL Certificates]
* xref:staging.adoc[Staging] * xref:staging.adoc[Staging]
* xref:hotness.adoc[The New Hotness] * xref:status-fedora.adoc[Fedora Status Service]
* xref:2-factor.adoc[Two factor auth] * xref:syslog.adoc[Log Infrastructure]
* xref:tickets.adoc[How to handle new tickets in fedora-infrastructure]
* xref:unbound.adoc[Unbound Notes] * xref:unbound.adoc[Unbound Notes]
* xref:new-virtual-hosts.adoc[Virtual Host Addition] * xref:virt-image.adoc[Kpartx Notes]
* xref:virt-notes.adoc[Libvirt Notes]
* xref:voting.adoc[Voting Infrastructure] * xref:voting.adoc[Voting Infrastructure]
* xref:waiverdb.adoc[WaiverDB] * xref:waiverdb.adoc[WaiverDB]
* xref:fedorawebsites.adoc[Websites Release]
* xref:fedmsg-websocket.adoc[WebSocket]
* xref:wcidff.adoc[What Can I Do For Fedora] * xref:wcidff.adoc[What Can I Do For Fedora]
* xref:wiki.adoc[Wiki Infrastructure] * xref:wiki.adoc[Wiki Infrastructure]
* xref:zabbix.adoc[Zabbix Infrastructure] * xref:zabbix.adoc[Zabbix Infrastructure]


@ -0,0 +1,88 @@
== Hardware Troubleshooting Power Issue
=== Overview
This SOP walks through the steps taken to troubleshoot and diagnose a power issue with one of our servers. The issue was tracked in Fedora Infrastructure ticket https://pagure.io/fedora-infrastructure/issue/11950
Symptoms:
- The server does not respond at all and will not power on.
- Reaching the management interface of RDU2-CC devices is a bit trickier than in IAD2. There is a private management VLAN there, but it is only reachable via cloud-noc-os01.rdu-cc.fedoraproject.org. The sshuttle tool can transparently forward traffic to devices on that network, for example: `sshuttle 172.23.1.0/24 -r cloud-noc-os01.rdu-cc.fedoraproject.org`
The devices are all in the 172.23.1.0/24 network. There is a list of them in `ansible-private/docs/rdu-networks.txt`; this host is `172.23.1.105`.
The management password can be obtained from the Bitwarden Vault.
- Logs show voltages outside the correct range.
- At RDU2-CC we have a contact: `James Gibson`.
=== Contact Information
Owner::
Fedora Infrastructure Team
Contact::
#fedora-admin, sysadmin-main
Purpose::
Troubleshoot and diagnose a power issue on a server
=== Requirements
- sshuttle to access the network at RDU2-CC
- Bitwarden Vault access. Access to the vault is under discussion; for now, consult the sysadmin-main team for the login credentials.
- Access to the ansible-private repo.
=== Troubleshooting Steps
.Connect to the management VLAN for the RDU2-CC network:
This is only required because this server is not in the IAD2 datacenter. Use sshuttle to connect to the 172.23.1.0/24 management network. Run it directly from your laptop, not from batcave01: `sshuttle 172.23.1.0/24 -r cloud-noc-os01.rdu-cc.fedoraproject.org`
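
The connection step can be captured in a small shell helper. This is only a sketch: the function name `mgmt_tunnel_cmd` and the variable names are hypothetical, while the subnet and jump host are the ones quoted above. The helper prints the command instead of running it, so it can be reviewed before use.

```shell
#!/usr/bin/env bash
# Values quoted in this SOP.
MGMT_SUBNET="172.23.1.0/24"
JUMP_HOST="cloud-noc-os01.rdu-cc.fedoraproject.org"

# Hypothetical helper: print the sshuttle command rather than run it,
# so it can be checked first.
mgmt_tunnel_cmd() {
    printf 'sshuttle -r %s %s\n' "$JUMP_HOST" "$MGMT_SUBNET"
}

mgmt_tunnel_cmd
```

Run the printed command from your laptop and leave it running in its own terminal while you work on the management network.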
.SSH to batcave01 and retrieve the IP address for this machine
SSH to batcave01, open the ansible-private repo and read the IP address for this machine from `docs/rdu-networks.txt`.
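
The lookup can be done with `grep`. The sketch below makes an assumption about the file that this SOP does not confirm: that each entry in `docs/rdu-networks.txt` has the host name and its 172.23.1.x address on the same line. The function name `find_mgmt_ip` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print management-VLAN addresses found on lines
# that mention the given host name.
# Assumption: the host name and its 172.23.1.x address share a line.
find_mgmt_ip() {
    local name="$1" file="$2"
    # -F treats the name as a fixed string, -i ignores case;
    # the second grep keeps only the matching address.
    grep -iF -- "$name" "$file" | grep -oE '172\.23\.1\.[0-9]{1,3}'
}

# Usage, from a checkout of ansible-private on batcave01:
#   find_mgmt_ip <hostname> docs/rdu-networks.txt
```

If the file is laid out differently, opening it in a pager and searching by hand is just as good.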
.Open the Management Console
With the IP address, visit https://IP in a browser to access the iDRAC management console, for example: https://172.23.1.105/
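
Before opening a browser, it can help to confirm that the console answers over the tunnel. This is a minimal sketch: both function names are hypothetical, and the `-k` flag assumes the console presents a self-signed certificate, which this SOP does not state.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the console URL from an IP address.
idrac_url() {
    printf 'https://%s/\n' "$1"
}

# Hypothetical helper: returns 0 if an HTTPS response is received from the address.
# -k skips certificate verification (assumption: self-signed certificate).
# --connect-timeout 5 gives up after five seconds if nothing answers.
idrac_reachable() {
    curl -k -s -o /dev/null --connect-timeout 5 "$(idrac_url "$1")"
}

# Usage, with the sshuttle tunnel up:
#   idrac_reachable 172.23.1.105 && echo "console is up: $(idrac_url 172.23.1.105)"
```

A failure here usually means the sshuttle tunnel is not running, rather than anything wrong with the server itself.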
.Retrieve the username and password from Bitwarden
This is a prod machine, so use the username and password from Bitwarden to log in.
.Once logged in, retrieve the service tag for this server
Get the service tag (XXXXXXX); it is on the summary page of the management console. Dell tech support requires it as proof that the server is under warranty.
.Open a tech support ticket with Dell
Open a ticket with tech support chat: https://www.dell.com/support/incidents-online/en-ie/ContactUs/Dynamic?spestate
.Collect logs from the server for Dell
See https://www.dell.com/support/kbdoc/en-us/000126308/export-a-supportassist-collection-via-idrac9 for how to collect logs for tech support.
.Dell requested firmware updates on the iDRAC and server, along with a reseat of the OCP card
Contacted James Gibson internally and opened a ticket in ServiceNow, requesting that he arrange a trip to the datacenter to reseat the OCP card.
Updated the firmware on the iDRAC itself successfully, but could not update the firmware on the server because it would not power on.
.OCP reseat carried out
James was able to get out to the RDU2-CC datacenter and carry out this work. Reseating the OCP card had no effect. He troubleshot further and removed one PSU, but the server was still stuck in a reboot cycle. He then reattached it and removed the other PSU, and the server booted fine. So we think we have identified a faulty PSU.
.Request to reupload logs
Dell's first request was for the zipped TSR logs to be generated and forwarded to them.
Use the following site to upload the TSR, as it might be too big to attach to an email: https://tdm.dell.com/file-upload
The upload form requires a service request number, so be sure to ask the Dell technician for one.
.Swap PSU1 with PSU2
Dell requested the following check be carried out:
Please swap PSU1 with PSU2 and check whether the server will power up.
If the issue persists, test PSU2 in slot 1 and confirm.
Once completed, collect logs and share them so we can proceed with action.
.Both PSUs seem functional
James Gibson swapped the PSUs in this server on Friday, and the server powered on as normal. So it appears both PSUs are in fact working; perhaps something is wrong with the chassis slots the units go into? Informed Dell and waited for an update on what to troubleshoot next.
.Dell suggested plugging the hardware into a different power point
Since both ports have been tested, I'm thinking this could be an external issue or a configuration issue.
Are the PSUs set to redundant?
When plugged in at the same time, are they plugged into the same outlet/UPS?
If so, can we test by plugging them into different outlets/UPSes?
.This appears to have resolved our issue.
Forwarded information to James Gibson to see what he thinks.
We moved the power to different power points with the second PSU reattached, and the server now appears to be working correctly.
Closed the ticket with Dell.