From 8a82b94423a6b35f25f0369661319e13f21d83f6 Mon Sep 17 00:00:00 2001
From: David Kirwan
Date: Fri, 12 Jul 2024 08:45:31 +0100
Subject: [PATCH] hardware troubleshooting SOP

Signed-off-by: David Kirwan
---
 modules/ROOT/nav.adoc                         |  1 -
 modules/sysadmin_guide/pages/index.adoc       | 38 ++++----
 .../sop_hardware_troubleshooting_power.adoc   | 88 +++++++++++++++++++
 3 files changed, 108 insertions(+), 19 deletions(-)
 create mode 100644 modules/sysadmin_guide/pages/sop_hardware_troubleshooting_power.adoc

diff --git a/modules/ROOT/nav.adoc b/modules/ROOT/nav.adoc
index 7027660..50f947f 100644
--- a/modules/ROOT/nav.adoc
+++ b/modules/ROOT/nav.adoc
@@ -33,7 +33,6 @@
 ** xref:sysadmin_guide:index.adoc[Sysadmin Guide]
 *** xref:sysadmin_guide:orientation.adoc[Orientation for Sysadmin Guide]
 *** xref:sysadmin_guide:index.adoc#_standard_operating_procedures[Standard Operation Procedures]
-*** xref:sysadmin_guide:failedharddrive.adoc[Replacing Failed Hard Drives]
 *** xref:sysadmin_guide:index.adoc#_howtos[HOWTOs]
 * xref:release_guide:index.adoc[Release Engineering]
 ** xref:release_guide:release_process.adoc[Release process]
diff --git a/modules/sysadmin_guide/pages/index.adoc b/modules/sysadmin_guide/pages/index.adoc
index 8dfa159..affe60f 100644
--- a/modules/sysadmin_guide/pages/index.adoc
+++ b/modules/sysadmin_guide/pages/index.adoc
@@ -70,21 +70,19 @@ procedures for Fedora Infrastructure applications. For information on how to
 write a new standard operating procedure, consult the guide on
 xref:developer_guide:sops.adoc[Developing Standard Operating Procedures].
 
+* xref:2-factor.adoc[Two factor auth]
 * xref:accountdeletion.adoc[Account Deletion SOP]
-* xref:fedmsg-new-message-type.adoc[Adding a new fedmsg message type]
 * xref:anitya.adoc[Anitya Infrastructure SOP]
 * xref:ansible.adoc[Ansible]
 * xref:apps-fp-o.adoc[apps.fedoraproject.org]
 * xref:aws-access.adoc[Amazon Web Services Access]
-* xref:mirrormanager-S3-EC2-netblocks.adoc[Amazon Web Services Mirrors]
 * xref:bastion-hosts-info.adoc[Bastion Hosts]
 * xref:blockerbugs.adoc[Blockerbugs Infrastructure]
-* xref:bodhi.adoc[Bodhi Infrastructure - Releng]
 * xref:bodhi-deploy.adoc[Bodhi Infrastructure - Deployment]
+* xref:bodhi.adoc[Bodhi Infrastructure - Releng]
 * xref:bugzilla2fedmsg.adoc[bugzilla2fedmsg]
 * xref:collectd.adoc[Collectd]
 * xref:compose-tracker.adoc[Compose Tracker]
-* xref:registry.adoc[Container registry]
 * xref:contenthosting.adoc[Content Hosting Infrastructure]
 * xref:copr.adoc[Copr]
 * xref:coreos-cincinnati.adoc[CoreOS Cincinnati]
@@ -96,46 +94,46 @@ xref:developer_guide:sops.adoc[Developing Standard Operating Procedures].
 * xref:dns.adoc[DNS repository for fedoraproject]
 * xref:docs.fedoraproject.org.adoc[Docs]
 * xref:externally-hosted-services.adoc[Externally Hosted Services]
+* xref:failedharddrive.adoc[Replacing Failed Hard Drives]
 * xref:fas-openid.adoc[FAS-OpenID]
 * xref:fedmsg-certs.adoc[fedmsg (Fedora Messaging) Certs, Keys, and CA]
 * xref:fedmsg-gateway.adoc[fedmsg-gateway]
 * xref:fedmsg-introduction.adoc[fedmsg introduction and basics]
+* xref:fedmsg-new-message-type.adoc[Adding a new fedmsg message type]
 * xref:fedmsg-relay.adoc[fedmsg-relay]
+* xref:fedmsg-websocket.adoc[WebSocket]
 * xref:fedocal.adoc[Fedocal]
 * xref:fedora-releases.adoc[Fedora Release Infrastructure]
+* xref:fedorawebsites.adoc[Websites Release]
 * xref:gather-easyfix.adoc[Fedora gather easyfix]
-* xref:status-fedora.adoc[Fedora Status Service]
 * xref:gdpr_delete.adoc[GDPR Delete]
 * xref:gdpr_sar.adoc[GDPR SAR]
 * xref:geoip-city-wsgi.adoc[geoip-city-wsgi]
 * xref:github.adoc[Using github for Infra Projects]
 * xref:github2fedmsg.adoc[github2fedmsg]
 * xref:greenwave.adoc[Greenwave]
+* xref:guest_migrate.adoc[Migrate Guest VMs]
 * xref:guestdisk.adoc[Guest Disk Resize]
 * xref:guestedit.adoc[Guest Editing]
-* xref:guest_migrate.adoc[Migrate Guest VMs]
 * xref:haproxy.adoc[Haproxy Infrastructure]
 * xref:hotfix.adoc[HOTFIXES]
-* xref:tickets.adoc[How to handle new tickets in fedora-infrastructure]
+* xref:hotness.adoc[The New Hotness]
 * xref:infra-git-repo.adoc[Infrastructure Git Repos]
 * xref:infra-hostrename.adoc[Infrastructure Host Rename]
-* xref:infra_handover.adoc[Initiative Handover]
 * xref:infra-raidmismatch.adoc[Infrastructure Raid Mismatch Count]
 * xref:infra-repo.adoc[Infrastructure DNF Repo]
 * xref:infra-retiremachine.adoc[Infrastructure retire machine]
+* xref:infra_handover.adoc[Initiative Handover]
 * xref:ipa.adoc[IPA infrastructure]
 * xref:ipsilon.adoc[Ipsilon Infrastructure]
 * xref:iscsi.adoc[iSCSI]
 * xref:kerneltest-harness.adoc[Kerneltest-harness]
 * xref:kickstarts.adoc[Kickstart Infrastructure]
 * xref:koji-archive.adoc[Koji Archive]
-* xref:virt-image.adoc[Kpartx Notes]
+* xref:koji-builder-setup.adoc[Setup Koji Builder]
 * xref:koji.adoc[Koji Infrastructure]
 * xref:koschei.adoc[Koschei]
 * xref:layered-image-buildsys.adoc[Layered Image Build System]
-* xref:virt-notes.adoc[Libvirt Notes]
-* xref:syslog.adoc[Log Infrastructure]
-* xref:publictest-dev-stg-production.adoc[Machine Classes]
 * xref:mailman.adoc[Mailman Infrastructure]
 * xref:massupgrade.adoc[Mass Upgrade Infrastructure]
 * xref:mastermirror.adoc[Master Mirror Infrastructure]
@@ -144,10 +142,12 @@ xref:developer_guide:sops.adoc[Developing Standard Operating Procedures].
 * xref:message-tagging-service.adoc[Message Tagging Service]
 * xref:mini_initiatives.adoc[Mini initiative Process]
 * xref:mirrorhiding.adoc[Mirror Hiding Infrastructure]
+* xref:mirrormanager-S3-EC2-netblocks.adoc[Amazon Web Services Mirrors]
 * xref:mirrormanager.adoc[MirrorManager Infrastructure]
 * xref:mote.adoc[mote]
 * xref:nagios.adoc[Nagios]
 * xref:netapp.adoc[Netapp Infrastructure]
+* xref:new-virtual-hosts.adoc[Virtual Host Addition]
 * xref:nonhumanaccounts.adoc[Non-human Accounts Infrastructure]
 * xref:ocp4:sops.adoc[Openshift SOPs]
 * xref:odcs.adoc[On Demand Compose Service]
@@ -159,27 +159,29 @@ xref:developer_guide:sops.adoc[Developing Standard Operating Procedures].
 * xref:pdc.adoc[PDC]
 * xref:pesign-upgrade.adoc[Pesign upgrades/reboots]
 * xref:planetsubgroup.adoc[Planet Subgroup Infrastructure]
+* xref:publictest-dev-stg-production.adoc[Machine Classes]
 * xref:rabbitmq.adoc[RabbitMQ]
 * xref:rdiff-backup.adoc[rdiff-backup]
+* xref:registry.adoc[Container registry]
 * xref:requestforresources.adoc[Request for resources]
 * xref:resultsdb.adoc[ResultsDB]
 * xref:retrace.adoc[Retrace]
 * xref:scmadmin.adoc[SCM Admin]
 * xref:selinux.adoc[SELinux Infrastructure]
-* xref:koji-builder-setup.adoc[Setup Koji Builder]
 * xref:sigul-upgrade.adoc[Sigul servers upgrades/reboots]
+* xref:sop_hardware_troubleshooting_power.adoc[Hardware Troubleshooting Power Issue SOP]
 * xref:sshaccess.adoc[SSH Access Infrastructure]
 * xref:sshknownhosts.adoc[SSH known hosts Infrastructure]
 * xref:ssl-certificates.adoc[SSL Certificates]
 * xref:staging.adoc[Staging]
-* xref:hotness.adoc[The New Hotness]
-* xref:2-factor.adoc[Two factor auth]
+* xref:status-fedora.adoc[Fedora Status Service]
+* xref:syslog.adoc[Log Infrastructure]
+* xref:tickets.adoc[How to handle new tickets in fedora-infrastructure]
 * xref:unbound.adoc[Unbound Notes]
-* xref:new-virtual-hosts.adoc[Virtual Host Addition]
+* xref:virt-image.adoc[Kpartx Notes]
+* xref:virt-notes.adoc[Libvirt Notes]
 * xref:voting.adoc[Voting Infrastructure]
 * xref:waiverdb.adoc[WaiverDB]
-* xref:fedorawebsites.adoc[Websites Release]
-* xref:fedmsg-websocket.adoc[WebSocket]
 * xref:wcidff.adoc[What Can I Do For Fedora]
 * xref:wiki.adoc[Wiki Infrastructure]
 * xref:zabbix.adoc[Zabbix Infrastructure]
diff --git a/modules/sysadmin_guide/pages/sop_hardware_troubleshooting_power.adoc b/modules/sysadmin_guide/pages/sop_hardware_troubleshooting_power.adoc
new file mode 100644
index 0000000..481476e
--- /dev/null
+++ b/modules/sysadmin_guide/pages/sop_hardware_troubleshooting_power.adoc
@@ -0,0 +1,88 @@
+== Hardware Troubleshooting Power Issue
+
+
+=== Overview
+This SOP shows some of the steps required to troubleshoot and diagnose a power issue with one of our servers. A ticket was opened for the issue. Infra Ticket: https://pagure.io/fedora-infrastructure/issue/11950
+
+Symptoms:
+- This server is not responding at all, and will not power on.
+- Getting to the management interface of RDU2-CC devices is a bit trickier than in IAD2. We have a private management VLAN there, but it’s only reachable via cloud-noc-os01.rdu-cc.fedoraproject.org. I usually use the ‘sshuttle’ package/command/app to transparently forward my traffic to devices on that network. That looks something like: `sshuttle 172.23.1.0/24 -r cloud-noc-os01.rdu-cc.fedoraproject.org`
+
+  The devices are all in the 172.23.1 network. There’s a list of them in `ansible-private/docs/rdu-networks.txt` but this host is: `172.23.1.105`.
+  The management password can be obtained from the Bitwarden Vault.
+- Logs show issues with voltages not being in the correct range.
+- At RDU2-CC we have a contact: `James Gibson`.
+
+
+=== Contact Information
+
+Owner::
+  Fedora Infrastructure Team
+Contact::
+  #fedora-admin, sysadmin-main
+Purpose::
+  Troubleshoot and diagnose a power issue with one of our servers
+
+
+=== Requirements
+
+- sshuttle to access the network at RDU2-CC
+- Bitwarden Vault Access - Access to the vault is under discussion. For now, consult the sysadmin-main team for the login credentials.
+- Access to ansible-private repo.
+
+
+=== Troubleshooting Steps
+
+.Connect to the management VLAN for the RDU2-CC network:
+This is only required because this server is not in the IAD2 datacenter. Use sshuttle to make a connection to the 172.23.1.0/24 management network (from your laptop directly, not from batcave01): `sshuttle 172.23.1.0/24 -r cloud-noc-os01.rdu-cc.fedoraproject.org`
+
+.SSH to the batcave01 and retrieve the IP address for this machine
+SSH to batcave01, access the ansible-private repo and read the IP address for this machine from `docs/rdu-networks.txt`.
+
+.Open the Management Console
+With the IP address, visit https://IP in a browser to access the iDRAC management console, like so: https://172.23.1.105/
+
+.Retrieve the username and password from Bitwarden
+This is a prod machine, so use the username and password from Bitwarden to log in.
+
+.Once logged in, retrieve the service tag for this server
+Get the service tag (XXXXXXX); it's on the summary page of the management console. This is required in order to prove to Dell tech support that the server is under warranty.
+
+.Open a tech support ticket with Dell
+Open a ticket with tech support chat: https://www.dell.com/support/incidents-online/en-ie/ContactUs/Dynamic?spestate
+
+.Collect logs from the server for Dell
+See https://www.dell.com/support/kbdoc/en-us/000126308/export-a-supportassist-collection-via-idrac9 for how to collect logs for tech support.
+
+.Dell requested firmware updates on the iDRAC and server, along with a reseat of the OCP card
+Contacted James Gibson internally and opened a ticket in ServiceNow. Requested that he arrange a trip to the datacenter in order to reseat this OCP card.
+Updated the firmware on the iDRAC itself successfully, but could not update the firmware on the server, as it won't turn on.
+
+.OCP reseat carried out
+James managed to get out to the rdu-2 data center and carry out this work. Reseating the OCP card had no effect. However, he troubleshot further: with one PSU removed the server was still stuck in a reboot cycle; with that PSU reattached and the other removed, the server booted fine. So we think we have identified a faulty PSU.
+
+.Request to reupload logs
+The first request was to get the zipped TSR logs generated and forwarded to Dell.
+Use the following site to upload the TSR, as it might be too big to attach to an email: https://tdm.dell.com/file-upload
+This requires a service request, so be sure to ask the Dell technician for a service request number in order to use this form.
+
+.Swap PSU1 with PSU2
+Dell requested the following check be carried out:
+Please swap PSU1 with PSU2 and check if the server will power up.
+If the issue persists, test PSU2 in slot 1 and confirm.
+Once completed, collect logs and share them so we can proceed with action.
+
+.Both PSUs seem functional
+James Gibson swapped the PSUs in this server on Friday, and the server is powering on as normal. So it appears both PSUs are in fact working; perhaps something is wrong with the chassis the units are going into? Informed Dell; waiting on an update to see what to troubleshoot next.
+
+.Dell suggested plugging the hardware into a different power point
+Since both ports have been tested, I'm thinking this could be an external issue or a configuration issue.
+Are the PSUs set to redundant?
+When plugged in at the same time, are they plugged into the same outlet/UPS?
+If so, can we test by plugging them into different outlets/UPS?
+
+.This appears to have resolved our issue.
+Forwarded information to James Gibson to see what he thinks.
+We have moved the power to different power points, with the 2nd PSU reattached, and the server appears to be working correctly now.
+Closed the ticket with Dell.
+