diff --git a/modules/sysadmin_guide/nav.adoc b/modules/sysadmin_guide/nav.adoc
index c719cf7..9d548b7 100644
--- a/modules/sysadmin_guide/nav.adoc
+++ b/modules/sysadmin_guide/nav.adoc
@@ -83,7 +83,7 @@
 ** xref:openqa.adoc[OpenQA Infrastructure - SOP]
 ** xref:openshift.adoc[OpenShift - SOP]
 ** xref:openvpn.adoc[OpenVPN - SOP]
-** xref:outage.adoc[outage - SOP in review ]
+** xref:outage.adoc[Outage Infrastructure - SOP]
 ** xref:packagedatabase.adoc[packagedatabase - SOP in review ]
 ** xref:packagereview.adoc[packagereview - SOP in review ]
 ** xref:pagure.adoc[pagure - SOP in review ]
diff --git a/modules/sysadmin_guide/pages/outage.adoc b/modules/sysadmin_guide/pages/outage.adoc
index b2719cd..9d0e0b2 100644
--- a/modules/sysadmin_guide/pages/outage.adoc
+++ b/modules/sysadmin_guide/pages/outage.adoc
@@ -4,64 +4,26 @@ What to do when there's an outage or when you're planning to take an outage.
 == Contents
-____
-[arabic]
-. Contact Information
-. Users (No Access)
-____
-
-____
-[arabic]
-. Planned Outage
-____
-
-____
-[arabic]
-. Contacts
-____
-
-____
-[arabic, start=2]
-. Unplanned Outage
-____
-
-____
-[arabic]
-. Check first
-. Reporting or participating in an outage
-____
-
---
-____
-[arabic, start=5]
-. Infrastructure Members (Admin Access)
-____
---
-____
-[arabic]
-. Planned Outage
-____
-____
-[arabic]
-. Planning
-. Preparations
-. Outage
-. Post outage cleanup
-____
-
-____
-[arabic, start=2]
-. Unplanned Outage
-____
-____
-[arabic]
-. Determine Severity
-. First Steps
-. Fix it
-. Escalate
-. The Resolution
-. The Aftermath
-____
+* <<_contact_information>>
+* <<_users_no_access>>
+** <<_planned_outage>>
+*** <<_contacts>>
+** <<_unplanned_outage>>
+*** <<_check_first>>
+*** <<_reporting_or_participating_in_an_outage>>
+* <<_infrastructure_members_admin_access>>
+** <<_planned_outage>>
+*** <<_planning>>
+*** <<_preparations>>
+*** <<_outage>>
+*** <<_post_outage_cleanup>>
+** <<_unplanned_outage>>
+*** <<_determine_severity>>
+*** <<_first_steps>>
+*** <<_fix_it>>
+*** <<_escalate>>
+*** <<_the_resolution>>
+*** <<_the_aftermath>>
 == Contact Information
@@ -75,13 +37,10 @@ Servers::
  Any
 Purpose::
  This SOP is generic for any outage
-Emergency:::
- https://admin.fedoraproject.org/pager
 == Users (No Access)
 [NOTE]
-.Note
 ====
 Don't have shell access? Doesn't matter. Stop by and stay in #fedora-admin if you have any expertise in what is going on, please
@@ -100,7 +59,7 @@ a koji outage, let someone know.
 ==== Contacts
 Pretty much all coordination occurs in #fedora-admin on
-irc.freenode.net. Stop by there to watch more about what's going on.
+https://libera.chat/[libera.chat]. Stop by there to watch more about what's going on.
 Just stay on topic.
 === Unplanned Outage
@@ -119,36 +78,34 @@ reported outage that may be causing and/or related to your issue.
 ==== Reporting or participating in an outage
 If you think you've found an outage, get as much information as you can
-about it at a glance. Copy any errors you get to http://pastebin.ca/.
+about it at a glance. Copy any errors you get to https://paste.centos.org/.
 Use the following guidelines:
-Don't be general.::
+Don't be general::
  * BAD: "The wiki is acting slow"
  * Good: "Whenever I try to save https://fedoraproject.org/wiki/Infrastructure, I get a proxy error after 60 seconds"
-Don't report an outage that's already been reported.::
+
+Don't report an outage that's already been reported::
  * BAD: "/join #fedora-admin; Is the build system broken?"
  * Good: "/join #fedora-admin; wait a minute or two; I noticed I can't submit builds, here's the error I get:"
-Don't suggest drastic or needless changes during an outage (send it to
-the list)::
+
+Don't suggest drastic or needless changes during an outage (send it to the list)::
  * "Why don't you just use lighttpd?"
  * "You could try limiting MaxRequestsPerChild in Apache"
 Don't get off topic or be too chatty::
  * "Transformers was awesome, but yeah, I think you guys know what to do next"
-Do research the technologies we're using and answer questions that may
-come up.::
- * BAD: "Can't you just fix it?"
- * {blank}
- +
- Good: "Hey guys, I think this is what you're looking for:;;
- http://httpd.apache.org/docs/2.2/mod/mod_mime.html#addencoding"
-If no one can be contacted after 10 minutes or so please see the section
-below called Determine Severity to determine whether or not someone
-should get paged.
+Do research the technologies we're using and answer questions that may come up::
+ * BAD: "Can't you just fix it?"
+ * Good: "Hey guys, I think this is what you're looking for:
+ http://httpd.apache.org/docs/2.2/mod/mod_mime.html#addencoding"
+
+Please try to contact OnCall first. This could be done by typing `.oncall`
+in #fedora-admin channel.
 == Infrastructure Members (Admin Access)
@@ -189,7 +146,7 @@ reporting. https://admin.fedoraproject.org/nagios/
 Prior to beginning an outage to any monitored service on http://status.fedoraproject.org please push an update to reflect the
-outage (see status-fedora SOP).
+outage (see xref:status-fedora.adoc[status-fedora SOP]).
 Report all information in #fedora-admin. Coordination is extremely important, it's rare for our group to meet in person and IRC is our only
@@ -210,10 +167,8 @@ Once the services are restored, an update to the status dashboard should
 be pushed to show the services are restored.
 [IMPORTANT]
-.Important
 ====
-Additionally update any SOP's that may have changed in the course of the
-outage
+Additionally update any SOP's that may have changed in the course of the outage
 ====
 === Unplanned Outage
@@ -228,8 +183,7 @@ let the team know. Messes can always be cleaned up after the outage.
 Some outages require immediate fixing, some don't. A page should never go out because someone can't sign the cla. Most of our admins are in US
-time, use your best judgment. If it's bad enough to warrant an emergency
-page, page one of the admins at: https://admin.fedoraproject.org/pager
+time, use your best judgment.
 Use the following as loose guidelines, just use your best judgment.
@@ -248,10 +202,10 @@ slashdot.
 After an outage has been verified, acknowledge the outage in nagios: https://admin.fedoraproject.org/nagios/, update the related system on
-the status dashboard (see the status-fedora SOP) and verify changes at
-http://status.fedoraproject.org, then head in to #fedora-admin to figure
-out who is around and coordinate the next course of action. Consult any
-relevent SOP's for corrective actions.
+the status dashboard (see the xref:status-fedora.adoc[status-fedora SOP])
+and verify changes at http://status.fedoraproject.org, then head in to
+#fedora-admin to figure out who is around and coordinate the next course
+of action. Consult any relevent SOP's for corrective actions.
 ==== Fix it
@@ -263,10 +217,9 @@ just don't be stupid about it.
 Can't fix it? Don't wait, Escalate! All of the team members have expertise with some areas of our environment and weaknesses in other areas. Never be afraid to tap another team member. Sometimes it's
-required, sometimes it's not. The last layer of defense is to page
-someone. At present our team is small enough that a full escalation path
-wouldn't do much good. Consult the contact information on each SOP for
-more information.
+required, sometimes it's not. At present our team is small enough that
+a full escalation path wouldn't do much good. Consult the contact
+information on each SOP for more information.
 ==== The Resolution
@@ -286,7 +239,6 @@ fedora-infrastructure-list.
 . What was the root cause?
 [IMPORTANT]
-.Important
 ====
 Number 4 is especially important. If a kernel build keeps failing because of issues with koji caused by a database failure caused by a