Review outage SOP
Signed-off-by: Michal Konečný <mkonecny@redhat.com>
parent 3b483e7cd3
commit eb2eab1f60
2 changed files with 45 additions and 93 deletions
@@ -83,7 +83,7 @@
 ** xref:openqa.adoc[OpenQA Infrastructure - SOP]
 ** xref:openshift.adoc[OpenShift - SOP]
 ** xref:openvpn.adoc[OpenVPN - SOP]
-** xref:outage.adoc[outage - SOP in review ]
+** xref:outage.adoc[Outage Infrastructure - SOP]
 ** xref:packagedatabase.adoc[packagedatabase - SOP in review ]
 ** xref:packagereview.adoc[packagereview - SOP in review ]
 ** xref:pagure.adoc[pagure - SOP in review ]
@@ -4,64 +4,26 @@ What to do when there's an outage or when you're planning to take an
 outage.
 
 == Contents
-____
-[arabic]
-. Contact Information
-. Users (No Access)
-____
-
-____
-[arabic]
-. Planned Outage
-____
-
-____
-[arabic]
-. Contacts
-____
-
-____
-[arabic, start=2]
-. Unplanned Outage
-____
-
-____
-[arabic]
-. Check first
-. Reporting or participating in an outage
-____
-
---
-____
-[arabic, start=5]
-. Infrastructure Members (Admin Access)
-____
---
-____
-[arabic]
-. Planned Outage
-____
-____
-[arabic]
-. Planning
-. Preparations
-. Outage
-. Post outage cleanup
-____
-
-____
-[arabic, start=2]
-. Unplanned Outage
-____
-____
-[arabic]
-. Determine Severity
-. First Steps
-. Fix it
-. Escalate
-. The Resolution
-. The Aftermath
-____
+* <<_contact_information>>
+* <<_users_no_access>>
+** <<_planned_outage>>
+*** <<_contacts>>
+** <<_unplanned_outage>>
+*** <<_check_first>>
+*** <<_reporting_or_participating_in_an_outage>>
+* <<_infrastructure_members_admin_access>>
+** <<_planned_outage>>
+*** <<_planning>>
+*** <<_preparations>>
+*** <<_outage>>
+*** <<_post_outage_cleanup>>
+** <<_unplanned_outage>>
+*** <<_determine_severity>>
+*** <<_first_steps>>
+*** <<_fix_it>>
+*** <<_escalate>>
+*** <<_the_resolution>>
+*** <<_the_aftermath>>
 
 == Contact Information
 
@@ -75,13 +37,10 @@ Servers::
 Any
 Purpose::
 This SOP is generic for any outage
-Emergency:::
-https://admin.fedoraproject.org/pager
 
 == Users (No Access)
 
 [NOTE]
-.Note
 ====
 Don't have shell access? Doesn't matter. Stop by and stay in
 #fedora-admin if you have any expertise in what is going on, please
@@ -100,7 +59,7 @@ a koji outage, let someone know.
 ==== Contacts
 
 Pretty much all coordination occurs in #fedora-admin on
-irc.freenode.net. Stop by there to watch more about what's going on.
+https://libera.chat/[libera.chat]. Stop by there to watch more about what's going on.
 Just stay on topic.
 
 === Unplanned Outage
@@ -119,36 +78,34 @@ reported outage that may be causing and/or related to your issue.
 ==== Reporting or participating in an outage
 
 If you think you've found an outage, get as much information as you can
-about it at a glance. Copy any errors you get to http://pastebin.ca/.
+about it at a glance. Copy any errors you get to https://paste.centos.org/.
 Use the following guidelines:
 
-Don't be general.::
+Don't be general::
 * BAD: "The wiki is acting slow"
 * Good: "Whenever I try to save
 https://fedoraproject.org/wiki/Infrastructure, I get a proxy error
 after 60 seconds"
-Don't report an outage that's already been reported.::
+
+Don't report an outage that's already been reported::
 * BAD: "/join #fedora-admin; Is the build system broken?"
 * Good: "/join #fedora-admin; wait a minute or two; I noticed I can't
 submit builds, here's the error I get:"
-Don't suggest drastic or needless changes during an outage (send it to
-the list)::
+
+Don't suggest drastic or needless changes during an outage (send it to the list)::
 * "Why don't you just use lighttpd?"
 * "You could try limiting MaxRequestsPerChild in Apache"
 Don't get off topic or be too chatty::
 * "Transformers was awesome, but yeah, I think you guys know what to
 do next"
-Do research the technologies we're using and answer questions that may
-come up.::
-* BAD: "Can't you just fix it?"
-* {blank}
-+
-Good: "Hey guys, I think this is what you're looking for:;;
-http://httpd.apache.org/docs/2.2/mod/mod_mime.html#addencoding"
 
-If no one can be contacted after 10 minutes or so please see the section
-below called Determine Severity to determine whether or not someone
-should get paged.
+Do research the technologies we're using and answer questions that may come up::
+* BAD: "Can't you just fix it?"
+* Good: "Hey guys, I think this is what you're looking for:
+http://httpd.apache.org/docs/2.2/mod/mod_mime.html#addencoding"
+
+Please try to contact OnCall first. This could be done by typing `.oncall`
+in #fedora-admin channel.
 
 == Infrastructure Members (Admin Access)
 
@@ -189,7 +146,7 @@ reporting. https://admin.fedoraproject.org/nagios/
 
 Prior to beginning an outage to any monitored service on
 http://status.fedoraproject.org please push an update to reflect the
-outage (see status-fedora SOP).
+outage (see xref:status-fedora.adoc[status-fedora SOP]).
 
 Report all information in #fedora-admin. Coordination is extremely
 important, it's rare for our group to meet in person and IRC is our only
@@ -210,10 +167,8 @@ Once the services are restored, an update to the status dashboard should
 be pushed to show the services are restored.
 
 [IMPORTANT]
-.Important
 ====
-Additionally update any SOP's that may have changed in the course of the
-outage
+Additionally update any SOP's that may have changed in the course of the outage
 ====
 
 === Unplanned Outage
@@ -228,8 +183,7 @@ let the team know. Messes can always be cleaned up after the outage.
 
 Some outages require immediate fixing, some don't. A page should never
 go out because someone can't sign the cla. Most of our admins are in US
-time, use your best judgment. If it's bad enough to warrant an emergency
-page, page one of the admins at: https://admin.fedoraproject.org/pager
+time, use your best judgment.
 
 Use the following as loose guidelines, just use your best judgment.
 
@@ -248,10 +202,10 @@ slashdot.
 
 After an outage has been verified, acknowledge the outage in nagios:
 https://admin.fedoraproject.org/nagios/, update the related system on
-the status dashboard (see the status-fedora SOP) and verify changes at
-http://status.fedoraproject.org, then head in to #fedora-admin to figure
-out who is around and coordinate the next course of action. Consult any
-relevent SOP's for corrective actions.
+the status dashboard (see the xref:status-fedora.adoc[status-fedora SOP])
+and verify changes at http://status.fedoraproject.org, then head in to
+#fedora-admin to figure out who is around and coordinate the next course
+of action. Consult any relevent SOP's for corrective actions.
 
 ==== Fix it
 
@@ -263,10 +217,9 @@ just don't be stupid about it.
 Can't fix it? Don't wait, Escalate! All of the team members have
 expertise with some areas of our environment and weaknesses in other
 areas. Never be afraid to tap another team member. Sometimes it's
-required, sometimes it's not. The last layer of defense is to page
-someone. At present our team is small enough that a full escalation path
-wouldn't do much good. Consult the contact information on each SOP for
-more information.
+required, sometimes it's not. At present our team is small enough that
+a full escalation path wouldn't do much good. Consult the contact
+information on each SOP for more information.
 
 ==== The Resolution
 
@@ -286,7 +239,6 @@ fedora-infrastructure-list.
 . What was the root cause?
 
 [IMPORTANT]
-.Important
 ====
 Number 4 is especially important. If a kernel build keeps failing
 because of issues with koji caused by a database failure caused by a