Review outage SOP
Signed-off-by: Michal Konečný <mkonecny@redhat.com>
parent 3b483e7cd3
commit eb2eab1f60
2 changed files with 45 additions and 93 deletions
|
@@ -83,7 +83,7 @@
|
||||||
** xref:openqa.adoc[OpenQA Infrastructure - SOP]
|
** xref:openqa.adoc[OpenQA Infrastructure - SOP]
|
||||||
** xref:openshift.adoc[OpenShift - SOP]
|
** xref:openshift.adoc[OpenShift - SOP]
|
||||||
** xref:openvpn.adoc[OpenVPN - SOP]
|
** xref:openvpn.adoc[OpenVPN - SOP]
|
||||||
** xref:outage.adoc[outage - SOP in review ]
|
** xref:outage.adoc[Outage Infrastructure - SOP]
|
||||||
** xref:packagedatabase.adoc[packagedatabase - SOP in review ]
|
** xref:packagedatabase.adoc[packagedatabase - SOP in review ]
|
||||||
** xref:packagereview.adoc[packagereview - SOP in review ]
|
** xref:packagereview.adoc[packagereview - SOP in review ]
|
||||||
** xref:pagure.adoc[pagure - SOP in review ]
|
** xref:pagure.adoc[pagure - SOP in review ]
|
||||||
|
|
|
@@ -4,64 +4,26 @@ What to do when there's an outage or when you're planning to take an
|
||||||
outage.
|
outage.
|
||||||
|
|
||||||
== Contents
|
== Contents
|
||||||
____
|
* <<_contact_information>>
|
||||||
[arabic]
|
* <<_users_no_access>>
|
||||||
. Contact Information
|
** <<_planned_outage>>
|
||||||
. Users (No Access)
|
*** <<_contacts>>
|
||||||
____
|
** <<_unplanned_outage>>
|
||||||
|
*** <<_check_first>>
|
||||||
____
|
*** <<_reporting_or_participating_in_an_outage>>
|
||||||
[arabic]
|
* <<_infrastructure_members_admin_access>>
|
||||||
. Planned Outage
|
** <<_planned_outage>>
|
||||||
____
|
*** <<_planning>>
|
||||||
|
*** <<_preparations>>
|
||||||
____
|
*** <<_outage>>
|
||||||
[arabic]
|
*** <<_post_outage_cleanup>>
|
||||||
. Contacts
|
** <<_unplanned_outage>>
|
||||||
____
|
*** <<_determine_severity>>
|
||||||
|
*** <<_first_steps>>
|
||||||
____
|
*** <<_fix_it>>
|
||||||
[arabic, start=2]
|
*** <<_escalate>>
|
||||||
. Unplanned Outage
|
*** <<_the_resolution>>
|
||||||
____
|
*** <<_the_aftermath>>
|
||||||
|
|
||||||
____
|
|
||||||
[arabic]
|
|
||||||
. Check first
|
|
||||||
. Reporting or participating in an outage
|
|
||||||
____
|
|
||||||
|
|
||||||
--
|
|
||||||
____
|
|
||||||
[arabic, start=5]
|
|
||||||
. Infrastructure Members (Admin Access)
|
|
||||||
____
|
|
||||||
--
|
|
||||||
____
|
|
||||||
[arabic]
|
|
||||||
. Planned Outage
|
|
||||||
____
|
|
||||||
____
|
|
||||||
[arabic]
|
|
||||||
. Planning
|
|
||||||
. Preparations
|
|
||||||
. Outage
|
|
||||||
. Post outage cleanup
|
|
||||||
____
|
|
||||||
|
|
||||||
____
|
|
||||||
[arabic, start=2]
|
|
||||||
. Unplanned Outage
|
|
||||||
____
|
|
||||||
____
|
|
||||||
[arabic]
|
|
||||||
. Determine Severity
|
|
||||||
. First Steps
|
|
||||||
. Fix it
|
|
||||||
. Escalate
|
|
||||||
. The Resolution
|
|
||||||
. The Aftermath
|
|
||||||
____
|
|
||||||
|
|
||||||
== Contact Information
|
== Contact Information
|
||||||
|
|
||||||
|
@@ -75,13 +37,10 @@ Servers::
|
||||||
Any
|
Any
|
||||||
Purpose::
|
Purpose::
|
||||||
This SOP is generic for any outage
|
This SOP is generic for any outage
|
||||||
Emergency:::
|
|
||||||
https://admin.fedoraproject.org/pager
|
|
||||||
|
|
||||||
== Users (No Access)
|
== Users (No Access)
|
||||||
|
|
||||||
[NOTE]
|
[NOTE]
|
||||||
.Note
|
|
||||||
====
|
====
|
||||||
Don't have shell access? Doesn't matter. Stop by and stay in
|
Don't have shell access? Doesn't matter. Stop by and stay in
|
||||||
#fedora-admin if you have any expertise in what is going on, please
|
#fedora-admin if you have any expertise in what is going on, please
|
||||||
|
@@ -100,7 +59,7 @@ a koji outage, let someone know.
|
||||||
==== Contacts
|
==== Contacts
|
||||||
|
|
||||||
Pretty much all coordination occurs in #fedora-admin on
|
Pretty much all coordination occurs in #fedora-admin on
|
||||||
irc.freenode.net. Stop by there to watch more about what's going on.
|
https://libera.chat/[libera.chat]. Stop by there to watch more about what's going on.
|
||||||
Just stay on topic.
|
Just stay on topic.
|
||||||
|
|
||||||
=== Unplanned Outage
|
=== Unplanned Outage
|
||||||
|
@@ -119,36 +78,34 @@ reported outage that may be causing and/or related to your issue.
|
||||||
==== Reporting or participating in an outage
|
==== Reporting or participating in an outage
|
||||||
|
|
||||||
If you think you've found an outage, get as much information as you can
|
If you think you've found an outage, get as much information as you can
|
||||||
about it at a glance. Copy any errors you get to http://pastebin.ca/.
|
about it at a glance. Copy any errors you get to https://paste.centos.org/.
|
||||||
Use the following guidelines:
|
Use the following guidelines:
|
||||||
|
|
||||||
Don't be general.::
|
Don't be general::
|
||||||
* BAD: "The wiki is acting slow"
|
* BAD: "The wiki is acting slow"
|
||||||
* Good: "Whenever I try to save
|
* Good: "Whenever I try to save
|
||||||
https://fedoraproject.org/wiki/Infrastructure, I get a proxy error
|
https://fedoraproject.org/wiki/Infrastructure, I get a proxy error
|
||||||
after 60 seconds"
|
after 60 seconds"
|
||||||
Don't report an outage that's already been reported.::
|
|
||||||
|
Don't report an outage that's already been reported::
|
||||||
* BAD: "/join #fedora-admin; Is the build system broken?"
|
* BAD: "/join #fedora-admin; Is the build system broken?"
|
||||||
* Good: "/join #fedora-admin; wait a minute or two; I noticed I can't
|
* Good: "/join #fedora-admin; wait a minute or two; I noticed I can't
|
||||||
submit builds, here's the error I get:"
|
submit builds, here's the error I get:"
|
||||||
Don't suggest drastic or needless changes during an outage (send it to
|
|
||||||
the list)::
|
Don't suggest drastic or needless changes during an outage (send it to the list)::
|
||||||
* "Why don't you just use lighttpd?"
|
* "Why don't you just use lighttpd?"
|
||||||
* "You could try limiting MaxRequestsPerChild in Apache"
|
* "You could try limiting MaxRequestsPerChild in Apache"
|
||||||
Don't get off topic or be too chatty::
|
Don't get off topic or be too chatty::
|
||||||
* "Transformers was awesome, but yeah, I think you guys know what to
|
* "Transformers was awesome, but yeah, I think you guys know what to
|
||||||
do next"
|
do next"
|
||||||
Do research the technologies we're using and answer questions that may
|
|
||||||
come up.::
|
|
||||||
* BAD: "Can't you just fix it?"
|
|
||||||
* {blank}
|
|
||||||
+
|
|
||||||
Good: "Hey guys, I think this is what you're looking for:;;
|
|
||||||
http://httpd.apache.org/docs/2.2/mod/mod_mime.html#addencoding"
|
|
||||||
|
|
||||||
If no one can be contacted after 10 minutes or so please see the section
|
Do research the technologies we're using and answer questions that may come up::
|
||||||
below called Determine Severity to determine whether or not someone
|
* BAD: "Can't you just fix it?"
|
||||||
should get paged.
|
* Good: "Hey guys, I think this is what you're looking for:
|
||||||
|
http://httpd.apache.org/docs/2.2/mod/mod_mime.html#addencoding"
|
||||||
|
|
||||||
|
Please try to contact OnCall first. This could be done by typing `.oncall`
|
||||||
|
in #fedora-admin channel.
|
||||||
|
|
||||||
== Infrastructure Members (Admin Access)
|
== Infrastructure Members (Admin Access)
|
||||||
|
|
||||||
|
@@ -189,7 +146,7 @@ reporting. https://admin.fedoraproject.org/nagios/
|
||||||
|
|
||||||
Prior to beginning an outage to any monitored service on
|
Prior to beginning an outage to any monitored service on
|
||||||
http://status.fedoraproject.org please push an update to reflect the
|
http://status.fedoraproject.org please push an update to reflect the
|
||||||
outage (see status-fedora SOP).
|
outage (see xref:status-fedora.adoc[status-fedora SOP]).
|
||||||
|
|
||||||
Report all information in #fedora-admin. Coordination is extremely
|
Report all information in #fedora-admin. Coordination is extremely
|
||||||
important, it's rare for our group to meet in person and IRC is our only
|
important, it's rare for our group to meet in person and IRC is our only
|
||||||
|
@@ -210,10 +167,8 @@ Once the services are restored, an update to the status dashboard should
|
||||||
be pushed to show the services are restored.
|
be pushed to show the services are restored.
|
||||||
|
|
||||||
[IMPORTANT]
|
[IMPORTANT]
|
||||||
.Important
|
|
||||||
====
|
====
|
||||||
Additionally update any SOP's that may have changed in the course of the
|
Additionally update any SOP's that may have changed in the course of the outage
|
||||||
outage
|
|
||||||
====
|
====
|
||||||
|
|
||||||
=== Unplanned Outage
|
=== Unplanned Outage
|
||||||
|
@@ -228,8 +183,7 @@ let the team know. Messes can always be cleaned up after the outage.
|
||||||
|
|
||||||
Some outages require immediate fixing, some don't. A page should never
|
Some outages require immediate fixing, some don't. A page should never
|
||||||
go out because someone can't sign the cla. Most of our admins are in US
|
go out because someone can't sign the cla. Most of our admins are in US
|
||||||
time, use your best judgment. If it's bad enough to warrant an emergency
|
time, use your best judgment.
|
||||||
page, page one of the admins at: https://admin.fedoraproject.org/pager
|
|
||||||
|
|
||||||
Use the following as loose guidelines, just use your best judgment.
|
Use the following as loose guidelines, just use your best judgment.
|
||||||
|
|
||||||
|
@@ -248,10 +202,10 @@ slashdot.
|
||||||
|
|
||||||
After an outage has been verified, acknowledge the outage in nagios:
|
After an outage has been verified, acknowledge the outage in nagios:
|
||||||
https://admin.fedoraproject.org/nagios/, update the related system on
|
https://admin.fedoraproject.org/nagios/, update the related system on
|
||||||
the status dashboard (see the status-fedora SOP) and verify changes at
|
the status dashboard (see the xref:status-fedora.adoc[status-fedora SOP])
|
||||||
http://status.fedoraproject.org, then head in to #fedora-admin to figure
|
and verify changes at http://status.fedoraproject.org, then head in to
|
||||||
out who is around and coordinate the next course of action. Consult any
|
#fedora-admin to figure out who is around and coordinate the next course
|
||||||
relevent SOP's for corrective actions.
|
of action. Consult any relevent SOP's for corrective actions.
|
||||||
|
|
||||||
==== Fix it
|
==== Fix it
|
||||||
|
|
||||||
|
@@ -263,10 +217,9 @@ just don't be stupid about it.
|
||||||
Can't fix it? Don't wait, Escalate! All of the team members have
|
Can't fix it? Don't wait, Escalate! All of the team members have
|
||||||
expertise with some areas of our environment and weaknesses in other
|
expertise with some areas of our environment and weaknesses in other
|
||||||
areas. Never be afraid to tap another team member. Sometimes it's
|
areas. Never be afraid to tap another team member. Sometimes it's
|
||||||
required, sometimes it's not. The last layer of defense is to page
|
required, sometimes it's not. At present our team is small enough that
|
||||||
someone. At present our team is small enough that a full escalation path
|
a full escalation path wouldn't do much good. Consult the contact
|
||||||
wouldn't do much good. Consult the contact information on each SOP for
|
information on each SOP for more information.
|
||||||
more information.
|
|
||||||
|
|
||||||
==== The Resolution
|
==== The Resolution
|
||||||
|
|
||||||
|
@@ -286,7 +239,6 @@ fedora-infrastructure-list.
|
||||||
. What was the root cause?
|
. What was the root cause?
|
||||||
|
|
||||||
[IMPORTANT]
|
[IMPORTANT]
|
||||||
.Important
|
|
||||||
====
|
====
|
||||||
Number 4 is especially important. If a kernel build keeps failing
|
Number 4 is especially important. If a kernel build keeps failing
|
||||||
because of issues with koji caused by a database failure caused by a
|
because of issues with koji caused by a database failure caused by a
|
||||||
|
|