:experimental:
:toc:
= Service Level Expectations
The infrastructure team does not have any formal agreement or contract regarding
the availability of its different services. However, we do try our best to keep
services running, and as a result, you can have some expectations as to what
we will do to that end.
== Primary Business Hours
Fedora Infrastructure is a community team, involving volunteers as well as
people employed by Red Hat to work on Fedora.
However, despite the help of volunteers, primary business hours are mostly
aligned with the work schedule of Red Hat. Normal hours are
Monday through Friday from 1000 UTC to 2300 UTC, excluding US/EU national holidays
and a two-week end-of-year closure that affects staffing and response times.
Support outside of primary business hours is handled on an on-call basis and
depends on the availability of staff.
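
As an illustration only, the window described above could be checked with a few lines of Python. This is a minimal sketch, not a tool the team uses: the function name `in_primary_hours` and the `EXAMPLE_CLOSED_DAYS` set are hypothetical, and the holiday and closure dates must be supplied by the caller because this page does not list them.

```python
from datetime import date, datetime, timezone

# Hypothetical example entry: this page does not enumerate the US/EU
# holidays or the exact dates of the end-of-year closure.
EXAMPLE_CLOSED_DAYS = {date(2024, 12, 25)}


def in_primary_hours(moment, closed_days=frozenset()):
    """Return True if a timezone-aware datetime falls in primary hours."""
    utc = moment.astimezone(timezone.utc)
    if utc.date() in closed_days:
        return False
    if utc.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    # 1000 UTC is included, 2300 UTC is excluded.
    return 10 <= utc.hour < 23
```

Pass a timezone-aware `datetime`; a naive value would be interpreted in the local timezone of the machine running the check.
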
== Roles and Responsibilities
=== Fedora Infrastructure to Community
* Have staff present and available in appropriate communication channels to answer
questions during primary hours.
* Interact with community members with respect and courtesy.
* Work with community members to get accurate and thorough documentation of
incidents, problems, or feature requests.
* Resolve reported problems as soon as acknowledged if possible.
* Clearly communicate estimated resolution times.
* Move items which cannot be resolved within a reasonable time to future
feature requests, or close them out.
=== Community Members to Fedora Infrastructure
* Provide full and detailed reports of the problem or requested service.
* Provide clear and complete contact information and times when available.
* Leave alternative contacts who can also be available in case of vacation
or other emergencies.
* When contacted by Fedora Infrastructure, respond within 5 business days.
=== Fedora Infrastructure to Fedora Infrastructure
* Have a clear schedule of reachable hours.
* Set and take regular vacation time to be rested.
* Rotate through on-call days covering Matrix and tickets.
* If adding a new service, be available outside of normal business hours to
help debug problems.
* Follow procedures and checklists when adding or updating services.
* Help with regular audits of the documentation.
== Definition of Service Priorities
The general design of service priorities is that of concentric circles, where
items rely on services in their own circle or a circle below them.
. *Critical* services are ones which Fedora Infrastructure will work to keep available
24x7, with 52-week coverage if an unplanned outage occurs.
Services will be configured to be highly available with an estimated
planned/unplanned uptime of 95%. Response time should be within 1 hour during business
hours. Outside business hours, problems will be addressed when Fedora Infrastructure
staff are available.
. *Important* services are ones which Fedora Infrastructure will work to keep
available 24x7, with 50-week coverage. Response time should be within a day
during business days. Outside business days, problems will be addressed when
Fedora Infrastructure staff are available.
. *Normal* services are ones which Fedora Infrastructure will work to keep
available during primary work hours. Problems outside of these hours will
be looked at as people are available. The services may be available
outside of these hours, but they are of a lower priority than important services.
. *Low priority* services are ones which are not critical or important for
the primary function of Fedora Infrastructure. They will be worked on and
looked at during primary business hours.
. *Third Party* services are ones which Fedora Infrastructure has outsourced
tools and services to. Uptimes, service hours, and coverage are dictated
by the third party. Depending on the type of problem, Fedora Infrastructure
will act as an intermediary, or in the case of tools like retrace and COPR,
direct the user to talk with the service owners.
. *Deprecated* services are ones which Fedora Infrastructure is no longer
putting resources into. This may be because the project has completed its
mission, the upstream software is dead, or the original reasons for the
service no longer exist. Problems with these services will be looked at
during primary business hours. Responses may be mostly "Will Not Fix".
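
The concentric-circle rule stated at the start of this section can be sketched as a small dependency check. This is only an illustration under stated assumptions: it treats the first four priorities as the circles, assumes that a "circle below" means a more critical one, and uses hypothetical names (`CIRCLE`, `misplaced_dependencies`) along with whatever service names the caller provides.

```python
# Assumed ordering: a smaller number is a more central ("lower") circle.
CIRCLE = {"critical": 0, "important": 1, "normal": 2, "low": 3}


def misplaced_dependencies(priorities, depends_on):
    """Return (service, dependency) pairs that break the circle rule.

    priorities maps a service name to its priority label.
    depends_on maps a service name to the services it relies on.
    """
    problems = []
    for service, deps in depends_on.items():
        for dep in deps:
            # A dependency in an outer circle is less critical than
            # the service that relies on it, which the rule forbids.
            if CIRCLE[priorities[dep]] > CIRCLE[priorities[service]]:
                problems.append((service, dep))
    return problems
```

An empty result means every service relies only on services in its own circle or a more critical one.
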
== Limitations on Support
* Some services that are associated with Fedora are provided by third
parties. Changes and outages which affect them are outside the control
of Fedora Infrastructure.
* Fedora Infrastructure will prioritize issues and requests that affect
multiple people or teams over a smaller group or individual.
* Fedora Infrastructure has limited budget and hours. Requests and features
will be prioritized to fit within those.
* Fedora Infrastructure is bound by the laws and regulations of the United
States of America. This means that certain requests, changes and problems
are outside the ability of members to deal with.
== Glossary
* **Planned outage**: A planned outage is one that is announced sufficiently
ahead of time to allow most users to plan around it.
* **Unplanned outage**: An outage that occurs suddenly without proper
allowance for users to plan around it.
* **Scheduled outage**: An outage which has been scheduled to occur, but may
not have been announced with enough time for users to plan around it.
* **High Availability**: Systems are available during specified operating
hours with any unplanned outages 'masked' by other tools.
* **Continuous Operations**: Systems are available 24 hours a day, 7 days
a week, with no scheduled outages. Unplanned outages are possible during
this time.
* **Continuous Availability**: Systems or applications are available 24x7
with no planned or unplanned outages. This is a combination of high
availability and continuous operations.
* **Level of availability**:
[options=header]
|===
|Percentage | Max outage time per day
| 90% | 144.0 minutes
| 95% | 72.0 minutes
| 99% | 14.4 minutes
| 99.9% | 1.4 minutes
|===
* **Committed Hours of Availability**: Hours that an organization will have
staff available to help deal with issues with systems, services, and
applications. Also known as "Regular Business Hours".
* **Outage Hours**: Total number of hours of outage considered normal for
calculating achieved availability.
* **Response Time**: The time between the user's notification of the problem
and when the help desk begins to work on that problem.
* **Resolution Update**: The frequency of updates to tickets.
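
The figures in the *Level of availability* table follow from the 1440 minutes in a day. A minimal sketch of the arithmetic (the function name `max_outage_minutes_per_day` is illustrative, not part of any Fedora tooling):

```python
MINUTES_PER_DAY = 24 * 60  # 1440


def max_outage_minutes_per_day(availability_percent):
    """Minutes of outage per day allowed at a given availability level."""
    return MINUTES_PER_DAY * (100 - availability_percent) / 100


for pct in (90, 95, 99, 99.9):
    print(f"{pct}% -> {max_outage_minutes_per_day(pct):.2f} minutes")
```

At 99.9% the result is 1.44 minutes, which the table rounds to 1.4.
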
== Estimated Time of Resolution
By priority level:
* **Emergency**: Problems which are site-wide and affect the core functions
of the project. These problems are the top priority and should be solved as soon as possible.
Estimated time of resolution is within hours.
* **Urgent**: Problems which affect multiple functions and groups in the
project. These problems will be solved when there is no emergency going on.
Estimated time of resolution is within a day.
* **Normal**: Problems which prevent a single user from performing needed
duties. These problems will be looked at when staff are available.
Estimated time of resolution is within a week.
* **Low**: A request for service, instruction, or information that has no
immediate impact on services. These requests are the lowest priority.
Estimated time of resolution is within a month.