Frequently Asked Questions
==========================

Here is a list of questions and answers that should help you get started with
monitoring using Zabbix and Prometheus.

How do I access zabbix?
-----------------------

How do I access zabbix when I'm a community member?
---------------------------------------------------

How do I access Prometheus?
---------------------------
Prometheus runs in the application monitoring namespace and standard routing applies,
e.g.: https://prometheus-route-application-monitoring.app.os.stg.fedoraproject.org/graph

To access it, you need an account on the OpenShift cluster it runs in.

How do I access Prometheus when I'm a community member?
-------------------------------------------------------
You shouldn't access Prometheus directly unless you are maintaining an application in OpenShift.

Data from Prometheus can be exported and viewed in Grafana or Zabbix, meaning we can
give access to a more limited public view through dashboards in one of these.

Do you have a 5-minute guide on how to use Prometheus?
-------------------------------------------------------

In other words, do you have some how-tos/links I should read to understand/get
started with Prometheus?

* For a quick introduction to the stack we are running: https://www.youtube.com/watch?v=-37OPXXhrTw
* To get an idea of how to use it, look at the sample queries: https://prometheus.io/docs/prometheus/latest/querying/examples/
* For instrumentation, look at the libraries in https://github.com/prometheus/
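
The sample queries can also be run outside the web UI, through the Prometheus
HTTP API. The following is a minimal sketch, not a full client: it only builds
the URL of an instant query against ``/api/v1/query``. The base URL is the
staging route quoted earlier in this FAQ; the helper name ``build_query_url``
and the example expression are illustrative assumptions, and a real request
would also need to authenticate against OpenShift, which is not shown here.

```python
from urllib.parse import urlencode

# Staging route quoted earlier in this FAQ (without the /graph suffix).
PROMETHEUS_URL = "https://prometheus-route-application-monitoring.app.os.stg.fedoraproject.org"


def build_query_url(expression, base_url=PROMETHEUS_URL):
    """Return the URL of an instant query for the given PromQL expression."""
    # urlencode takes care of escaping spaces, quotes and braces in PromQL.
    return f"{base_url}/api/v1/query?{urlencode({'query': expression})}"


# ``up`` is recorded by Prometheus for every scrape target (1 = reachable),
# so ``up == 0`` lists the targets that are currently down.
print(build_query_url("up == 0"))
```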

How do I get basic HW (disk, cpu, memory, network...) monitoring for a host?
----------------------------------------------------------------------------

How do I monitor a list of services?
------------------------------------
- pagure.io and src.fp.o have two different lists of services to monitor.
  They partly overlap but aren't exactly the same. How can I monitor them?

- For Prometheus, metrics are usually exported through instrumentation.
  If, for example, pagure were instrumented to expose a ``/metrics`` endpoint,
  you would just need to make sure the metrics are collected: either the
  application runs in OpenShift and you have configured the appropriate
  ServiceMonitor or PodMonitor objects, or, if it runs outside of OpenShift,
  it is listed in the additional scrape configuration of Prometheus.
  Because collected metrics are labeled, it is simple to distinguish which belong where.
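
For the OpenShift case, a ServiceMonitor could look roughly like the object
generated below. This is a sketch under assumptions: the object name,
namespace, ``app`` label and port name are hypothetical placeholders; only the
overall ``monitoring.coreos.com/v1`` ServiceMonitor layout is taken from the
Prometheus Operator. The manifest is emitted as JSON, which ``oc apply -f``
accepts just like YAML.

```python
import json


def service_monitor(name, namespace, app_label, port="web", path="/metrics"):
    """Build a ServiceMonitor that scrapes Services labeled app=<app_label>."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Which Service objects to scrape (hypothetical label).
            "selector": {"matchLabels": {"app": app_label}},
            # ``port`` is the *name* of the Service port exposing the metrics.
            "endpoints": [{"port": port, "path": path, "interval": "30s"}],
        },
    }


# Hypothetical values for a pagure deployment.
print(json.dumps(service_monitor("pagure", "pagure", "pagure"), indent=2))
```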

How do I get alerted for a service not running?
-----------------------------------------------

- Prometheus supports configuring rules for Alertmanager, which can then notify through various services.
  You can learn about the Alertmanager configuration here:
  https://prometheus.io/docs/alerting/latest/configuration/#configuration-file

  The rules specifying when to alert are defined in Prometheus itself:
  https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

  You can specify them as CRDs in your project, in a similar fashion to ServiceMonitor objects.

  To use IRC, a separate gateway needs to be installed as a sidecar:
  https://github.com/google/alertmanager-irc-relay
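
To make this more concrete, here is a sketch of an alerting rule for "service
not running", generated as a PrometheusRule object. The rule name, the ``job``
label value, the five-minute delay and the ``severity`` label are all
assumptions chosen for illustration; adapt them to how your targets are
actually labeled.

```python
import json


def service_down_rule(name, namespace, job, wait="5m"):
    """Build a PrometheusRule that fires when every scrape of ``job`` fails."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "groups": [{
                "name": f"{name}.rules",
                "rules": [{
                    "alert": "ServiceDown",
                    # ``up`` is 0 when Prometheus could not scrape the target.
                    "expr": f'up{{job="{job}"}} == 0',
                    # Only fire once the condition has held for this long.
                    "for": wait,
                    "labels": {"severity": "critical"},
                    "annotations": {
                        "summary": f"{job} has been unreachable for {wait}",
                    },
                }],
            }],
        },
    }


# Hypothetical values for a pagure deployment.
print(json.dumps(service_down_rule("pagure-alerts", "pagure", "pagure"), indent=2))
```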

How can I tune the alerts?
--------------------------

As in, who gets alerted? When? How?

How do I ask for the service to be restarted <X> times before being alerted?
----------------------------------------------------------------------------

- In Prometheus you can't. It is assumed that you are running on Kubernetes, which manages restarts like this for you.

How do I monitor rabbitmq queues?
---------------------------------

- In Prometheus, you just need to make sure you are collecting the metrics that RabbitMQ exports, see
  https://www.rabbitmq.com/prometheus.html#overview-prometheus

How do we alert about checks not passing to people outside of our teams?
------------------------------------------------------------------------
For example, the OSCI team is interested in having notifications/monitoring for the CI
queues in RabbitMQ.

How can we chain a prometheus instance to ours? 
-----------------------------------------------
Chaining lets us consolidate, in a single instance, the monitoring data coming from different
instances. This can be done by configuring federation in the additional scrape configs:
https://prometheus.io/docs/prometheus/latest/federation/
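
Federation works by scraping the ``/federate`` endpoint of the other instance,
with one ``match[]`` parameter per series selector. The sketch below only
builds such a URL, which can be handy to check what a federating instance
would pull before touching the scrape configuration. The helper name and the
selectors are illustrative assumptions.

```python
from urllib.parse import urlencode


def build_federate_url(base_url, selectors):
    """Return the /federate URL exposing the series matched by ``selectors``."""
    # Each selector is passed as a repeated ``match[]`` query parameter.
    params = [("match[]", selector) for selector in selectors]
    return f"{base_url}/federate?{urlencode(params)}"


# Hypothetical source instance and selectors.
print(build_federate_url("http://prometheus.example.org:9090",
                         ['{job="pagure"}', "up"]))
```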

Can we monitor rabbitmq queues in prometheus?
---------------------------------------------

How can I monitor the performance of my application?
-----------------------------------------------------

Number of requests served? Number of 500 errors? Number of DB connections?

With Prometheus, you need to instrument your application and configure Prometheus to collect its metrics.
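
To show what "instrumenting" boils down to, here is a sketch that counts
responses per status code and serves them on ``/metrics`` in the Prometheus
text exposition format. It deliberately uses only the Python standard library;
a real application would normally use one of the official client libraries
instead. The metric name ``http_requests_total``, the ``code`` label and the
port are assumptions made for this example.

```python
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical application state: responses served so far, keyed by status code.
RESPONSES = Counter()


def render_metrics(responses):
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total number of HTTP requests served.",
        "# TYPE http_requests_total counter",
    ]
    for code, count in sorted(responses.items()):
        lines.append(f'http_requests_total{{code="{code}"}} {count}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the current counters on /metrics and a 404 everywhere else."""

    def do_GET(self):
        if self.path == "/metrics":
            status, body = 200, render_metrics(RESPONSES).encode()
        else:
            status, body = 404, b"not found\n"
        RESPONSES[str(status)] += 1
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the example server; Prometheus would scrape /metrics on this port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Once such a counter is collected, a query like
``sum(rate(http_requests_total{code="500"}[5m]))`` gives the rate of 500 errors.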

How do I ack an alert so it stops alerting?
-------------------------------------------

With Prometheus and Alertmanager, there is no way to just ACK an alert.
It is assumed that a higher-level tool such as Opsgenie takes care of actually
interacting with the human ops people.

For small enough teams, just silencing the alert in Alertmanager could be enough.

There are also sidecars that add a few more features to the bare-bones alerting,
such as https://github.com/prymitive/kthxbye.

How do I pre-emptively stop a check before I start working on an outage?
------------------------------------------------------------------------

In other words: I know that I'll cause an outage while working on <service>, how
do I turn off the checks for this service to avoid notifying admins while I'm
working on it?

In Prometheus and Alertmanager there are Silences, which let you set a time window during which
certain alerts will not notify anyone. You can create and remove these through the REST API.
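
As a sketch of what creating a silence through the Alertmanager v2 API can
look like, the code below builds the JSON payload and the ``POST`` request for
``/api/v2/silences`` without sending it. It assumes that your alerts carry a
``service`` label to match on; that label, the Alertmanager URL and the author
name are hypothetical and need to be adapted to the actual setup.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib.request import Request


def silence_payload(service, hours, author, comment, now=None):
    """Build a silence covering alerts labeled service=<service> for ``hours``."""
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        # Assumption: alerts carry a ``service`` label.
        "matchers": [{"name": "service", "value": service, "isRegex": False}],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": author,
        "comment": comment,
    }


def silence_request(alertmanager_url, payload):
    """Prepare (but do not send) the request that creates the silence."""
    return Request(
        f"{alertmanager_url}/api/v2/silences",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical two-hour maintenance window on pagure.
payload = silence_payload("pagure", 2, "admin", "planned maintenance")
request = silence_request("http://alertmanager.example.org:9093", payload)
print(request.get_method(), request.full_url)
```

Passing ``request`` to ``urllib.request.urlopen`` would actually create the
silence. The response contains a ``silenceID``, which can later be used with
``DELETE /api/v2/silence/<silenceID>`` to remove the silence before it expires.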