Added basic prometheus info to FAQ.

2021-03-31 17:15:48 +02:00 · 2021-03-31 17:15:48 +02:00 · f96af0744a
commit f96af0744a
parent 4c04cb3539
1 changed files with 41 additions and 1 deletions
--- a/docs/monitoring_metrics/faq.rst
+++ b/docs/monitoring_metrics/faq.rst
@ -12,15 +12,26 @@ How do I access zabbix when I'm a community member?

 How do I access Prometheus?
 ---------------------------
+Prometheus is running in the application monitoring namespace, standard routing applies,
+i.e.: https://prometheus-route-application-monitoring.app.os.stg.fedoraproject.org/graph
+
+To access it, you need to have an account in the openshift cluster it is running in.

 How do I access Prometheus when I'm a community member?
 -------------------------------------------------------
+You shouldn't access prometheus directly, unless you are maintaining an application in openshift.
+
+Data from prometheus can be exported and viewed in Grafana or Zabbix, meaning we can
+give access to a more limited public view through dashboards in one of these.

 Do you have a 5 minutes guide on how to use prometheus?
 -------------------------------------------------------

 In other words, do you have some how-tos/links I should read to understand/get
 started with prometheus?
+* quick introduction to the stack we are running: https://www.youtube.com/watch?v=-37OPXXhrTw
+* to get an idea of how to use it, look at the sample queries: https://prometheus.io/docs/prometheus/latest/querying/examples/
+* for instrumentation, look at the libraries in https://github.com/prometheus/

 How do I get basic HW (disk, cpu, memory, network...) monitoring for a host?
 ----------------------------------------------------------------------------
@ -30,10 +41,22 @@ How do I monitor a list of services?
  - pagure.io and src.fp.o have two different list of services to monitor
    they partly overlap but aren't exactly the same, how can I monitor them?

+  - For prometheus, metrics are usually exported through instrumentation,
+    meaning that if e.g. pagure was instrumented to export a /metrics endpoint,
+    you just need to make sure you are collecting them: either they run in openshift
+    and you configured appropriate ServiceMonitor or PodMonitor objects,
+    or, if they run outside of openshift, they are listed in the additional scrape configuration of prometheus.
+    Because collected metrics are labeled, it is simple to distinguish which belong where.
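
To illustrate how labels keep overlapping service lists apart, here is a minimal sketch that renders a few samples in the Prometheus text exposition format using only the Python standard library. The metric name ``service_up``, the label names and the host values are hypothetical, and a real application would normally use one of the client libraries instead of formatting lines by hand.

```python
def render_sample(name, labels, value):
    """Render one sample line in the Prometheus text exposition format.

    Simplified sketch: label values are not escaped, so they must not
    contain quotes, backslashes or newlines.
    """
    # Sort the labels so the output is stable between runs.
    pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    return f"{name}{{{pairs}}} {value}"


# Hypothetical metric and label names: the same metric is exported for
# both hosts, and the "instance" label tells the two apart.
samples = [
    render_sample("service_up", {"instance": "pagure.io", "service": "httpd"}, 1),
    render_sample("service_up", {"instance": "src.fedoraproject.org", "service": "httpd"}, 1),
    render_sample("service_up", {"instance": "pagure.io", "service": "pagure_worker"}, 0),
]

for line in samples:
    print(line)
```

A query can then select one host's services by filtering on the ``instance`` label, even though both hosts export the same metric name.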

 How do I get alerted for a service not running?
 -----------------------------------------------

+- Prometheus supports alerting rules; firing alerts are sent to Alertmanager, which can then notify through various services.
+  You can learn about the Alertmanager configuration here: https://prometheus.io/docs/alerting/latest/configuration/#configuration-file
+  The rules specifying when to alert are defined in prometheus itself: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
+  You can specify them in CRDs in your project, in a similar fashion as with ServiceMonitor.
+  To use IRC, a separate gateway needs to be installed in a sidecar: https://github.com/google/alertmanager-irc-relay
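
As a rough sketch of what a "service not running" rule could look like, the snippet below builds an alerting rule group as a Python dictionary and prints it as JSON, which YAML parsers generally accept. The fields (``alert``, ``expr``, ``for``, ``labels``, ``annotations``) follow the alerting rules documentation linked above; the job name ``pagure``, the alert name, the group name and the 5 minute delay are assumptions made for the example.

```python
import json


def service_down_rule(job, wait="5m"):
    """Build one alerting rule that fires when a scrape target stays down."""
    return {
        "alert": "ServiceDown",  # hypothetical alert name
        # "up" is set by Prometheus itself: 1 if the last scrape worked, else 0.
        "expr": f'up{{job="{job}"}} == 0',
        # Only fire once the condition has held for this long.
        "for": wait,
        "labels": {"severity": "critical"},
        "annotations": {
            # The {{ ... }} part is expanded by Prometheus, not by Python.
            "summary": job + " target {{ $labels.instance }} is down",
        },
    }


# Hypothetical group and job names.
rule_file = {
    "groups": [
        {"name": "availability", "rules": [service_down_rule("pagure")]},
    ]
}

print(json.dumps(rule_file, indent=2))
```

The ``for`` clause is what keeps a single failed scrape from paging anyone; the value to use depends on how quickly you need to know.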
+
 How can I tune the alerts?
 --------------------------

@ -42,10 +65,14 @@ As in, who gets alerted? When? How?
 How do I ask for the service to be restarted <X> times before being alerted?
 ----------------------------------------------------------------------------

+- In prometheus you can't. It is assumed you are using kubernetes that would manage something like this for you.

 How do I monitor rabbitmq queues?
 ---------------------------------

+- In prometheus, according to https://www.rabbitmq.com/prometheus.html#overview-prometheus
+  you just need to make sure you are collecting the exported metrics.
+
 How do we alert about checks not passing to people outside of our teams?
 ------------------------------------------------------------------------
  -> the OSCI team is interesting in having notifications/monitoring for the CI
@ -54,7 +81,7 @@ How do we alert about checks not passing to people outside of our teams?
 How can we chain a prometheus instance to ours? 
 -----------------------------------------------
 This allows to consolidate in a single instance monitoring coming from different
-instances
+instances. This can be done by configuring federation in the additional scrape configs: https://prometheus.io/docs/prometheus/latest/federation/
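
For a feel of what federation does on the wire, the sketch below builds the URL that a federating prometheus would scrape on another instance: the ``/federate`` endpoint with one ``match[]`` parameter per series selector, as described in the federation documentation linked above. The host ``prometheus.example.org`` and the selector are placeholders, not real endpoints.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(base_url, selectors):
    """Build the /federate URL for the given series selectors."""
    # One match[] parameter is sent per selector.
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Placeholder host and selector, used only for illustration.
url = federate_url("https://prometheus.example.org", ['{job="pagure"}'])
print(url)

# Decoding the query string again shows the selectors that were encoded.
print(parse_qs(urlsplit(url).query))
```

Opening such a URL in a browser or with curl is a quick way to check which series a selector would pull in before adding it to the scrape configuration.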

 Can we monitor rabbitmq queues in prometheus?
 ---------------------------------------------
@ -64,13 +91,26 @@ How can I monitor the performances of my application?

 Number of requests served? Number of 500 errors? Number of DB connections?

+With prometheus, you need to instrument your application and configure prometheus to collect its metrics.

 How do I ack an alert so it stops alerting?
 -------------------------------------------

+With prometheus and Alertmanager, there is no way to just ACK an alert;
+it is assumed that something more high-level, like opsgenie, would take care of actually
+interacting with regular human ops people.
+
+For small enough teams, just silencing the alert in alertmanager could be enough.
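
As a minimal sketch of what a silence contains, the snippet below builds the JSON body that, to our understanding, the Alertmanager v2 API accepts on ``POST /api/v2/silences``. It only constructs and prints the payload and does not contact any server; the alert name, author and comment are made up for the example.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(alertname, hours, author, comment, now=None):
    """Build a silence payload matching one alert name for a number of hours."""
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        # A silence applies to every alert whose labels match all matchers.
        "matchers": [
            {"name": "alertname", "value": alertname, "isRegex": False, "isEqual": True},
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": author,
        "comment": comment,
    }


# Hypothetical values, for illustration only.
silence = build_silence("ServiceDown", 2, "admin", "being looked at")
print(json.dumps(silence, indent=2))
```

Because a silence always has an end time, a forgotten one expires by itself instead of hiding the alert forever.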
+
+There are sidecars that add a few more features to the barebones alerting,
+such as https://github.com/prymitive/kthxbye.
+
 How do I pre-emptively stop a check before I start working on an outage?
 ------------------------------------------------------------------------

 In other words: I know that I'll cause an outage while working on <service>, how
 do I turn off the checks for this service to avoid notifying admins while I'm
 working on it?
+
+With Prometheus and Alertmanager, there are Silences, which let you set a time window during which
+certain alerts do not send notifications. You can create and remove these through the REST API,