From f96af0744a3c6d1c92ebf4291a46a119bf1e5c03 Mon Sep 17 00:00:00 2001
From: Adam Saleh
Date: Wed, 31 Mar 2021 17:15:48 +0200
Subject: [PATCH] Added basic prometheus info to FAQ.

---
 docs/monitoring_metrics/faq.rst | 42 ++++++++++++++++++++++++++++++++-
 1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/docs/monitoring_metrics/faq.rst b/docs/monitoring_metrics/faq.rst
index 958b869..6fe2e24 100644
--- a/docs/monitoring_metrics/faq.rst
+++ b/docs/monitoring_metrics/faq.rst
@@ -12,15 +12,26 @@ How do I access zabbix when I'm a community member?
 
 How do I access Prometheus?
 ---------------------------
+Prometheus is running in the application monitoring namespace, standard routing applies,
+i.e.: https://prometheus-route-application-monitoring.app.os.stg.fedoraproject.org/graph
+
+To access it you need to have an account in the openshift instance it is running in.
 
 How do I access Prometheus when I'm a community member?
 -------------------------------------------------------
+You shouldn't access prometheus directly, unless you are maintaining an application in openshift.
+
+Data from prometheus can be exported and viewed in Grafana or Zabbix, meaning we can
+give access to a more limited public view through dashboards in one of these.
 
 Do you have a 5 minutes guide on how to use prometheus?
 -------------------------------------------------------
 In other words, do you have some how-tos/links I should read to understand/get
 started with prometheus?
 
+* quick introduction to the stack we are running: https://www.youtube.com/watch?v=-37OPXXhrTw
+* to get an idea of how to use it, look at sample queries: https://prometheus.io/docs/prometheus/latest/querying/examples/
+* for instrumentation, look at the libraries in https://github.com/prometheus/
 How do I get basic HW (disk, cpu, memory, network...) monitoring for a host?
 ----------------------------------------------------------------------------
 
@@ -30,10 +41,22 @@ How do I monitor a list of services?
 
 - pagure.io and src.fp.o have two different list of services to monitor they
   partly overlap but aren't exactly the same, how can I monitor them?
+  - For prometheus, metrics are usually exported through instrumentation,
+    meaning if e.g. pagure was instrumented to export a /metrics endpoint,
+    you just need to make sure you are collecting them, either because the service runs in openshift
+    and you configured appropriate ServiceMonitor or PodMonitor objects,
+    or, if it runs outside of openshift, through the additional scrape configuration of prometheus.
+    Because collected metrics are labeled, it is simple to distinguish which belong where.
 
 How do I get alerted for a service not running?
 -----------------------------------------------
 
+- Prometheus supports configuring rules for Alertmanager, which can then notify through various services.
+  You can learn about the configuration here: https://prometheus.io/docs/alerting/latest/configuration/#configuration-file
+  The rules specifying when to alert are defined in prometheus itself: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
+  You can specify them in CRDs in your project in a similar fashion as with ServiceMonitor.
+  To use IRC, there needs to be a separate gateway installed in a sidecar: https://github.com/google/alertmanager-irc-relay
+
 How can I tune the alerts?
 --------------------------
 
@@ -42,10 +65,14 @@ As in, who gets alerted? When? How?
 
 How do I ask for the service to be restarted times before being alerted?
 ----------------------------------------------------------------------------
+- In prometheus you can't. It is assumed you are using kubernetes, which would manage something like this for you.
 
 How do I monitor rabbitmq queues?
 ---------------------------------
 
+- In prometheus, according to https://www.rabbitmq.com/prometheus.html#overview-prometheus
+  you just need to make sure you are collecting the exported metrics.
+
 How do we alert about checks not passing to people outside of our teams?
 ------------------------------------------------------------------------
 -> the OSCI team is interesting in having notifications/monitoring for the CI
@@ -54,7 +81,7 @@ How do we alert about checks not passing to people outside of our teams?
 How can we chain a prometheus instance to ours?
 -----------------------------------------------
 This allows to consolidate in a single instance monitoring coming from different
-instances
+instances. This can be done by configuring federation in additional scrape configs: https://prometheus.io/docs/prometheus/latest/federation/
 
 Can we monitor rabbitmq queues in prometheus?
 ---------------------------------------------
@@ -64,13 +91,26 @@ How can I monitor the performances of my application?
 Number of requests served?
 Number of 500 errors?
 Number of DB connections?
+With prometheus, you need to instrument your application and configure prometheus to collect its metrics.
 
 How do I ack an alert so it stops alerting?
 -------------------------------------------
 
+With prometheus and Alertmanager, there is no way to just ACK an alert;
+it is assumed that something more high-level like opsgenie would take care of actually
+interacting with regular human ops people.
+
+For small enough teams, just silencing the alert in Alertmanager could be enough.
+
+There are sidecars that add a few more features to the barebones alerting,
+like https://github.com/prymitive/kthxbye.
+
 How do I pre-emptively stop a check before I start working on an outage?
 ------------------------------------------------------------------------
 In other words: I know that I'll cause an outage while working on , how do I
 turn off the checks for this service to avoid notifying admins while I'm working
 on it?
 
+
+In Prometheus and Alertmanager there are Silences, where you can set a time period during which
+notifications for certain alerts are muted. You are able to create and remove these through the REST API,