2021-03-25 10:41:06 +01:00

Frequently Asked Questions
==========================

Here is a list of questions and answers that should help you get started with
monitoring with Zabbix and Prometheus.

How do I access Zabbix?
-----------------------

How do I access Zabbix when I'm a community member?
---------------------------------------------------

How do I access Prometheus?
---------------------------

Prometheus is running in the application monitoring namespace, and standard routing applies,
i.e.: https://prometheus-route-application-monitoring.app.os.stg.fedoraproject.org/graph

To access it, you need to have an account in the OpenShift cluster it is running in.

How do I access Prometheus when I'm a community member?
-------------------------------------------------------

You shouldn't access Prometheus directly unless you are maintaining an application in OpenShift.

Data from Prometheus can be exported and viewed in Grafana or Zabbix, meaning we can
give access to a more limited public view through dashboards in one of these.

Do you have a 5-minute guide on how to use Prometheus?
-------------------------------------------------------

In other words, do you have some how-tos/links I should read to understand/get
started with Prometheus?

* For a quick introduction to the stack we are running: https://www.youtube.com/watch?v=-37OPXXhrTw
* To get an idea of how to use it, look at the sample queries: https://prometheus.io/docs/prometheus/latest/querying/examples/
* For instrumentation, look at the libraries in https://github.com/prometheus/
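
Queries like the sample ones linked above can also be sent to the Prometheus HTTP API instead of being typed into the graph page. The following is a minimal sketch that only builds the URL of an instant query; the metric name ``http_requests_total`` is an illustrative assumption, and actually sending the request would additionally need valid credentials for the OpenShift route, which are not shown here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base URL taken from the Prometheus route mentioned in this FAQ.
PROMETHEUS_URL = "https://prometheus-route-application-monitoring.app.os.stg.fedoraproject.org"


def instant_query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build the URL of an instant query (/api/v1/query) for a PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


# Per-second request rate over the last 5 minutes (hypothetical metric name).
url = instant_query_url("rate(http_requests_total[5m])")
print(url)
```

The expression is URL-encoded, so braces, quotes and brackets in PromQL selectors survive the round trip.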

How do I get basic HW (disk, CPU, memory, network...) monitoring for a host?
----------------------------------------------------------------------------

How do I monitor a list of services?
------------------------------------

- pagure.io and src.fp.o have two different lists of services to monitor;
  they partly overlap but aren't exactly the same. How can I monitor them?

- For Prometheus, metrics are usually exported through instrumentation,
  meaning that if, e.g., Pagure was instrumented to export a /metrics endpoint,
  you just need to make sure you are collecting them: either because they run in OpenShift
  and you configured the appropriate ServiceMonitor or PodMonitor objects,
  or, if they run outside of OpenShift, through the additional scrape configuration of Prometheus.
  Because collected metrics are labeled, it is simple to distinguish which belong where.
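
As a rough illustration of the OpenShift case, the sketch below generates a ServiceMonitor manifest as JSON, which ``oc apply -f`` accepts just like YAML. The object name ``pagure-metrics``, the ``app: pagure`` label, the port name ``web`` and the 30s interval are assumptions made up for the example, not values from our deployment.

```python
import json


def service_monitor(name: str, app_label: str, port: str = "web") -> dict:
    """Return a ServiceMonitor manifest selecting Services labeled app=<app_label>."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name},
        "spec": {
            # Which Services to scrape (hypothetical label).
            "selector": {"matchLabels": {"app": app_label}},
            # Named Service port and path where the metrics are exposed.
            "endpoints": [{"port": port, "path": "/metrics", "interval": "30s"}],
        },
    }


manifest = service_monitor("pagure-metrics", "pagure")
print(json.dumps(manifest, indent=2))
```

Two services with different sets of metrics simply get two such objects with different selectors; the labels attached at scrape time keep their series apart.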

How do I get alerted for a service not running?
-----------------------------------------------

- Prometheus supports configuring rules for Alertmanager, which can then notify through various services.
  You can learn about the configuration here: https://prometheus.io/docs/alerting/latest/configuration/#configuration-file
  The rules specifying when to alert are defined in Prometheus itself: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
  You can specify them in CRDs in your project, in a similar fashion to ServiceMonitor objects.
  To use IRC, a separate gateway needs to be installed in a sidecar: https://github.com/google/alertmanager-irc-relay
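
To give an idea of what such a rule CRD could look like, here is a sketch that builds a PrometheusRule manifest firing when a scrape target has been down for some time. The alert name ``ServiceDown``, the job name ``pagure``, the ``severity`` label and the 5m duration are illustrative assumptions; only the ``up`` metric, which Prometheus records for every scrape target, is standard.

```python
import json


def service_down_rule(job: str, for_duration: str = "5m") -> dict:
    """Return a PrometheusRule that alerts when no scrape of `job` has succeeded."""
    rule = {
        "alert": "ServiceDown",
        # `up` is 1 when the last scrape of a target succeeded, 0 otherwise.
        "expr": f'up{{job="{job}"}} == 0',
        # The condition must hold this long before the alert fires.
        "for": for_duration,
        "labels": {"severity": "critical"},
        "annotations": {"summary": f"{job} has been unreachable for {for_duration}"},
    }
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": f"{job}-down"},
        "spec": {"groups": [{"name": f"{job}.rules", "rules": [rule]}]},
    }


rule_manifest = service_down_rule("pagure")
print(json.dumps(rule_manifest, indent=2))
```

Where the notification ends up (mail, IRC relay, ...) is decided by the Alertmanager configuration, not by the rule itself.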

How can I tune the alerts?
--------------------------

As in, who gets alerted? When? How?

How do I ask for the service to be restarted <X> times before being alerted?
----------------------------------------------------------------------------

- In Prometheus you can't. It is assumed you are using Kubernetes, which would manage something like this for you.

How do I monitor RabbitMQ queues?
---------------------------------

- In Prometheus, according to https://www.rabbitmq.com/prometheus.html#overview-prometheus,
  you just need to make sure you are collecting the exported metrics.

How do we alert about checks not passing to people outside of our teams?
------------------------------------------------------------------------

-> the OSCI team is interested in having notifications/monitoring for the CI
queues in RabbitMQ

How can we chain a Prometheus instance to ours?
-----------------------------------------------

This allows us to consolidate, in a single instance, monitoring coming from different
instances. This can be done by configuring federation in the additional scrape configs: https://prometheus.io/docs/prometheus/latest/federation/
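
A federation job scrapes the ``/federate`` endpoint of another Prometheus and pulls in the series selected by ``match[]``. The sketch below prints such a scrape job as JSON, which is also valid YAML and so can be pasted into a scrape configuration. The target host ``other-prometheus.example.org:9090`` and the ``{job="rabbitmq"}`` selector are placeholders, not real endpoints of ours.

```python
import json


def federation_job(targets, matchers, interval: str = "15s") -> dict:
    """Return a scrape job that federates selected series from other Prometheus servers."""
    return {
        "job_name": "federate",
        "scrape_interval": interval,
        # Keep the labels of the source server instead of overwriting them.
        "honor_labels": True,
        "metrics_path": "/federate",
        # Series selectors to pull; at least one match[] is required.
        "params": {"match[]": list(matchers)},
        "static_configs": [{"targets": list(targets)}],
    }


job = federation_job(["other-prometheus.example.org:9090"], ['{job="rabbitmq"}'])
print(json.dumps([job], indent=2))
```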

Can we monitor RabbitMQ queues in Prometheus?
---------------------------------------------

How can I monitor the performance of my application?
-----------------------------------------------------

Number of requests served? Number of 500 errors? Number of DB connections?

With Prometheus, you need to instrument your application and configure Prometheus to collect its metrics.
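
To show what instrumentation boils down to, the following sketch counts requests per HTTP status code and renders them in the Prometheus text exposition format, which is what a ``/metrics`` endpoint returns. It is written by hand with the standard library only to keep the example self-contained; a real application would normally use one of the official client libraries, and the metric name ``http_requests_total`` is an assumption.

```python
from collections import Counter

# In-memory request counters keyed by HTTP status code.
requests_by_status = Counter()


def record_request(status: int) -> None:
    """Count one served request with the given HTTP status code."""
    requests_by_status[str(status)] += 1


def render_metrics() -> str:
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total number of HTTP requests served.",
        "# TYPE http_requests_total counter",
    ]
    for status, count in sorted(requests_by_status.items()):
        lines.append(f'http_requests_total{{status="{status}"}} {count}')
    return "\n".join(lines) + "\n"


for status in (200, 200, 500):
    record_request(status)
print(render_metrics(), end="")
```

With the status code as a label, the number of 500 errors is just a label selector on the same metric rather than a separate counter.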

How do I ack an alert so it stops alerting?
-------------------------------------------

With Prometheus and Alertmanager, there is no way to just ACK an alert;
it is assumed that something more high-level, like Opsgenie, would take care of actually
interacting with regular human ops people.

For small enough teams, just using a silence on the alert in Alertmanager could be enough.

There are sidecars that add a few more features to the bare-bones alerting,
like https://github.com/prymitive/kthxbye.
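
Silences can be created in the Alertmanager web UI or through its HTTP API. The sketch below prepares, without sending it, a ``POST /api/v2/silences`` request that silences one alert name for a couple of hours. The Alertmanager URL, the alert name ``ServiceDown`` and the author are hypothetical, and any authentication in front of Alertmanager is left out.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib.request import Request

ALERTMANAGER_URL = "https://alertmanager.example.org"  # hypothetical address


def silence_request(alertname, hours, author, comment, now=None) -> Request:
    """Prepare (but do not send) a request creating a silence for one alert name."""
    start = now or datetime.now(timezone.utc)
    payload = {
        # Alerts whose labels match all matchers are silenced.
        "matchers": [{"name": "alertname", "value": alertname, "isRegex": False}],
        "startsAt": start.isoformat(),
        "endsAt": (start + timedelta(hours=hours)).isoformat(),
        "createdBy": author,
        "comment": comment,
    }
    return Request(
        f"{ALERTMANAGER_URL}/api/v2/silences",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = silence_request("ServiceDown", 2, "admin", "Planned maintenance")
print(req.get_method(), req.full_url)
```

Passing the prepared request to ``urllib.request.urlopen`` would submit it; the silence expires on its own at ``endsAt``.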

How do I pre-emptively stop a check before I start working on an outage?
------------------------------------------------------------------------

In other words: I know that I'll cause an outage while working on <service>, how
do I turn off the checks for this service to avoid notifying admins while I'm
working on it?

In Prometheus and Alertmanager there are silences, where you can set a time window during which notifications for certain alerts
are suppressed. You are able to create and remove these through the REST API,