Added basic prometheus info to FAQ.
parent
4c04cb3539
commit
f96af0744a
1 changed files with 41 additions and 1 deletions
|
@ -12,15 +12,26 @@ How do I access zabbix when I'm a community member?
|
|||
|
||||
How do I access Prometheus?
|
||||
---------------------------
|
||||
Prometheus is running in the application monitoring namespace; standard routing applies,
|
||||
i.e.: https://prometheus-route-application-monitoring.app.os.stg.fedoraproject.org/graph
|
||||
|
||||
To access it, you need an account in the openshift cluster it is running in.
|
||||
|
||||
How do I access Prometheus when I'm a community member?
|
||||
-------------------------------------------------------
|
||||
You shouldn't access prometheus directly, unless you are maintaining an application in openshift.
|
||||
|
||||
Data from prometheus can be exported and viewed in Grafana or Zabbix, meaning we can
|
||||
give access to a more limited public view through dashboards in one of these.
|
||||
|
||||
Do you have a 5-minute guide on how to use prometheus?
|
||||
-------------------------------------------------------
|
||||
|
||||
In other words, do you have some how-tos/links I should read to understand/get
|
||||
started with prometheus?
|
||||
* quick introduction to the stack we are running: https://www.youtube.com/watch?v=-37OPXXhrTw
|
||||
* to get an idea of how to use it, look at the sample queries: https://prometheus.io/docs/prometheus/latest/querying/examples/
|
||||
* for instrumentation, look at the client libraries in https://github.com/prometheus/
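
As a first step after reading the sample queries, it can help to see how a PromQL expression is sent to the Prometheus HTTP API. The sketch below only builds the URL for an instant query (`/api/v1/query`); it does not send the request and does not handle authentication, which you need because access requires an openshift account. The helper name `instant_query_url` is hypothetical, and the base URL is taken from the access entry of this FAQ.

```python
from urllib.parse import urlencode

# Base URL copied from the FAQ entry on accessing Prometheus;
# adjust it for the cluster you actually use.
PROMETHEUS_URL = (
    "https://prometheus-route-application-monitoring.app.os.stg.fedoraproject.org"
)


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL of an instant query against the Prometheus HTTP API.

    Hypothetical helper for illustration: it only assembles the URL,
    it performs no request and no authentication.
    """
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


if __name__ == "__main__":
    # "up == 0" selects scrape targets that are currently down.
    print(instant_query_url(PROMETHEUS_URL, "up == 0"))
```

The same expression can be pasted into the `/graph` page, which is usually the quicker way to experiment.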
|
||||
|
||||
How do I get basic HW (disk, cpu, memory, network...) monitoring for a host?
|
||||
----------------------------------------------------------------------------
|
||||
|
@ -30,10 +41,22 @@ How do I monitor a list of services?
|
|||
- pagure.io and src.fp.o have two different lists of services to monitor
|
||||
they partly overlap but aren't exactly the same, how can I monitor them?
|
||||
|
||||
- With prometheus, metrics are usually exported through instrumentation,
|
||||
meaning that if, e.g., pagure was instrumented to expose a /metrics endpoint,
|
||||
you just need to make sure you are collecting them, either because they run in openshift,
|
||||
and you configured appropriate ServiceMonitor or PodMonitor objects,
|
||||
or, if they run outside of openshift, because they are listed in the additional scrape configuration of prometheus.
|
||||
Because collected metrics are labeled, it is simple to distinguish which belong where.
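
To make the role of labels concrete, here is a minimal sketch of what an instrumented service might return from its /metrics endpoint, written in the Prometheus text exposition format. The metric name `http_requests_total`, the label `site`, and the helper `render_counter` are all hypothetical; a real application would normally use one of the official client libraries instead of formatting the text by hand, and this sketch does not escape special characters in label values.

```python
def render_counter(name: str, help_text: str, samples: dict) -> str:
    """Render one counter in the Prometheus text exposition format.

    `samples` maps a tuple of (label_name, label_value) pairs to the
    current counter value. Label values are not escaped here, so this
    is only suitable for simple illustrative values.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Two hypothetical sites exporting the same metric with different labels.
    samples = {
        (("site", "pagure.io"),): 3,
        (("site", "src.fedoraproject.org"),): 5,
    }
    print(render_counter("http_requests_total", "Total HTTP requests served.", samples), end="")
```

A query such as `http_requests_total{site="pagure.io"}` would then select only the series belonging to one site.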
|
||||
|
||||
How do I get alerted for a service not running?
|
||||
-----------------------------------------------
|
||||
|
||||
- Prometheus supports configuring alerting rules; the resulting alerts are sent to Alertmanager, which can then notify through various services.
|
||||
You can learn about the configuration here: https://prometheus.io/docs/alerting/latest/configuration/#configuration-file
|
||||
The rules specifying when to alert are defined in prometheus itself: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
|
||||
You can specify them in CRDs in your project, in a similar fashion to ServiceMonitor objects.
|
||||
To use IRC, there needs to be a separate gateway installed in a sidecar: https://github.com/google/alertmanager-irc-relay
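
As an illustration of a rule defined through a CRD, the sketch below builds a `PrometheusRule` object (the `monitoring.coreos.com/v1` resource used by the Prometheus operator) that fires when a scrape target has been down for a while. The helper name, the alert name `ServiceDown`, the `severity` label, and the job and namespace values are assumptions for the example; depending on how the operator is configured, the object may also need specific metadata labels before it is picked up.

```python
import json


def service_down_rule(namespace: str, job: str, minutes: int = 5) -> dict:
    """Build a PrometheusRule manifest alerting when `job` stops responding.

    Hypothetical helper: names and labels are illustrative only.
    """
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": f"{job}-down", "namespace": namespace},
        "spec": {
            "groups": [
                {
                    "name": f"{job}.rules",
                    "rules": [
                        {
                            "alert": "ServiceDown",
                            # `up` is 0 when the last scrape of a target failed.
                            "expr": f'up{{job="{job}"}} == 0',
                            # Only fire after the condition held this long.
                            "for": f"{minutes}m",
                            "labels": {"severity": "critical"},
                            "annotations": {
                                "summary": f"{job} has been down for {minutes} minutes"
                            },
                        }
                    ],
                }
            ]
        },
    }


if __name__ == "__main__":
    # JSON is accepted wherever a YAML manifest is, so this can be saved to a file.
    print(json.dumps(service_down_rule("my-project", "pagure"), indent=2))
```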
|
||||
|
||||
How can I tune the alerts?
|
||||
--------------------------
|
||||
|
||||
|
@ -42,10 +65,14 @@ As in, who gets alerted? When? How?
|
|||
How do I ask for the service to be restarted <X> times before being alerted?
|
||||
----------------------------------------------------------------------------
|
||||
|
||||
- In prometheus you can't. It is assumed that you are using kubernetes, which would manage something like this for you.
|
||||
|
||||
How do I monitor rabbitmq queues?
|
||||
---------------------------------
|
||||
|
||||
- In prometheus, according to https://www.rabbitmq.com/prometheus.html#overview-prometheus
|
||||
you just need to make sure you are collecting the exported metrics.
|
||||
|
||||
How do we alert about checks not passing to people outside of our teams?
|
||||
------------------------------------------------------------------------
|
||||
-> the OSCI team is interested in having notifications/monitoring for the CI
|
||||
|
@ -54,7 +81,7 @@ How do we alert about checks not passing to people outside of our teams?
|
|||
How can we chain a prometheus instance to ours?
|
||||
-----------------------------------------------
|
||||
This allows us to consolidate in a single instance the monitoring coming from different
|
||||
instances. This can be done by configuring federation in the additional scrape configs: https://prometheus.io/docs/prometheus/latest/federation/
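
Federation works by having one Prometheus scrape the `/federate` endpoint of another, passing one or more `match[]` selectors that choose which series to pull. The sketch below only builds such a URL so the request shape is visible; the helper name and the example host are hypothetical, and in a real setup the same path and parameters would go into the additional scrape configuration rather than into code.

```python
from urllib.parse import urlencode


def federate_url(base_url: str, selectors: list) -> str:
    """Build a /federate URL selecting series from another Prometheus.

    Each selector becomes one `match[]` query parameter.
    Hypothetical helper for illustration; it performs no request.
    """
    params = [("match[]", selector) for selector in selectors]
    return f"{base_url.rstrip('/')}/federate?{urlencode(params)}"


if __name__ == "__main__":
    # Pull every series of a hypothetical "pagure" job from a remote instance.
    print(federate_url("https://prometheus.example.org", ['{job="pagure"}']))
```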
|
||||
|
||||
Can we monitor rabbitmq queues in prometheus?
|
||||
---------------------------------------------
|
||||
|
@ -64,13 +91,26 @@ How can I monitor the performances of my application?
|
|||
|
||||
Number of requests served? Number of 500 errors? Number of DB connections?
|
||||
|
||||
With prometheus, you need to instrument your application and configure prometheus to collect its metrics.
|
||||
|
||||
How do I ack an alert so it stops alerting?
|
||||
-------------------------------------------
|
||||
|
||||
With prometheus and Alertmanager, there is no way to just ACK an alert,
|
||||
it is assumed that a higher-level tool such as opsgenie would take care of actually
|
||||
interacting with regular human ops people.
|
||||
|
||||
For small enough teams, just silencing the alert in alertmanager could be enough.
|
||||
|
||||
There are also sidecars that add a few more features to the barebones alerting,
|
||||
like https://github.com/prymitive/kthxbye.
|
||||
|
||||
How do I pre-emptively stop a check before I start working on an outage?
|
||||
------------------------------------------------------------------------
|
||||
|
||||
In other words: I know that I'll cause an outage while working on <service>, how
|
||||
do I turn off the checks for this service to avoid notifying admins while I'm
|
||||
working on it?
|
||||
|
||||
In Prometheus and Alertmanager there are Silences, where you can set a time when certain alerts wouldn't
|
||||
be firing. You are able to create and remove these through the REST API,
|
||||
|
|
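
To give an idea of what creating a silence through the REST API involves, the sketch below builds the JSON body that Alertmanager's v2 API expects when it is POSTed to `/api/v2/silences`. It only constructs the payload and sends nothing; the helper name, the `job` label, and the author and comment values are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def build_silence(matchers: dict, hours: float, author: str,
                  comment: str, start: datetime = None) -> dict:
    """Build the body of a silence for Alertmanager's v2 API.

    `matchers` maps label names to the exact values to silence.
    Hypothetical helper: it only builds the payload, it sends nothing.
    """
    start = start or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False}
            for name, value in matchers.items()
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": author,
        "comment": comment,
    }


if __name__ == "__main__":
    # Silence all alerts of a hypothetical "pagure" job for two hours.
    print(build_silence({"job": "pagure"}, 2, "admin", "planned maintenance"))
```

The response to the POST contains a silence ID, which can later be used to expire the silence early with a DELETE request to `/api/v2/silence/<id>`.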