Frequently Asked Questions
==========================

Here is a list of questions and answers that should help you get started with monitoring
with Zabbix and Prometheus.

How do I access Zabbix?
-----------------------

1. First obtain a Kerberos ticket with ``kinit``:

   .. code-block::

      $ kinit myusername@FEDORAPROJECT.ORG
      Password for myusername@FEDORAPROJECT.ORG:

2. Log in to https://zabbix.stg.fedoraproject.org/zabbix.php?action=dashboard.view to see
   the dashboard.
3. If you need to be added to a special privilege group (to see metrics for specific
   systems), open a PR in <path-to-inventory> that adds your FAS ID to the list under
   that group, and ask a sysadmin of that group to +1 it.

How do I access Zabbix when I'm a community member?
---------------------------------------------------

1. First obtain a Kerberos ticket with ``kinit``:

   .. code-block::

      $ kinit myusername@FEDORAPROJECT.ORG
      Password for myusername@FEDORAPROJECT.ORG:

2. Log in to https://zabbix.stg.fedoraproject.org/zabbix.php?action=dashboard.view to see
   the guest/public dashboard.

How do I access Prometheus?
---------------------------

Prometheus runs in the application monitoring namespace and standard routing applies,
i.e.: https://prometheus-route-application-monitoring.app.os.stg.fedoraproject.org/graph

To access it, you need an account in the OpenShift cluster it is running in.

How do I access Prometheus when I'm a community member?
-------------------------------------------------------

You shouldn't access Prometheus directly unless you are maintaining an application in
OpenShift.

Data from Prometheus can be exported and viewed in Grafana or Zabbix, meaning we can
give access to a more limited public view through dashboards in one of these.

Do you have a 5-minute guide on how to use Prometheus?
-------------------------------------------------------

In other words, do you have some how-tos or links I should read to understand and get
started with Prometheus?

- For a quick introduction to the stack we are running, see
  https://www.youtube.com/watch?v=-37OPXXhrTw
- To get an idea of how to use it, look at the sample queries:
  https://prometheus.io/docs/prometheus/latest/querying/examples/
- For instrumentation, look at the libraries in https://github.com/prometheus/

How do I get basic HW (disk, CPU, memory, network...) monitoring for a host?
----------------------------------------------------------------------------

There are out-of-the-box templates for most basic monitoring requirements. They can be
seen in the web UI once you run the zabbix-agent-role against the node. If you want to
send any custom metrics, we recommend zabbix-sender. Zabbix sender is a command-line
utility that may be used to send performance data to the Zabbix server for processing.
Adding the zabbix sender command to crontab is one way of continuously sending data to
the server, where it can be processed (and viewed in your web UI). See
https://www.zabbix.com/documentation/current/manpages/zabbix_sender

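As a rough sketch of the zabbix-sender approach described above, the following Python
helper builds and runs a ``zabbix_sender`` command for a single value. The server name,
host name, and item key used in the usage note are placeholders rather than real Fedora
values, and the sketch assumes a matching trapper item already exists on the Zabbix
server.

.. code-block:: python

   import shutil
   import subprocess


   def build_sender_command(server, host, key, value):
       """Return the zabbix_sender argument list for one item value."""
       return [
           "zabbix_sender",
           "-z", server,       # Zabbix server (or proxy) to send to
           "-s", host,         # host name as registered in Zabbix
           "-k", key,          # item key (hypothetical trapper item)
           "-o", str(value),   # value to send
       ]


   def send_metric(server, host, key, value):
       """Run zabbix_sender; return True only if it exited successfully."""
       if shutil.which("zabbix_sender") is None:
           return False  # zabbix_sender is not installed on this machine
       result = subprocess.run(build_sender_command(server, host, key, value))
       return result.returncode == 0

A small script calling, for example,
``send_metric("zabbix.example.org", "myhost.example.org", "custom.queue.length", 42)``
could then be scheduled from crontab to send the value at regular intervals.
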
How do I monitor a list of services?
------------------------------------

- pagure.io and src.fp.o have two different lists of services to monitor.
  They partly overlap but aren't exactly the same. How can I monitor them?
- For Prometheus, metrics are usually exported through instrumentation. For example, if
  pagure is instrumented to export a /metrics endpoint, you just need to make sure you
  are collecting it: either it runs in OpenShift and you have configured the
  appropriate ServiceMonitor or PodMonitor objects, or, if it runs outside of OpenShift,
  it is listed in the additional scrape configuration of Prometheus. Because collected
  metrics are labeled, it is simple to distinguish which belong where.
- For Zabbix, if you want to send any custom metrics, we recommend zabbix-sender. Zabbix
  sender is a command-line utility that may be used to send performance data to the
  Zabbix server for processing. Adding the zabbix sender command to crontab is one way of
  continuously sending data to the server, where it can be processed (and viewed in your
  web UI). See https://www.zabbix.com/documentation/current/manpages/zabbix_sender

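To illustrate how labels keep the metrics of different services apart, here is a minimal
sketch that builds a PromQL selector and the matching instant-query URL for the
Prometheus HTTP API. ``up`` is the metric Prometheus records for every scrape target;
the job names and the base URL in the usage note are assumptions made for the example.

.. code-block:: python

   import json
   import urllib.parse
   import urllib.request


   def label_selector(metric, **labels):
       """Build a PromQL selector such as up{job="pagure"}."""
       matchers = ",".join(
           f'{name}="{value}"' for name, value in sorted(labels.items())
       )
       return f"{metric}{{{matchers}}}" if matchers else metric


   def query_url(base_url, promql):
       """Return the instant-query URL of the Prometheus HTTP API."""
       query = urllib.parse.urlencode({"query": promql})
       return base_url.rstrip("/") + "/api/v1/query?" + query


   def run_query(base_url, promql):
       """Fetch the list of matching series; needs network access to Prometheus."""
       with urllib.request.urlopen(query_url(base_url, promql)) as response:
           return json.load(response)["data"]["result"]

For instance, ``label_selector("up", job="pagure")`` and
``label_selector("up", job="src")`` would select the scrape status of two separately
labeled services, assuming those are the job names configured for them.
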
How do I get alerted for a service not running?
-----------------------------------------------

- Prometheus supports configuring rules for Alertmanager, which can then notify through
  various services. You can learn about the configuration here:
  https://prometheus.io/docs/alerting/latest/configuration/#configuration-file

  The rules specifying when to alert are defined in Prometheus itself:
  https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

  You can specify them in CRDs in your project, in a similar fashion to ServiceMonitor.

  To use IRC, a separate gateway needs to be installed in a sidecar:
  https://github.com/google/alertmanager-irc-relay

- In Zabbix, you can set custom alerting for yourself (or for groups) through the web UI.
  Follow https://www.zabbix.com/documentation/5.0/manual/config/triggers/trigger

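As a hedged sketch of the CRD approach mentioned above, the following Python snippet
generates a ``PrometheusRule`` manifest that alerts when the targets of one job stop
responding. The rule name, job name, alert name, and severity label are all made up
for the example, and the Prometheus instance in your cluster may only pick up rules
that carry specific labels.

.. code-block:: python

   import json


   def service_down_rule(name, job, minutes=5):
       """Return a PrometheusRule manifest alerting when a job's targets are down."""
       return {
           "apiVersion": "monitoring.coreos.com/v1",
           "kind": "PrometheusRule",
           "metadata": {"name": name},
           "spec": {
               "groups": [
                   {
                       "name": f"{name}.rules",
                       "rules": [
                           {
                               "alert": "ServiceDown",  # hypothetical alert name
                               "expr": f'up{{job="{job}"}} == 0',
                               "for": f"{minutes}m",
                               "labels": {"severity": "critical"},
                               "annotations": {
                                   "summary": f"{job} target is not being scraped"
                               },
                           }
                       ],
                   }
               ]
           },
       }


   if __name__ == "__main__":
       # Print the manifest as JSON, which OpenShift accepts alongside YAML.
       print(json.dumps(service_down_rule("myapp-alerts", "myapp"), indent=2))

The ``for`` field delays the alert until the condition has held for the given duration,
so a brief scrape failure does not notify anyone.
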
How can I tune the alerts?
--------------------------

As in, who gets alerted? When? How?

- In Zabbix, we will have different groups with different configurations. When you are
  added to a group, you will receive the notifications relevant to that group (you can
  change what alerting you want for the group once you have access to it). You can
  filter the alerting down even further for yourself in the web UI. Follow this tutorial:
  https://www.zabbix.com/documentation/5.0/manual/config/triggers/trigger

  If you want to tweak how you receive your alerts, follow
  https://www.zabbix.com/documentation/5.0/manual/config/notifications/media

How do I ask for the service to be restarted <X> times before being alerted?
----------------------------------------------------------------------------

- In Prometheus, you can't. It is assumed that you are using Kubernetes, which would
  manage something like this for you.
- In Zabbix, <TODO>: you can create events based on triggers and there are event
  correlation options, but we have yet to figure out this customization.

How do I monitor RabbitMQ queues?
---------------------------------

- In Prometheus, according to
  https://www.rabbitmq.com/prometheus.html#overview-prometheus you just need to make
  sure you are collecting the exported metrics.
- In Zabbix, according to https://www.zabbix.com/integrations/rabbitmq, there is a way
  to push data to Zabbix that can be processed on the server side.

How do we alert about checks not passing to people outside of our teams?
------------------------------------------------------------------------

-> The OSCI team is interested in having notifications/monitoring for the CI
queues in RabbitMQ.

How can we chain a Prometheus instance to ours?
-----------------------------------------------

This allows you to consolidate monitoring data coming from different instances in a
single instance. It can be done by configuring federation in the additional scrape
configs: https://prometheus.io/docs/prometheus/latest/federation/

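Under the hood, a federating Prometheus scrapes the ``/federate`` endpoint of the other
instance and passes one or more ``match[]`` selectors. The sketch below only shows what
such a request looks like, which can be handy for checking what a source instance would
hand over; the base URL and the selector in the usage note are placeholders.

.. code-block:: python

   import urllib.parse
   import urllib.request


   def federate_url(base_url, selectors):
       """Return the /federate URL selecting the given series."""
       params = [("match[]", selector) for selector in selectors]
       return base_url.rstrip("/") + "/federate?" + urllib.parse.urlencode(params)


   def fetch_federated(base_url, selectors):
       """Download matching series in the text exposition format (needs network)."""
       with urllib.request.urlopen(federate_url(base_url, selectors)) as response:
           return response.read().decode("utf-8")

For example, ``federate_url("https://prom.example.org", ['{job="myapp"}'])`` produces
the URL that would return the current value of every series labeled ``job="myapp"``.
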
How can I monitor the performance of my application?
-----------------------------------------------------

Number of requests served? Number of 500 errors? Number of DB connections?

With Prometheus, you need to instrument your application and configure Prometheus to
collect its metrics.

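To give a feel for what instrumentation produces, here is a deliberately minimal,
standard-library-only sketch of a request counter rendered in the Prometheus text
exposition format. A real application would normally use one of the official client
libraries instead; the metric name ``http_requests_total`` and the ``code`` label are
assumptions chosen for the example.

.. code-block:: python

   import threading
   from collections import Counter


   class RequestMetrics:
       """Thread-safe request counter rendered in the Prometheus text format."""

       def __init__(self):
           self._lock = threading.Lock()
           self._requests = Counter()

       def observe(self, status_code):
           """Record one served request with the given HTTP status code."""
           with self._lock:
               self._requests[str(status_code)] += 1

       def render(self):
           """Return the text an application would serve on /metrics."""
           lines = [
               "# HELP http_requests_total Requests served, by status code.",
               "# TYPE http_requests_total counter",
           ]
           with self._lock:
               for code, count in sorted(self._requests.items()):
                   lines.append(f'http_requests_total{{code="{code}"}} {count}')
           return "\n".join(lines) + "\n"

The application would call ``observe()`` once per request and return ``render()`` from
its ``/metrics`` endpoint, which Prometheus then scrapes.
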
How do I ack an alert so it stops alerting?
-------------------------------------------

With Prometheus and Alertmanager, there is no way to simply ACK an alert. It is assumed
that something more high-level, like Opsgenie, would take care of actually interacting
with human ops people.

For small enough teams, just silencing the alert in Alertmanager could be enough.

There are sidecars that add a few more features to the barebones
alerting, like https://github.com/prymitive/kthxbye.

- In Zabbix, you can acknowledge the problem and it will stop alerting. Follow
  https://www.zabbix.com/documentation/current/manual/acknowledges

How do I pre-emptively stop a check before I start working on an outage?
------------------------------------------------------------------------

In other words: I know that I'll cause an outage while working on <service>. How do I
turn off the checks for this service to avoid notifying admins while I'm working on it?

In Prometheus and Alertmanager there are silences, which let you set a time window
during which certain alerts will not send notifications. You can create and remove them
through the REST API.

- In Zabbix, the simplest way is to stop the Zabbix agent (or custom sender) on the
  system and ack on the server side that it is not reachable.

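As a minimal sketch of creating a silence through the Alertmanager v2 API, the snippet
below builds the silence payload and posts it. The Alertmanager URL, the label being
matched, and the author and comment strings in the usage note are placeholders, and
any authentication your Alertmanager route requires is left out.

.. code-block:: python

   import json
   import urllib.request
   from datetime import datetime, timedelta, timezone


   def build_silence(matchers, hours, created_by, comment, now=None):
       """Return a silence payload covering the next `hours` hours."""
       start = now or datetime.now(timezone.utc)
       end = start + timedelta(hours=hours)
       return {
           "matchers": [
               {"name": name, "value": value, "isRegex": False}
               for name, value in sorted(matchers.items())
           ],
           "startsAt": start.isoformat(),
           "endsAt": end.isoformat(),
           "createdBy": created_by,
           "comment": comment,
       }


   def create_silence(alertmanager_url, silence):
       """POST the silence and return its ID (needs network access)."""
       request = urllib.request.Request(
           alertmanager_url.rstrip("/") + "/api/v2/silences",
           data=json.dumps(silence).encode("utf-8"),
           headers={"Content-Type": "application/json"},
           method="POST",
       )
       with urllib.request.urlopen(request) as response:
           return json.load(response)["silenceID"]

For example, ``build_silence({"job": "myapp"}, 2, "myusername", "planned maintenance")``
describes a two-hour silence for every alert labeled ``job="myapp"``. The returned ID
can later be used to expire the silence early with a ``DELETE`` request to
``/api/v2/silence/<id>``.
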