Signed-off-by: siddharthvipul <siddharthvipul1@gmail.com>
siddharthvipul 2021-04-14 15:47:14 +05:30
parent 72503f7dee
commit 7efca65b49

@ -6,13 +6,27 @@ monitoring with zabbix and prometheus.
How do I access zabbix?
-----------------------
1. First, obtain a Kerberos ticket with kinit:
```
$ kinit myusername@FEDORAPROJECT.ORG
Password for myusername@FEDORAPROJECT.ORG:
```
2. Log in to `https://zabbix.stg.fedoraproject.org/zabbix.php?action=dashboard.view` to see the dashboard
3. If you need to be added to a special privilege group (to see metrics for specific systems), open a PR in <path-to-inventory> with your FAS ID in the list under that group and ask a sysadmin of that group to +1.
How do I access zabbix when I'm a community member?
---------------------------------------------------
1. First, obtain a Kerberos ticket with kinit:
```
$ kinit myusername@FEDORAPROJECT.ORG
Password for myusername@FEDORAPROJECT.ORG:
```
2. Log in to `https://zabbix.stg.fedoraproject.org/zabbix.php?action=dashboard.view` to see the guest/public dashboard
How do I access Prometheus?
---------------------------
Prometheus is running in the application monitoring namespace and standard routing applies,
i.e.: https://prometheus-route-application-monitoring.app.os.stg.fedoraproject.org/graph
To access it, you need an account in the OpenShift cluster it is running in.
@ -31,10 +45,15 @@ In other words, do you have some how-tos/links I should read to understand/get
started with prometheus?
* quick introduction to the stack we are running: https://www.youtube.com/watch?v=-37OPXXhrTw
* to get an idea of how to use it, look at the sample queries: https://prometheus.io/docs/prometheus/latest/querying/examples/
* for instrumentation, look at the libraries in https://github.com/prometheus/
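To make the sample queries a bit more concrete, here is a minimal sketch of running one instant query through the Prometheus HTTP API (`/api/v1/query`) and reading the result. The host name `prom.example.org` is a placeholder, not our real route, and the helper names are made up for this example. The query `up == 0` lists scrape targets that are currently down. Authentication is left out, since how you pass credentials depends on the OpenShift setup.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    query = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def parse_instant_vector(response_text):
    """Return a list of (labels, value) pairs from an instant-vector response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("query failed: " + str(payload.get("error", "unknown error")))
    return [
        (sample["metric"], float(sample["value"][1]))
        for sample in payload["data"]["result"]
    ]


if __name__ == "__main__":
    # Hypothetical base URL; replace it with the route of the instance you use.
    print(instant_query_url("https://prom.example.org", "up == 0"))
```

Fetching the printed URL (for example with `curl` plus your credentials) returns JSON that `parse_instant_vector` can turn into label/value pairs.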
How do I get basic HW (disk, cpu, memory, network...) monitoring for a host?
----------------------------------------------------------------------------
There are out-of-the-box templates for most basic monitoring requirements; they
can be seen in the web UI once you run the zabbix-agent-role against the node.
If you want to send any custom metrics, we recommend zabbix-sender. Zabbix sender is a command-line utility that can be used to send performance data to the Zabbix server for processing.
Adding the zabbix sender command to crontab is one way of continuously sending
data to the server, where it can be processed (and viewed in your web UI). See https://www.zabbix.com/documentation/current/manpages/zabbix_sender
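As an illustration of the zabbix-sender approach, the following sketch wraps the `zabbix_sender` command in a small Python script that could be run from cron. It is only a sketch under assumptions: the server name, host name and item key are hypothetical placeholders, and the item is assumed to be configured in Zabbix as a "Zabbix trapper" item so that it accepts pushed values.

```python
import shlex
import subprocess


def build_sender_command(server, host, key, value, port=10051):
    """Return the zabbix_sender argument list for one key/value pair."""
    return [
        "zabbix_sender",
        "-z", server,      # Zabbix server (or proxy) to send to
        "-p", str(port),   # trapper port, 10051 is the Zabbix default
        "-s", host,        # host name exactly as registered in Zabbix
        "-k", key,         # key of the trapper item
        "-o", str(value),  # value to submit
    ]


def send_metric(server, host, key, value):
    """Run zabbix_sender and return True when it exits with status 0."""
    cmd = build_sender_command(server, host, key, value)
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0


if __name__ == "__main__":
    # Hypothetical values; replace them with your own server, host and item key.
    cmd = build_sender_command(
        "zabbix.example.org", "myhost.example.org", "custom.queue.length", 42
    )
    print(shlex.join(cmd))
```

A crontab entry such as `*/5 * * * * /usr/local/bin/send_metric.py` (the path is hypothetical) would then push a value every five minutes.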
How do I monitor a list of services?
------------------------------------
@ -47,6 +66,7 @@ How do I monitor a list of services?
and you configured appropriate ServiceMonitor or PodMonitor objects,
or if outside of openshift, it is in additional scrape configuration of prometheus.
Because collected metrics are labeled, it is simple to distinguish which belong where.
- For Zabbix, if you want to send any custom metrics, we recommend zabbix-sender. Zabbix sender is a command-line utility that can be used to send performance data to the Zabbix server for processing. Adding the zabbix sender command to crontab is one way of continuously sending data to the server, where it can be processed (and viewed in your web UI). See https://www.zabbix.com/documentation/current/manpages/zabbix_sender
How do I get alerted for a service not running?
-----------------------------------------------
@ -57,15 +77,27 @@ How do I get alerted for a service not running?
You can specify them in CRDs in your project in a similar fashion to ServiceMonitor.
To use IRC, there needs to be a separate gateway installed in a sidecar: https://github.com/google/alertmanager-irc-relay
- In Zabbix, you can set custom alerting for yourself (or for groups) through
the web UI. Follow https://www.zabbix.com/documentation/5.0/manual/config/triggers/trigger
How can I tune the alerts?
--------------------------
As in, who gets alerted? When? How?
- In Zabbix, we will have different groups with different configurations. When
you are added to a group, you will receive the notifications relevant to that
group (once you have access, you can change what alerting you want for the
group). You can filter the alerting down even further for yourself in the
web UI. Follow this tutorial: https://www.zabbix.com/documentation/5.0/manual/config/triggers/trigger
If you want to tweak how you receive your alerts, follow https://www.zabbix.com/documentation/5.0/manual/config/notifications/media
How do I ask for the service to be restarted <X> times before being alerted?
----------------------------------------------------------------------------
- In Prometheus, you can't. It is assumed you are using Kubernetes, which would manage something like this for you.
- In Zabbix, <TODO>: you can generate events based on triggers and there are event
correlation options, but we have yet to figure out this customization.
How do I monitor rabbitmq queues?
---------------------------------
@ -73,12 +105,15 @@ How do I monitor rabbitmq queues?
- In prometheus, according to https://www.rabbitmq.com/prometheus.html#overview-prometheus
you just need to make sure you are collecting the exported metrics.
- In Zabbix, according to https://www.zabbix.com/integrations/rabbitmq, there
is a way to push data to Zabbix that can then be processed on the server side.
How do we alert about checks not passing to people outside of our teams?
------------------------------------------------------------------------
-> the OSCI team is interested in having notifications/monitoring for the CI
queues in rabbitmq
How can we chain a prometheus instance to ours?
-----------------------------------------------
This allows consolidating, in a single instance, monitoring data coming from different
instances. This can be done by configuring federation in the additional scrape configs: https://prometheus.io/docs/prometheus/latest/federation/
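Federation works by having one Prometheus scrape the `/federate` endpoint of another, passing one or more `match[]` selectors for the series to pull; in the scrape job this means setting the metrics path to `/federate` and supplying `match[]` as parameters. Before adding such a job, it can help to look at what the other instance would hand over. The sketch below only builds that URL. The host name and the selectors are hypothetical placeholders.

```python
from urllib.parse import urlencode


def federate_url(base_url, selectors):
    """Build the /federate URL that a federating Prometheus would scrape."""
    # One match[] parameter per series selector.
    query = urlencode([("match[]", selector) for selector in selectors])
    return base_url.rstrip("/") + "/federate?" + query


if __name__ == "__main__":
    # Hypothetical instance and selectors; adjust them to the series you want to pull.
    print(federate_url("https://prometheus.example.org", ['{job="node"}', "up"]))
```

Opening the printed URL against the source instance shows, in the plain-text exposition format, the series that the federating instance would ingest.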
@ -105,6 +140,8 @@ For small enough teams, just using silence on alert in alertmanager could be eno
There are sidecars, like https://github.com/prymitive/kthxbye, that add a few more features to the barebones alerting.
- In Zabbix, you can acknowledge the problem and it will stop alerting. Follow https://www.zabbix.com/documentation/current/manual/acknowledges
How do I pre-emptively stop a check before I start working on an outage?
------------------------------------------------------------------------
@ -114,3 +151,6 @@ working on it?
In Prometheus and Alertmanager there are Silences, where you can set a time period during which certain alerts
won't fire. You can create and remove these through the REST API.
- In Zabbix, the simplest way is to stop the Zabbix agent (or custom sender) on the system and acknowledge on
the server side that it is not reachable.
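For the Alertmanager side, a silence can be created by sending a JSON document to the `/api/v2/silences` endpoint with a POST request. The sketch below builds such a document for a fixed number of hours. It is a minimal sketch under assumptions: the label names, the creator and the Alertmanager URL are hypothetical, and any authentication in front of Alertmanager is not handled.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_silence(matchers, hours, created_by, comment, now=None):
    """Build the JSON body for POST /api/v2/silences on Alertmanager."""
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        # Every matcher must match for an alert to be silenced.
        "matchers": [
            {"name": name, "value": value, "isRegex": False}
            for name, value in matchers.items()
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


def post_silence(alertmanager_url, silence):
    """Send the silence to Alertmanager and return the decoded JSON response."""
    request = urllib.request.Request(
        alertmanager_url.rstrip("/") + "/api/v2/silences",
        data=json.dumps(silence).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Hypothetical alert label and creator; nothing is sent, the body is only printed.
    body = build_silence({"alertname": "ServiceDown"}, 2, "myusername", "planned outage work")
    print(json.dumps(body, indent=2))
```

Calling `post_silence` with the URL of your Alertmanager would submit the silence; it expires on its own at `endsAt`, so it does not have to be removed by hand after the work is done.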