Added more prometheus documentation

This commit is contained in:
Adam Saleh 2021-04-14 16:26:21 +02:00
parent ed508e7b9b
commit 646a390c9a
2 changed files with 124 additions and 1 deletion

@@ -10,6 +10,7 @@ This way, the metrics will be scraped into the configured prometheus and correctly
As an example, let's look at the ServiceMonitor for bodhi:
::
  apiVersion: monitoring.coreos.com/v1
  kind: ServiceMonitor
  metadata:
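For orientation, a complete ServiceMonitor could look roughly like the sketch below; the object name, selector labels and port name are illustrative assumptions, not copied from the actual bodhi manifest:
::
  apiVersion: monitoring.coreos.com/v1
  kind: ServiceMonitor
  metadata:
    labels:
      monitoring-key: cpe       # assumed label, in line with the PrometheusRule examples below
    name: bodhi-servicemonitor  # illustrative name
  spec:
    endpoints:
    - path: /metrics            # the path the instrumented service exposes metrics on
      port: web                 # must match a named port of the bodhi Service (assumption)
    selector:
      matchLabels:
        service: bodhi          # assumed label on the bodhi Service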
@@ -30,6 +31,7 @@ machinery at our disposal, see `Matcher <https://v1-17.docs.kubernetes.io/docs/r
To manage alerting, you can create an alerting rule:
::
  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
@@ -79,3 +81,32 @@ As part of the proof of concept we have instrumented the Bodhi application
to collect data through the prometheus_client python library:
https://github.com/fedora-infra/bodhi/pull/4079
Notes on alerting
-----------------
To be notified of alerts, you need to be subscribed to receivers that
have been configured in alertmanager.
The configuration of the rules you want to alert on can be done in the namespace of your application.
For example:
::
  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    labels:
      monitoring-key: cpe
    name: prometheus-application-monitoring-rules
  spec:
    groups:
    - name: general.rules
      rules:
      - alert: AlertBodhi500Status
        annotations:
          summary: Alerting on too many server errors
        expr: (100*sum(rate(pyramid_request_count{namespace="bodhi", path_info_pattern=~".*[^healthz]", status="500"}[20m]))/sum(rate(pyramid_request_count{namespace="bodhi", path_info_pattern=~".*[^healthz]"}[20m])))>1
        labels:
          severity: high
would alert if more than 1% of responses have a 500 status code.

@@ -112,4 +112,96 @@ https://github.com/timescale/promscale
  remote_write:
  - url: "http://promscale:9201/write"
  remote_read:
  - url: "http://promscale:9201/read"
Notes on auxiliary services
----------------------------
As prometheus is primarily targeted at collecting metrics from
services that have been instrumented to expose them, if your service
is not instrumented, or it is not a service at all,
i.e. a batch-job, you need an adapter to help you with the metrics collection.
There are two services that help with this.

* `blackbox exporter <https://github.com/prometheus/blackbox_exporter>`_ to monitor services that have not been instrumented, by querying their public API
* `push gateway <https://prometheus.io/docs/practices/pushing/#should-i-be-using-the-pushgateway>`_ that helps collect information from batch-jobs
Maintaining the push-gateway can be relegated to the application developer,
as it is lightweight, and by collecting metrics from the namespace it is running in,
the data will be correctly labeled.
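As a rough sketch, a batch-job could push a completion metric to such a push-gateway at the end of its run; everything here (apiVersion, image, the pushgateway service name and port 9091, the job and metric names) is an illustrative assumption:
::
  apiVersion: batch/v1beta1          # or batch/v1 on newer clusters
  kind: CronJob
  metadata:
    name: nightly-batch-job          # illustrative name
  spec:
    schedule: "0 3 * * *"
    jobTemplate:
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: batch
              image: registry.fedoraproject.org/fedora:34   # any image with curl will do
              command:
              - /bin/sh
              - -c
              - |
                # ... the actual batch work would run here ...
                # then push a timestamp so prometheus can alert if the job stops succeeding
                cat <<EOF | curl --data-binary @- http://pushgateway:9091/metrics/job/nightly-batch-job
                batch_job_last_success_timestamp_seconds $(date +%s)
                EOF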
With the blackbox exporter, it can be beneficial to have it running as a prometheus side-car,
in a similar fashion as we configure oauth-proxy, adding this to the containers section
of the prometheus definition:
::
  - name: blackbox-exporter
    volumeMounts:
    - name: configmap-blackbox
      mountPath: /etc/blackbox-config
    - mountPath: /etc/tls/private
      name: secret-prometheus-k8s-tls
    image: quay.io/prometheus/blackbox-exporter:4.4
    args:
    - '--config.file=/etc/blackbox-config/blackbox.yml'
    ports:
    - containerPort: 9115
      name: blackbox
We can then instruct what is to be monitored through the configmap-blackbox; you can find `relevant examples <https://github.com/prometheus/blackbox_exporter/blob/master/example.yml>`_ in the project repo.
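As a minimal sketch, the blackbox.yml in that configmap could define a single http_2xx module, mirroring the upstream example; the timeout value is an assumption:
::
  modules:
    http_2xx:
      prober: http
      timeout: 5s
      http:
        preferred_ip_protocol: ip4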
Because the blackbox exporter is in the same pod, we need to use the additional-scrape-config to add it in.
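A sketch of such an additional-scrape-config entry, assuming we probe a public bodhi endpoint through the http_2xx module above; the job name and target URL are illustrative:
::
  - job_name: blackbox-bodhi
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
    - targets:
      - https://bodhi.fedoraproject.org
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target      # pass the original target as the ?target= parameter
    - source_labels: [__param_target]
      target_label: instance            # keep the probed URL as the instance label
    - target_label: __address__
      replacement: 127.0.0.1:9115       # scrape the side-car listening in the same pod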
Notes on alerting
-----------------
Prometheus, as is, can have rules configured that trigger alerts once
a specific query evaluates to true. The definition of the rule is explained in the companion docs
for prometheus for developers and can be created in the namespace of the running application.
Here, we need to focus on what happens with an alert after prometheus decides it should fire,
based on a rule.
In the prometheus CRD definition, there is a section about the alertmanager that is supposed to
manage the forwarding of these alerts.
::
  alerting:
    alertmanagers:
    - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      name: alertmanager-service
      namespace: application-monitoring
      port: web
      scheme: https
      tlsConfig:
        caFile: /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
        serverName: alertmanager-service.application-monitoring.svc
We already have alertmanager running and configured by the alertmanager-operator.
Alertmanager itself is really simplistic, with a simple UI and API that allow for silencing an
alert for a given amount of time.
It is expected that the actual user-interaction happens elsewhere,
either through services like OpsGenie, or through e.g. an `integration with zabbix <https://devopy.io/setting-up-zabbix-alertmanager-integration/>`_.
More of a build-it-yourself solution is to use e.g. https://karma-dashboard.io/,
but we haven't tried any of these as part of our POC.
To be able to be notified of the alert, you need to have the `correct receiver configuration <https://prometheus.io/docs/alerting/latest/configuration/#email_config>`_ in the alertmanager's secret:
::
  global:
    resolve_timeout: 5m
  route:
    group_by: ['job']
    group_wait: 10s
    group_interval: 10s
    repeat_interval: 30m
    receiver: 'email'
  receivers:
  - name: 'email'
    email_configs:
    - to: 'asaleh@redhat.com'
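As a sketch, the route section could also split notifications on the severity label used in the alerting rules; the second receiver and its address are purely illustrative:
::
  route:
    group_by: ['job']
    receiver: 'email'
    routes:
    - match:
        severity: high                  # matches the label set in the PrometheusRule example
      receiver: 'email-admins'
  receivers:
  - name: 'email'
    email_configs:
    - to: 'asaleh@redhat.com'
  - name: 'email-admins'                # illustrative second receiver
    email_configs:
    - to: 'admins@example.com'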