Added more prometheus documentation

This commit is contained in:
Adam Saleh 2021-04-14 16:26:21 +02:00
parent ed508e7b9b
commit 646a390c9a
2 changed files with 124 additions and 1 deletion

@@ -10,6 +10,7 @@ This way, the metrics will be scraped into the configured prometheus and correctly
As an example, let's look at the ServiceMonitor for bodhi:
::
  apiVersion: monitoring.coreos.com/v1
  kind: ServiceMonitor
  metadata:
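For orientation, a complete ServiceMonitor could look roughly like the sketch below; the object name, selector labels and port name are illustrative assumptions, not copied from the actual bodhi manifest:
::
  apiVersion: monitoring.coreos.com/v1
  kind: ServiceMonitor
  metadata:
    labels:
      monitoring-key: cpe       # assumed label, in line with the PrometheusRule examples below
    name: bodhi-servicemonitor  # illustrative name
  spec:
    endpoints:
    - path: /metrics            # the path the instrumented service exposes metrics on
      port: web                 # must match a named port of the bodhi Service (assumption)
    selector:
      matchLabels:
        service: bodhi          # assumed label on the bodhi Service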
@@ -30,6 +31,7 @@ machinery at our disposal, see `Matcher <https://v1-17.docs.kubernetes.io/docs/r
To manage alerting, you can create an alerting rule:
::
  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
@@ -79,3 +81,32 @@ As part of the proof of concept we have instrumented the Bodhi application
to collect data through the prometheus_client python library:
https://github.com/fedora-infra/bodhi/pull/4079
Notes on alerting
-----------------
To be notified of alerts, you need to be subscribed to receivers that
have been configured in alertmanager.
The configuration of the rules you want to alert on can be done in the namespace of your application.
For example:
::
  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    labels:
      monitoring-key: cpe
    name: prometheus-application-monitoring-rules
  spec:
    groups:
    - name: general.rules
      rules:
      - alert: AlertBodhi500Status
        annotations:
          summary: Alerting on too many server errors
        expr: (100*sum(rate(pyramid_request_count{namespace="bodhi", path_info_pattern=~".*[^healthz]", status="500"}[20m]))/sum(rate(pyramid_request_count{namespace="bodhi", path_info_pattern=~".*[^healthz]"}[20m])))>1
        labels:
          severity: high
would alert if more than 1% of responses have a 500 status code.

@@ -112,4 +112,96 @@ https://github.com/timescale/promscale
  remote_write:
  - url: "http://promscale:9201/write"
  remote_read:
  - url: "http://promscale:9201/read"
Notes on auxiliary services
----------------------------
As prometheus is primarily targeted at collecting metrics from
services that have been instrumented to expose them, if your service
is not instrumented, or it is not a service at all,
i.e. a batch-job, you need an adapter to help you with the metrics collection.
There are two services that help with this.

* `blackbox exporter <https://github.com/prometheus/blackbox_exporter>`_ to monitor services that have not been instrumented, by querying their public API
* `push gateway <https://prometheus.io/docs/practices/pushing/#should-i-be-using-the-pushgateway>`_ that helps collect information from batch-jobs
Maintaining the push-gateway can be relegated to the application developer,
as it is lightweight, and by collecting metrics from the namespace it is running in,
the data will be correctly labeled.
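As a rough sketch, a batch-job could push a completion metric to such a push-gateway at the end of its run; everything here (apiVersion, image, the pushgateway service name and port 9091, the job and metric names) is an illustrative assumption:
::
  apiVersion: batch/v1beta1          # or batch/v1 on newer clusters
  kind: CronJob
  metadata:
    name: nightly-batch-job          # illustrative name
  spec:
    schedule: "0 3 * * *"
    jobTemplate:
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: batch
              image: registry.fedoraproject.org/fedora:34   # any image with curl will do
              command:
              - /bin/sh
              - -c
              - |
                # ... the actual batch work would run here ...
                # then push a timestamp so prometheus can alert if the job stops succeeding
                cat <<EOF | curl --data-binary @- http://pushgateway:9091/metrics/job/nightly-batch-job
                batch_job_last_success_timestamp_seconds $(date +%s)
                EOF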
With the blackbox exporter, it can be beneficial to have it running as a prometheus side-car,
in a similar fashion as we configure oauth-proxy, adding this to the containers section
of the prometheus definition:
::
  - name: blackbox-exporter
    volumeMounts:
    - name: configmap-blackbox
      mountPath: /etc/blackbox-config
    - mountPath: /etc/tls/private
      name: secret-prometheus-k8s-tls
    image: quay.io/prometheus/blackbox-exporter:4.4
    args:
    - '--config.file=/etc/blackbox-config/blackbox.yml'
    ports:
    - containerPort: 9115
      name: blackbox
We can then instruct what is to be monitored through the configmap-blackbox; you can find `relevant examples <https://github.com/prometheus/blackbox_exporter/blob/master/example.yml>`_ in the project repo.
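As a minimal sketch, the blackbox.yml in that configmap could define a single http_2xx module, mirroring the upstream example; the timeout value is an assumption:
::
  modules:
    http_2xx:
      prober: http
      timeout: 5s
      http:
        preferred_ip_protocol: ip4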
Because the blackbox exporter is in the same pod, we need to use the additional-scrape-config to add it in.
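A sketch of such an additional-scrape-config entry, assuming we probe a public bodhi endpoint through the http_2xx module above; the job name and target URL are illustrative:
::
  - job_name: blackbox-bodhi
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
    - targets:
      - https://bodhi.fedoraproject.org
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target      # pass the original target as the ?target= parameter
    - source_labels: [__param_target]
      target_label: instance            # keep the probed URL as the instance label
    - target_label: __address__
      replacement: 127.0.0.1:9115       # scrape the side-car listening in the same pod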
Notes on alerting
-----------------
Prometheus, as is, can have rules configured that trigger alerts once
a specific query evaluates to true. The definition of the rule is explained in the companion docs
for prometheus for developers and can be created in the namespace of the running application.
Here, we need to focus on what happens with an alert after prometheus decides it should fire,
based on a rule.
In the prometheus CRD definition, there is a section about the alertmanager that is supposed to
manage the forwarding of these alerts.
::
  alerting:
    alertmanagers:
    - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      name: alertmanager-service
      namespace: application-monitoring
      port: web
      scheme: https
      tlsConfig:
        caFile: /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
        serverName: alertmanager-service.application-monitoring.svc
We already have alertmanager running and configured by the alertmanager-operator.
Alertmanager itself is really simplistic, with a simple UI and API that allow for silencing an
alert for a given amount of time.
It is expected that the actual user-interaction happens elsewhere,
either through services like OpsGenie, or through e.g. an `integration with zabbix <https://devopy.io/setting-up-zabbix-alertmanager-integration/>`_.
More of a build-it-yourself solution is to use e.g. https://karma-dashboard.io/,
but we haven't tried any of these as part of our POC.
To be able to be notified of the alert, you need to have the `correct receiver configuration <https://prometheus.io/docs/alerting/latest/configuration/#email_config>`_ in the alertmanager's secret:
::
  global:
    resolve_timeout: 5m
  route:
    group_by: ['job']
    group_wait: 10s
    group_interval: 10s
    repeat_interval: 30m
    receiver: 'email'
  receivers:
  - name: 'email'
    email_configs:
    - to: 'asaleh@redhat.com'
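As a sketch, the route section could also split notifications on the severity label used in the alerting rules; the second receiver and its address are purely illustrative:
::
  route:
    group_by: ['job']
    receiver: 'email'
    routes:
    - match:
        severity: high                  # matches the label set in the PrometheusRule example
      receiver: 'email-admins'
  receivers:
  - name: 'email'
    email_configs:
    - to: 'asaleh@redhat.com'
  - name: 'email-admins'                # illustrative second receiver
    email_configs:
    - to: 'admins@example.com'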