Added more prometheus documentation
This commit is contained in:
parent ed508e7b9b
commit 646a390c9a
2 changed files with 124 additions and 1 deletion
@@ -10,6 +10,7 @@ This way, the metrics will be scraped into the configured prometheus and correctly
As an example, let's look at the ServiceMonitor for bodhi:

::

   apiVersion: monitoring.coreos.com/v1
   kind: ServiceMonitor
   metadata:

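For orientation, a complete ServiceMonitor also declares a selector and endpoints; the following is only a minimal sketch, where the port name, selector labels, and scrape interval are illustrative assumptions rather than bodhi's actual values:

::

   apiVersion: monitoring.coreos.com/v1
   kind: ServiceMonitor
   metadata:
     labels:
       monitoring-key: cpe           # same monitoring label used elsewhere in these docs
     name: bodhi-servicemonitor      # hypothetical name
   spec:
     endpoints:
       - interval: 30s               # assumed scrape interval
         port: web                   # assumed name of the Service port exposing /metrics
         path: /metrics
     selector:
       matchLabels:
         app: bodhi                  # assumed Service label
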
@@ -30,6 +31,7 @@ machinery at our disposal, see `Matcher <https://v1-17.docs.kubernetes.io/docs/r
To manage alerting, you can create an alerting rule:

::

   apiVersion: monitoring.coreos.com/v1
   kind: PrometheusRule
   metadata:

@@ -79,3 +81,32 @@ As part of the proof of concept we have instrumented Bodhi application,
to collect data through the prometheus_client python library:
https://github.com/fedora-infra/bodhi/pull/4079

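The PR above does the actual wiring into Bodhi; purely as a minimal, generic sketch of how the prometheus_client library is typically used (the metric name and port here are illustrative, not what the PR registers):

::

   import time
   from prometheus_client import Counter, start_http_server

   # Hypothetical metric; the Bodhi PR registers its own pyramid_* metrics instead.
   REQUESTS = Counter("myapp_requests_total", "Requests handled", ["status"])

   if __name__ == "__main__":
       start_http_server(8000)   # expose /metrics for prometheus to scrape
       while True:
           REQUESTS.labels(status="200").inc()
           time.sleep(1)
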
Notes on alerting
-----------------

To be notified of alerts, you need to be subscribed to receivers that
have been configured in alertmanager.

The configuration of the rules you want to alert on can be done in the namespace of your application.
For example:

::

   apiVersion: monitoring.coreos.com/v1
   kind: PrometheusRule
   metadata:
     labels:
       monitoring-key: cpe
     name: prometheus-application-monitoring-rules
   spec:
     groups:
       - name: general.rules
         rules:
           - alert: AlertBodhi500Status
             annotations:
               summary: Alerting on too many server errors
             expr: (100*sum(rate(pyramid_request_count{namespace="bodhi", path_info_pattern=~".*[^healthz]", status="500"}[20m]))/sum(rate(pyramid_request_count{namespace="bodhi", path_info_pattern=~".*[^healthz]"}[20m])))>1
             labels:
               severity: high

This rule would alert if more than 1% of responses return a 500 status code.

@@ -113,3 +113,95 @@ https://github.com/timescale/promscale
   - url: "http://promscale:9201/write"
   remote_read:
   - url: "http://promscale:9201/read"

Notes on auxiliary services
---------------------------

Prometheus is primarily targeted at collecting metrics from
services that have been instrumented to expose them. If your
service is not instrumented, or it is not a service at all,
i.e. a batch job, you need an adapter to help you with the metrics collection.

There are two services that help with this:

* `blackbox exporter <https://github.com/prometheus/blackbox_exporter>`_ to monitor services that have not been instrumented, by querying their public API
* `push gateway <https://prometheus.io/docs/practices/pushing/#should-i-be-using-the-pushgateway>`_ which helps collect information from batch jobs

Maintaining the push-gateway can be relegated to the application developer,
as it is lightweight, and by collecting metrics from the namespace it is running in,
the data will be correctly labeled.

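As a rough sketch of how a batch job could push its metrics with the same prometheus_client library (the gateway service name, port, and job label are assumptions, not our deployed values):

::

   from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

   registry = CollectorRegistry()
   last_success = Gauge(
       "batch_job_last_success_unixtime",
       "Last time the batch job finished successfully",
       registry=registry,
   )
   last_success.set_to_current_time()

   # "pushgateway:9091" is an assumed in-namespace service name and default port.
   push_to_gateway("pushgateway:9091", job="my-batch-job", registry=registry)
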
With the blackbox exporter, it can be beneficial to have it running as a prometheus side-car,
in a similar fashion to how we configure oauth-proxy, adding this to the containers section
of the prometheus definition:

::

   - name: blackbox-exporter
     volumeMounts:
     - name: configmap-blackbox
       mountPath: /etc/blackbox-config
     - mountPath: /etc/tls/private
       name: secret-prometheus-k8s-tls
     image: quay.io/prometheus/blackbox-exporter:4.4
     args:
     - '--config.file=/etc/blackbox-config/blackbox.yml'
     ports:
     - containerPort: 9115
       name: blackbox

We can then instruct what is to be monitored through the configmap-blackbox; you can find `relevant examples <https://github.com/prometheus/blackbox_exporter/blob/master/example.yml>`_ in the project repo.
Because the blackbox exporter is in the same pod, we need to use the additional-scrape-config to add it in.

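As a rough sketch of what that additional scrape config could contain (the job name, module, and probed URL are illustrative assumptions; the relabeling follows the blackbox exporter's upstream examples):

::

   - job_name: 'blackbox-http'
     metrics_path: /probe
     params:
       module: [http_2xx]
     static_configs:
       - targets:
           - https://bodhi.fedoraproject.org   # hypothetical probe target
     relabel_configs:
       - source_labels: [__address__]
         target_label: __param_target
       - source_labels: [__param_target]
         target_label: instance
       - target_label: __address__
         replacement: 127.0.0.1:9115           # the blackbox exporter side-car in this pod
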
Notes on alerting
-----------------

Prometheus as is can have rules configured that trigger alerts once
a specific query evaluates to true. The definition of the rule is explained in the companion docs
for prometheus for developers and can be created in the namespace of the running application.

Here, we need to focus on what happens with an alert after prometheus decides it should fire,
based on a rule.

In the prometheus CRD definition, there is a section about the alertmanager that is supposed to
manage the forwarding of these alerts.

::

   alerting:
     alertmanagers:
       - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
         name: alertmanager-service
         namespace: application-monitoring
         port: web
         scheme: https
         tlsConfig:
           caFile: /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
           serverName: alertmanager-service.application-monitoring.svc

We already have alertmanager running and configured by the alertmanager-operator.
Alertmanager itself is fairly simplistic, with a simple UI and API that allow for silencing an
alert for a given amount of time.

It is expected that the actual user interaction happens elsewhere,
either through services like OpsGenie, or through e.g. `integration with zabbix <https://devopy.io/setting-up-zabbix-alertmanager-integration/>`_

More of a build-it-yourself solution is to use e.g. https://karma-dashboard.io/,
but we haven't tried any of these as part of our POC.

To be notified of the alert, you need to have the `correct receiver configuration <https://prometheus.io/docs/alerting/latest/configuration/#email_config>`_ in the alertmanager's secret:

::

   global:
     resolve_timeout: 5m
   route:
     group_by: ['job']
     group_wait: 10s
     group_interval: 10s
     repeat_interval: 30m
     receiver: 'email'
   receivers:
   - name: 'email'
     email_configs:
     - to: 'asaleh@redhat.com'
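
If different applications need different receivers, the route tree can be extended with sub-routes; the following is only a sketch of the idea, where the extra receiver name, matcher label, and address are assumptions:

::

   route:
     receiver: 'email'
     routes:
       - match:
           namespace: bodhi              # route this application's alerts separately
         receiver: 'bodhi-team-email'    # hypothetical receiver
   receivers:
     - name: 'email'
       email_configs:
         - to: 'asaleh@redhat.com'
     - name: 'bodhi-team-email'
       email_configs:
         - to: 'bodhi-team@example.com'  # hypothetical address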