Prometheus WIP

2021-04-08 09:59:31 +02:00 · 2021-04-08 09:59:31 +02:00 · 72503f7dee
commit 72503f7dee
parent f96af0744a
2 changed files with 99 additions and 0 deletions
--- a/docs/monitoring_metrics/index.rst
+++ b/docs/monitoring_metrics/index.rst
@@ -32,5 +32,6 @@ In process we want to be able to answer the questions posed in the latest mailin
 .. toctree::
    :maxdepth: 1

+   prometheus
    faq

--- a/docs/monitoring_metrics/prometheus.rst
+++ b/docs/monitoring_metrics/prometheus.rst
@@ -0,0 +1,98 @@
+Monitoring / Metrics with Prometheus
+====================================
+
+For deployment, we combined the configuration of the Prometheus Operator and the application-monitoring operator.
+
+Beware: most of these deployment notes may become obsolete very soon.
+The POC was done on OpenShift 3.11, which limited us to an older version of the Prometheus Operator
+and to the no longer maintained application-monitoring operator.
+
+OpenShift 4.x, which we plan to use in the near future, has a supported monitoring stack integrated into the OpenShift deployment:
+
+* https://docs.openshift.com/container-platform/4.7/monitoring/understanding-the-monitoring-stack.html
+* https://docs.openshift.com/container-platform/4.7/monitoring/configuring-the-monitoring-stack.html#configuring-the-monitoring-stack
+* https://docs.openshift.com/container-platform/4.7/monitoring/enabling-monitoring-for-user-defined-projects.html
+
+The supported stack is more limited, especially with regard to adding user-defined PodMonitors and ServiceMonitors. Even if we wanted to
+run additional Prometheus instances, we should be able to skip the installation of the necessary operators, as all of them should already be present.
+
+
+Notes on operator deployment
+----------------------------
+
+The deployment in question was done by configuring the CRDs, the roles and role bindings, and the operator setup.
+
+The definitions are as follows:
+
+- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/prometheus-operator-crd
+- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/rbac/prometheus-operator-crd
+- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/rbac/prometheus-operator
+
+Once the operator is running correctly, you just define a Prometheus custom resource and the operator will create a Prometheus deployment for you.
+
+The POC lives in https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/openshift-apps/application-monitoring.yml
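+
+As a minimal sketch of such a Prometheus custom resource (not the POC's actual definition): the resource name,
+the namespace, the service account and the namespace label below are assumptions made for illustration.
+Only the ``monitoring-key: cpe`` label on monitors and rules is taken from the examples in this document.
+
+::
+
+  apiVersion: monitoring.coreos.com/v1
+  kind: Prometheus
+  metadata:
+    name: example-prometheus            # hypothetical name
+    namespace: application-monitoring   # assumed namespace
+  spec:
+    replicas: 1
+    serviceAccountName: prometheus-example   # hypothetical account, needs RBAC to discover targets
+    # Pick up ServiceMonitors, PodMonitors and rules that carry this label
+    serviceMonitorSelector:
+      matchLabels:
+        monitoring-key: cpe
+    podMonitorSelector:
+      matchLabels:
+        monitoring-key: cpe
+    ruleSelector:
+      matchLabels:
+        monitoring-key: cpe
+    # Only look in namespaces that carry this (assumed) label
+    serviceMonitorNamespaceSelector:
+      matchLabels:
+        monitoring-key: cpe
+    podMonitorNamespaceSelector:
+      matchLabels:
+        monitoring-key: cpe
+    ruleNamespaceSelector:
+      matchLabels:
+        monitoring-key: cpe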
+
+Notes on application monitoring self-service
+--------------------------------------------
+
+To get an application monitored in a given namespace, the namespace must have the correct label applied,
+and the namespace needs to contain either a PodMonitor or a ServiceMonitor resource
+that points to the service or pod that exports metrics.
+
+This way, the metrics will be scraped into the configured Prometheus and correctly labeled.
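+
+For illustration, a labeled namespace might look like the sketch below. The label key and value are
+assumptions here; the label actually required is whatever the namespace selector of the deployed Prometheus matches.
+
+::
+
+  apiVersion: v1
+  kind: Namespace
+  metadata:
+    name: bodhi
+    labels:
+      monitoring-key: cpe   # assumed label, check the Prometheus namespace selector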
+
+As an example, let's look at a ServiceMonitor for bodhi:
+
+::
+
+  apiVersion: monitoring.coreos.com/v1
+  kind: ServiceMonitor
+  metadata:
+    labels:
+      monitoring-key: cpe
+    name: bodhi-service
+    namespace: bodhi
+  spec:
+    endpoints:
+      - path: /metrics
+    selector:
+      matchLabels:
+        service: web
+
+In this example, we are only targeting the service with the label ``service: web``, but we have the entire matching
+machinery at our disposal, see `Matcher <https://v1-17.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.17/#labelselector-v1-meta>`_.
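+
+As a sketch of what the rest of the matching machinery allows, the hypothetical PodMonitor below selects pods
+with ``matchExpressions`` instead of ``matchLabels``. The resource name, the ``app`` label with its values and
+the port name ``web`` are assumptions, not taken from the bodhi deployment.
+
+::
+
+  apiVersion: monitoring.coreos.com/v1
+  kind: PodMonitor
+  metadata:
+    labels:
+      monitoring-key: cpe
+    name: bodhi-pods          # hypothetical name
+    namespace: bodhi
+  spec:
+    podMetricsEndpoints:
+      - port: web             # assumed name of the container port
+        path: /metrics
+    selector:
+      matchExpressions:
+        # Match pods whose "app" label is any of the listed values
+        - key: app
+          operator: In
+          values:
+            - bodhi-web
+            - bodhi-celery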
+
+To manage alerting, you can create an alerting rule:
+
+::
+
+    apiVersion: monitoring.coreos.com/v1
+    kind: PrometheusRule
+    metadata:
+      labels:
+        monitoring-key: cpe
+      name: bodhi-rules
+    spec:
+      groups:
+        - name: general.rules
+          rules:
+            - alert: DeadMansSwitch
+              annotations:
+                description: >-
+                  This is a DeadMansSwitch meant to ensure that the entire Alerting
+                  pipeline is functional.
+                summary: Alerting DeadMansSwitch
+              expr: vector(1)
+              labels:
+                severity: none
+
+This would create an alert that always fires, to serve as a check that alerting works.
+You should be able to see it in Alertmanager.
+
+To have an alert that actually does something, set ``expr`` to something other than ``vector(1)``.
+For example, to alert when the rate of 500 responses of a service, averaged over the past 10 minutes, goes above 5 per second:
+
+::
+
+  sum(rate(pyramid_request_count{job="bodhi-web", status="500"}[10m])) > 5
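+
+Put into a rule, the expression above could look like the following sketch. The resource, group and alert names,
+the ``for`` duration, the severity and the annotation text are assumptions chosen for illustration.
+
+::
+
+  apiVersion: monitoring.coreos.com/v1
+  kind: PrometheusRule
+  metadata:
+    labels:
+      monitoring-key: cpe
+    name: bodhi-error-rules   # hypothetical name
+    namespace: bodhi
+  spec:
+    groups:
+      - name: bodhi.rules
+        rules:
+          - alert: BodhiHighErrorRate   # hypothetical alert name
+            expr: sum(rate(pyramid_request_count{job="bodhi-web", status="500"}[10m])) > 5
+            # Fire only if the condition holds continuously for this long (assumed value)
+            for: 5m
+            labels:
+              severity: warning
+            annotations:
+              summary: bodhi-web returns more than 5 HTTP 500 responses per second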
+
+The alerts themselves are then routed for further processing and notification according to the rules in Alertmanager.
+These rules cannot be changed from the developers' namespaces.
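+
+For orientation, routing in Alertmanager is driven by a configuration along the lines of the sketch below.
+The receiver names and the webhook URL are hypothetical and do not describe the actual configuration.
+
+::
+
+  route:
+    receiver: default
+    group_by: ['alertname']
+    routes:
+      # Keep the always-firing DeadMansSwitch away from human-facing receivers
+      - match:
+          alertname: DeadMansSwitch
+        receiver: 'null'
+  receivers:
+    - name: default
+      webhook_configs:
+        - url: http://notifier.example.com/alerts   # hypothetical endpoint
+    - name: 'null'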
+
+Alerts can be managed and acknowledged in Alertmanager, in a rudimentary fashion.