Monitoring / Metrics with Prometheus
====================================

For the deployment we used a combination of the prometheus operator and the application-monitoring operator.

Beware: most of these deployment notes may become obsolete very quickly.
The POC was done on OpenShift 3.11, which limited us to an older version of the prometheus operator,
as well as the no longer maintained application-monitoring operator.

In OpenShift 4.x, which we plan to use in the near future, there is a supported monitoring stack integrated in the OpenShift deployment:

* https://docs.openshift.com/container-platform/4.7/monitoring/understanding-the-monitoring-stack.html
* https://docs.openshift.com/container-platform/4.7/monitoring/configuring-the-monitoring-stack.html#configuring-the-monitoring-stack
* https://docs.openshift.com/container-platform/4.7/monitoring/enabling-monitoring-for-user-defined-projects.html

The supported stack is more limited, especially w.r.t. adding user-defined pod and service monitors, but even if we wanted to
run additional prometheus instances, we should be able to skip the installation of the necessary operators, as all of them should already be present.
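
The third link above describes how the supported user-workload monitoring is switched on. As a rough sketch based on that 4.7 documentation, it boils down to a single ConfigMap setting (the name, namespace and key come from the OpenShift docs; the decision to enable it is ours):

::

  # ConfigMap read by the built-in cluster monitoring operator;
  # enableUserWorkload deploys the user-workload Prometheus stack.
  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: cluster-monitoring-config
    namespace: openshift-monitoring
  data:
    config.yaml: |
      enableUserWorkload: true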

Notes on operator deployment
----------------------------

The deployment in question was done by configuring the CRDs, the roles and rolebindings, and the operator setup.

The definitions are as follows:

- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/prometheus-operator-crd
- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/rbac/prometheus-operator-crd
- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/rbac/prometheus-operator

Once the operator is running correctly, you just define a Prometheus custom resource and the operator will create a prometheus deployment for you, as sketched below.
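
A minimal sketch of such a Prometheus resource is shown below; the name, namespace, service account and replica count are illustrative assumptions rather than the values from the actual playbook, only the ``monitoring-key: cpe`` selector label comes from the examples further down:

::

  # Minimal Prometheus custom resource; the operator reacts to it by
  # creating the actual prometheus server deployment.
  apiVersion: monitoring.coreos.com/v1
  kind: Prometheus
  metadata:
    name: application-monitoring        # assumed name
    namespace: application-monitoring   # assumed namespace
  spec:
    replicas: 2                         # assumed replica count
    serviceAccountName: prometheus      # assumed pre-created service account
    # pick up ServiceMonitors/PodMonitors carrying the monitoring-key label
    serviceMonitorSelector:
      matchLabels:
        monitoring-key: cpe
    podMonitorSelector:
      matchLabels:
        monitoring-key: cpe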

The POC lives in https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/openshift-apps/application-monitoring.yml

Notes on application monitoring self-service
---------------------------------------------

To get the application monitored in the given namespace, the namespace must have the correct label applied,
and in the namespace there needs to be either a PodMonitor or a ServiceMonitor resource set up
that points towards the service or pod that exports metrics.

This way, the metrics will be scraped into the configured prometheus and correctly labeled.
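
As a purely illustrative sketch of the namespace label requirement (the actual label key and value are whatever the deployed prometheus is configured to select on; ``monitoring-key: cpe`` is an assumption here):

::

  apiVersion: v1
  kind: Namespace
  metadata:
    name: bodhi
    labels:
      monitoring-key: cpe   # assumed label; must match the namespace selector in use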

As an example, let's look at the ServiceMonitor for bodhi:

::

  apiVersion: monitoring.coreos.com/v1
  kind: ServiceMonitor
  metadata:
    labels:
      monitoring-key: cpe
    name: bodhi-service
    namespace: bodhi
  spec:
    endpoints:
    - path: /metrics
    selector:
      matchLabels:
        service: web

In this example, we are only targeting the service with the label ``service: web``, but we have the entire matching
machinery at our disposal, see `Matcher <https://v1-17.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.17/#labelselector-v1-meta>`_.
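
Since a PodMonitor works as the alternative to a ServiceMonitor, a hypothetical PodMonitor equivalent could look like the following; the pod label and the metrics port name are assumptions:

::

  # Hypothetical PodMonitor scraping the pods directly instead of going
  # through a Service.
  apiVersion: monitoring.coreos.com/v1
  kind: PodMonitor
  metadata:
    labels:
      monitoring-key: cpe
    name: bodhi-pods               # assumed name
    namespace: bodhi
  spec:
    podMetricsEndpoints:
    - path: /metrics
      port: web                    # assumed name of the container port exposing metrics
    selector:
      matchLabels:
        app: bodhi-web             # assumed pod label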

To manage alerting, you can create an alerting rule:

::

  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    labels:
      monitoring-key: cpe
    name: bodhi-rules
  spec:
    groups:
    - name: general.rules
      rules:
      - alert: DeadMansSwitch
        annotations:
          description: >-
            This is a DeadMansSwitch meant to ensure that the entire Alerting
            pipeline is functional.
          summary: Alerting DeadMansSwitch
        expr: vector(1)
        labels:
          severity: none

This would create an alert that always fires, to serve as a check that alerting works.
You should be able to see it in Alertmanager.

To have an alert that actually does something, you should set ``expr`` to something other than ``vector(1)``.
For example, to alert when the rate of 500 responses of a service goes over 5/s in the past 10 minutes:

::

  sum(rate(pyramid_request_count{job="bodhi-web", status="500"}[10m])) > 5
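
Wired into a rule, that expression could be carried by its own PrometheusRule; the alert name, ``for`` duration, severity and annotation below are illustrative assumptions, only the expression comes from above:

::

  # Hypothetical rule wrapping the example expression above.
  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    labels:
      monitoring-key: cpe
    name: bodhi-500-rules            # assumed name
  spec:
    groups:
    - name: bodhi.rules              # assumed group name
      rules:
      - alert: BodhiTooMany500s      # assumed alert name
        expr: sum(rate(pyramid_request_count{job="bodhi-web", status="500"}[10m])) > 5
        for: 10m                     # assumed: condition must hold for 10 minutes before firing
        labels:
          severity: warning          # assumed severity
        annotations:
          summary: bodhi-web is returning too many 500 responses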

The alerts themselves would then be routed for further processing and notification according to the rules in Alertmanager;
these rules are not available to change from the developers' namespaces.

Managing and acknowledging alerts can be done in Alertmanager in a rudimentary fashion.