
Notes on application monitoring self-service
============================================

To have an application monitored in a given namespace, the namespace must have the
correct label applied, and the namespace must contain either a PodMonitor or a
ServiceMonitor resource that points at the service or pod exporting the metrics.

This way, the metrics are scraped into the configured Prometheus and labeled
correctly.
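
For a pod that exports metrics without a Service in front of it, a PodMonitor could
look roughly like the sketch below. The resource name, the ``app: bodhi`` pod label and
the ``metrics`` port name are hypothetical and need to match your own deployment:

.. code-block:: yaml

    apiVersion: monitoring.coreos.com/v1
    kind: PodMonitor
    metadata:
      labels:
        monitoring-key: cpe
      name: bodhi-pods          # hypothetical name
      namespace: bodhi
    spec:
      podMetricsEndpoints:
        - path: /metrics
          port: metrics         # hypothetical name of the container port
      selector:
        matchLabels:
          app: bodhi            # hypothetical pod label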

As an example, let's look at the ServiceMonitor for Bodhi:

.. code-block::

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      labels:
        monitoring-key: cpe
      name: bodhi-service
      namespace: bodhi
    spec:
      endpoints:
        - path: /metrics
      selector:
        matchLabels:
          service: web

In this example, we only target the service with the label ``service: web``, but the
entire label-matching machinery is at our disposal; see `Matcher
<https://v1-17.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.17/#labelselector-v1-meta>`_.
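
As a rough sketch of that matching machinery, a selector can use ``matchExpressions``
instead of, or together with, ``matchLabels``; all listed requirements must hold at the
same time. The resource name and the label keys and values below (``api``, ``canary``)
are hypothetical and only illustrate the syntax:

.. code-block:: yaml

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      labels:
        monitoring-key: cpe
      name: bodhi-service-multi   # hypothetical name
      namespace: bodhi
    spec:
      endpoints:
        - path: /metrics
      selector:
        matchExpressions:
          # select services whose "service" label is either web or api
          - key: service
            operator: In
            values:
              - web
              - api
          # and that do not carry a "canary" label at all
          - key: canary
            operator: DoesNotExist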

To manage alerting, you can create an alerting rule:

.. code-block::

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      labels:
        monitoring-key: cpe
      name: bodhi-rules
    spec:
      groups:
        - name: general.rules
          rules:
            - alert: DeadMansSwitch
              annotations:
                description: >-
                  This is a DeadMansSwitch meant to ensure that the entire Alerting
                  pipeline is functional.
                summary: Alerting DeadMansSwitch
              expr: vector(1)
              labels:
                severity: none

This creates an alert that always fires, serving as a check that the alerting pipeline
works. You should be able to see it in Alertmanager.

To have an alert that actually does something, set ``expr`` to something other than
``vector(1)``. For example, to alert when the per-second rate of 500 responses from a
service, averaged over the past 10 minutes, goes over 5:

.. code-block::

    sum(rate(pyramid_request_count{job="bodhi-web", status="500"}[10m])) > 5
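
Placed in a rule, that expression might look like the sketch below. The resource,
group and alert names are hypothetical, and the ``for`` clause is an assumption added
here to show how to require the condition to hold for a while before the alert fires:

.. code-block:: yaml

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      labels:
        monitoring-key: cpe
      name: bodhi-error-rate-rules      # hypothetical name
    spec:
      groups:
        - name: bodhi.rules             # hypothetical group name
          rules:
            - alert: BodhiHigh500Rate   # hypothetical alert name
              annotations:
                summary: More than 5 responses per second with status 500
              expr: sum(rate(pyramid_request_count{job="bodhi-web", status="500"}[10m])) > 5
              # assumption: only fire once the condition has held for 5 minutes
              for: 5m
              labels:
                severity: high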

The alerts themselves are then routed for further processing and notification
according to the rules in Alertmanager. These rules cannot be changed from the
developers' namespaces.

Alerts can be managed and acknowledged in Alertmanager in a rudimentary fashion.

Notes on instrumenting the application
======================================

Prometheus expects the applications it scrapes to be services that expose a
``/metrics`` endpoint serving metrics in the correct format.
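
In a cluster, this usually means a Service object sitting in front of the application
pods. Below is a minimal sketch of a Service that a ServiceMonitor selecting
``service: web`` would match. The name, pod label and port values are assumptions, not
taken from the actual Bodhi deployment:

.. code-block:: yaml

    apiVersion: v1
    kind: Service
    metadata:
      name: bodhi-web           # hypothetical name
      namespace: bodhi
      labels:
        service: web            # the label a ServiceMonitor selector matches on
    spec:
      selector:
        app: bodhi-web          # hypothetical pod label
      ports:
        - name: web             # hypothetical port name
          port: 8080            # hypothetical port serving /metrics
          targetPort: 8080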

There are libraries that help with this for many different languages, confusingly called
client libraries even though they usually export metrics through an HTTP server endpoint:
https://prometheus.io/docs/instrumenting/clientlibs/

As part of the proof of concept, we have instrumented the Bodhi application to collect
data through the ``prometheus_client`` Python library:
https://github.com/fedora-infra/bodhi/pull/4079

Notes on alerting
=================

To be notified of alerts, you need to be subscribed to receivers that have been
configured in Alertmanager.

The rules you want to alert on can be configured in the namespace of your
application. For example:
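
For orientation only: receivers and the routes leading to them are defined in the
Alertmanager configuration, which is managed outside the developers' namespaces. Below
is a sketch of what such a configuration might contain, with hypothetical receiver names
and a placeholder address. A real setup also needs the global SMTP settings, which are
omitted here:

.. code-block:: yaml

    route:
      receiver: default               # hypothetical fallback receiver
      routes:
        # send alerts labeled severity=high to a dedicated receiver
        - matchers:
            - severity="high"
          receiver: bodhi-team        # hypothetical receiver name
    receivers:
      - name: default
      - name: bodhi-team
        email_configs:
          - to: bodhi-alerts@example.com   # placeholder address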

.. code-block::

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      labels:
        monitoring-key: cpe
      name: prometheus-application-monitoring-rules
    spec:
      groups:
        - name: general.rules
          rules:
            - alert: AlertBodhi500Status
              annotations:
                summary: Alerting on too many server errors
              expr: (100*sum(rate(pyramid_request_count{namespace="bodhi", path_info_pattern=~".*[^healthz]", status="500"}[20m]))/sum(rate(pyramid_request_count{namespace="bodhi", path_info_pattern=~".*[^healthz]"}[20m])))>1
              labels:
                severity: high

This rule alerts if more than 1% of responses have a 500 status code.