Notes on application monitoring self-service
============================================

To have an application monitored in a given namespace, the namespace must have the
correct label applied, and the namespace must contain either a PodMonitor or a
ServiceMonitor resource that points to the service or pod that exports metrics.

This way, the metrics will be scraped into the configured Prometheus and correctly
labeled.

As an example, let's look at the ServiceMonitor for bodhi:

.. code-block:: yaml

   apiVersion: monitoring.coreos.com/v1
   kind: ServiceMonitor
   metadata:
     labels:
       monitoring-key: cpe
     name: bodhi-service
     namespace: bodhi
   spec:
     endpoints:
       - path: /metrics
     selector:
       matchLabels:
         service: web

In this example, we are only targeting the service with the label ``service: web``, but we
have the entire matching machinery at our disposal; see `Matcher
<https://v1-17.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.17/#labelselector-v1-meta>`_.

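A PodMonitor is configured along the same lines, and either resource can use
``matchExpressions`` instead of ``matchLabels``. The following is a minimal sketch rather
than a tested configuration: the resource name ``bodhi-pods``, the port name ``http``, and
the label values ``web`` and ``api`` are hypothetical and would need to match the actual
deployment.

.. code-block:: yaml

   apiVersion: monitoring.coreos.com/v1
   kind: PodMonitor
   metadata:
     labels:
       monitoring-key: cpe
     # Hypothetical name, chosen for this sketch only.
     name: bodhi-pods
     namespace: bodhi
   spec:
     podMetricsEndpoints:
       - path: /metrics
         # Assumed name of the container port that serves metrics.
         port: http
     selector:
       # Select pods whose "service" label is either "web" or "api".
       matchExpressions:
         - key: service
           operator: In
           values:
             - web
             - api
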
To manage alerting, you can create an alerting rule:

.. code-block:: yaml

   apiVersion: monitoring.coreos.com/v1
   kind: PrometheusRule
   metadata:
     labels:
       monitoring-key: cpe
     name: bodhi-rules
   spec:
     groups:
       - name: general.rules
         rules:
           - alert: DeadMansSwitch
             annotations:
               description: >-
                 This is a DeadMansSwitch meant to ensure that the entire Alerting
                 pipeline is functional.
               summary: Alerting DeadMansSwitch
             expr: vector(1)
             labels:
               severity: none

This would create an alert that always fires, to serve as a check that alerting
works. You should be able to see it in Alertmanager.

To have an alert that actually does something, set ``expr`` to something other than
``vector(1)``. For example, to alert when the rate of 500 responses of a service exceeds
5/s over the past 10 minutes::

   sum(rate(pyramid_request_count{job="bodhi-web", status="500"}[10m])) > 5

The alerts themselves are then routed for further processing and notification
according to the rules in Alertmanager; these cannot be changed from the
developers' namespaces.

Alerts can be managed and acknowledged in Alertmanager in a rudimentary
fashion.

Notes on instrumenting the application
======================================

Prometheus expects the applications it scrapes metrics from to be services that expose a
``/metrics`` endpoint serving metrics in the correct format.

There are libraries that help with this for many different languages, confusingly called
client libraries, even though they usually export metrics through an HTTP server endpoint:
https://prometheus.io/docs/instrumenting/clientlibs/

As part of the proof of concept, we have instrumented the Bodhi application to collect data
through the prometheus_client Python library:
https://github.com/fedora-infra/bodhi/pull/4079

Notes on alerting
=================

To be notified of alerts, you need to be subscribed to receivers that have been
configured in Alertmanager.

The rules you want to alert on can be configured in the namespace of your
application. For example:

.. code-block:: yaml

   apiVersion: monitoring.coreos.com/v1
   kind: PrometheusRule
   metadata:
     labels:
       monitoring-key: cpe
     name: prometheus-application-monitoring-rules
   spec:
     groups:
       - name: general.rules
         rules:
           - alert: AlertBodhi500Status
             annotations:
               summary: Alerting on too many server errors
             expr: (100*sum(rate(pyramid_request_count{namespace="bodhi", path_info_pattern=~".*[^healthz]", status="500"}[20m]))/sum(rate(pyramid_request_count{namespace="bodhi", path_info_pattern=~".*[^healthz]"}[20m])))>1
             labels:
               severity: high

This would alert if more than 1% of responses over the past 20 minutes have a 500 status
code.

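One caveat with the expression above: ``[^healthz]`` is a character class, so the pattern
excludes every path whose last character is one of ``h``, ``e``, ``a``, ``l``, ``t`` or
``z``, not only the health check. A possible variant, offered as an untested sketch, uses
the negative regex matcher ``!~`` and adds a ``for`` clause so that the alert only fires
once the condition has held for a while. The rule name, the ``5m`` duration, and the
assumption that the health check path ends in ``healthz`` are illustrative and not taken
from the deployed configuration.

.. code-block:: yaml

   apiVersion: monitoring.coreos.com/v1
   kind: PrometheusRule
   metadata:
     labels:
       monitoring-key: cpe
     # Hypothetical name, chosen for this sketch only.
     name: bodhi-error-ratio-rules
   spec:
     groups:
       - name: general.rules
         rules:
           - alert: AlertBodhi500Status
             annotations:
               summary: Alerting on too many server errors
             # Percentage of 500 responses, ignoring paths that end in "healthz".
             expr: >-
               100 * sum(rate(pyramid_request_count{namespace="bodhi", path_info_pattern!~".*healthz", status="500"}[20m]))
               / sum(rate(pyramid_request_count{namespace="bodhi", path_info_pattern!~".*healthz"}[20m]))
               > 1
             # Assumed duration: fire only after the condition has held for 5 minutes.
             for: 5m
             labels:
               severity: high
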