Monitoring / Metrics with Prometheus
====================================

For the deployment we used a combination of the prometheus operator and the application-monitoring operator.

Beware: most of these deployment notes may become obsolete very quickly.
The POC was done on OpenShift 3.11, which limited us to an older version of the prometheus operator,
as well as the no longer maintained application-monitoring operator.

In OpenShift 4.x, which we plan to use in the near future, there is a supported monitoring stack integrated in the OpenShift deployment:

* https://docs.openshift.com/container-platform/4.7/monitoring/understanding-the-monitoring-stack.html
* https://docs.openshift.com/container-platform/4.7/monitoring/configuring-the-monitoring-stack.html#configuring-the-monitoring-stack
* https://docs.openshift.com/container-platform/4.7/monitoring/enabling-monitoring-for-user-defined-projects.html

The supported stack is more limited, especially w.r.t. adding user-defined pod and service monitors, but even if we wanted to
run additional prometheus instances, we should be able to skip the installation of the necessary operators, as all of them should already be present.
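
The third link above describes how the supported user-workload monitoring is switched on. As a rough sketch based on that 4.7 documentation, it boils down to a single ConfigMap setting (the name, namespace and key come from the OpenShift docs; the decision to enable it is ours):

::

  # ConfigMap read by the built-in cluster monitoring operator;
  # enableUserWorkload deploys the user-workload Prometheus stack.
  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: cluster-monitoring-config
    namespace: openshift-monitoring
  data:
    config.yaml: |
      enableUserWorkload: true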

Notes on operator deployment
----------------------------

The deployment in question was done by configuring the CRDs, the roles and rolebindings, and the operator setup.

The definitions are as follows:

- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/prometheus-operator-crd
- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/rbac/prometheus-operator-crd
- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/rbac/prometheus-operator

Once the operator is running correctly, you just define a Prometheus custom resource and the operator will create a prometheus deployment for you, as sketched below.
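
A minimal sketch of such a Prometheus resource is shown below; the name, namespace, service account and replica count are illustrative assumptions rather than the values from the actual playbook, only the ``monitoring-key: cpe`` selector label comes from the examples further down:

::

  # Minimal Prometheus custom resource; the operator reacts to it by
  # creating the actual prometheus server deployment.
  apiVersion: monitoring.coreos.com/v1
  kind: Prometheus
  metadata:
    name: application-monitoring        # assumed name
    namespace: application-monitoring   # assumed namespace
  spec:
    replicas: 2                         # assumed replica count
    serviceAccountName: prometheus      # assumed pre-created service account
    # pick up ServiceMonitors/PodMonitors carrying the monitoring-key label
    serviceMonitorSelector:
      matchLabels:
        monitoring-key: cpe
    podMonitorSelector:
      matchLabels:
        monitoring-key: cpe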

The POC lives in https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/openshift-apps/application-monitoring.yml

Notes on application monitoring self-service
---------------------------------------------

To get the application monitored in the given namespace, the namespace must have the correct label applied,
and in the namespace there needs to be either a PodMonitor or a ServiceMonitor resource set up
that points towards the service or pod that exports metrics.

This way, the metrics will be scraped into the configured prometheus and correctly labeled.
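
As a purely illustrative sketch of the namespace label requirement (the actual label key and value are whatever the deployed prometheus is configured to select on; ``monitoring-key: cpe`` is an assumption here):

::

  apiVersion: v1
  kind: Namespace
  metadata:
    name: bodhi
    labels:
      monitoring-key: cpe   # assumed label; must match the namespace selector in use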

As an example, let's look at the ServiceMonitor for bodhi:

::

  apiVersion: monitoring.coreos.com/v1
  kind: ServiceMonitor
  metadata:
    labels:
      monitoring-key: cpe
    name: bodhi-service
    namespace: bodhi
  spec:
    endpoints:
    - path: /metrics
    selector:
      matchLabels:
        service: web

In this example, we are only targeting the service with the label ``service: web``, but we have the entire matching
machinery at our disposal, see `Matcher <https://v1-17.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.17/#labelselector-v1-meta>`_.
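
Since a PodMonitor works as the alternative to a ServiceMonitor, a hypothetical PodMonitor equivalent could look like the following; the pod label and the metrics port name are assumptions:

::

  # Hypothetical PodMonitor scraping the pods directly instead of going
  # through a Service.
  apiVersion: monitoring.coreos.com/v1
  kind: PodMonitor
  metadata:
    labels:
      monitoring-key: cpe
    name: bodhi-pods               # assumed name
    namespace: bodhi
  spec:
    podMetricsEndpoints:
    - path: /metrics
      port: web                    # assumed name of the container port exposing metrics
    selector:
      matchLabels:
        app: bodhi-web             # assumed pod label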

To manage alerting, you can create an alerting rule:

::

  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    labels:
      monitoring-key: cpe
    name: bodhi-rules
  spec:
    groups:
    - name: general.rules
      rules:
      - alert: DeadMansSwitch
        annotations:
          description: >-
            This is a DeadMansSwitch meant to ensure that the entire Alerting
            pipeline is functional.
          summary: Alerting DeadMansSwitch
        expr: vector(1)
        labels:
          severity: none

This would create an alert that always fires, to serve as a check that alerting works.
You should be able to see it in Alertmanager.

To have an alert that actually does something, you should set ``expr`` to something other than ``vector(1)``.
For example, to alert when the rate of 500 responses of a service goes over 5/s in the past 10 minutes:

::

  sum(rate(pyramid_request_count{job="bodhi-web", status="500"}[10m])) > 5
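
Wired into a rule, that expression could be carried by its own PrometheusRule; the alert name, ``for`` duration, severity and annotation below are illustrative assumptions, only the expression comes from above:

::

  # Hypothetical rule wrapping the example expression above.
  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    labels:
      monitoring-key: cpe
    name: bodhi-500-rules            # assumed name
  spec:
    groups:
    - name: bodhi.rules              # assumed group name
      rules:
      - alert: BodhiTooMany500s      # assumed alert name
        expr: sum(rate(pyramid_request_count{job="bodhi-web", status="500"}[10m])) > 5
        for: 10m                     # assumed: condition must hold for 10 minutes before firing
        labels:
          severity: warning          # assumed severity
        annotations:
          summary: bodhi-web is returning too many 500 responses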

The alerts themselves would then be routed for further processing and notification according to the rules in Alertmanager;
these rules are not available to change from the developers' namespaces.

Managing and acknowledging alerts can be done in Alertmanager in a rudimentary fashion.