diff --git a/docs/monitoring_metrics/index.rst b/docs/monitoring_metrics/index.rst
index d1ae92a..d91074b 100644
--- a/docs/monitoring_metrics/index.rst
+++ b/docs/monitoring_metrics/index.rst
@@ -32,5 +32,6 @@ In process we want to be able to answer the questions posed in the latest mailin
 
 .. toctree::
    :maxdepth: 1
 
+   prometheus
    faq
diff --git a/docs/monitoring_metrics/prometheus.rst b/docs/monitoring_metrics/prometheus.rst
new file mode 100644
index 0000000..f6bc601
--- /dev/null
+++ b/docs/monitoring_metrics/prometheus.rst
@@ -0,0 +1,98 @@
+Monitoring / Metrics with Prometheus
+====================================
+
+For the deployment, we combined the configuration of the Prometheus operator and the application-monitoring operator.
+
+Beware: most of these deployment notes could become obsolete quite soon.
+The POC was done on OpenShift 3.11, which limited us to an older version of the Prometheus operator,
+as well as to the no longer maintained application-monitoring operator.
+
+In OpenShift 4.x, which we plan to use in the near future, there is a supported way that is integrated into the OpenShift deployment:
+
+* https://docs.openshift.com/container-platform/4.7/monitoring/understanding-the-monitoring-stack.html
+* https://docs.openshift.com/container-platform/4.7/monitoring/configuring-the-monitoring-stack.html#configuring-the-monitoring-stack
+* https://docs.openshift.com/container-platform/4.7/monitoring/enabling-monitoring-for-user-defined-projects.html
+
+The supported stack is more limited, especially with respect to adding user-defined pod and service monitors, but even if we wanted to
+run additional Prometheus instances, we should be able to skip the installation of the necessary operators, as all of them should already be present.
+
+
+Notes on operator deployment
+----------------------------
+
+The deployment in question was done by configuring the CRDs, the roles and role bindings, and the operator setup, using the following definitions:
+
+- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/prometheus-operator-crd
+- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/rbac/prometheus-operator-crd
+- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/rbac/prometheus-operator
+
+Once the operator is running correctly, you just define a Prometheus custom resource, and the operator creates the Prometheus deployment for you.
+
+The POC lives in https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/openshift-apps/application-monitoring.yml
+
+Notes on application monitoring self-service
+--------------------------------------------
+
+To get an application monitored in a given namespace, the namespace must have the correct label applied,
+and the namespace must contain either a PodMonitor or a ServiceMonitor resource
+that points to the service or pod that exports metrics.
+
+This way, the metrics will be scraped into the configured Prometheus and correctly labeled.
+
+As an example, let's look at the ServiceMonitor for bodhi:
+
+::
+
+   apiVersion: monitoring.coreos.com/v1
+   kind: ServiceMonitor
+   metadata:
+     labels:
+       monitoring-key: cpe
+     name: bodhi-service
+     namespace: bodhi
+   spec:
+     endpoints:
+     - path: /metrics
+     selector:
+       matchLabels:
+         service: web
+
+In this example, we are only targeting the service with the label ``service: web``, but we have the entire matching
+machinery at our disposal; see the Matcher documentation.
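+
+For pods that are not fronted by a service, a PodMonitor can be set up in much the same way.
+The following is only a sketch: the name ``bodhi-pods``, the pod label ``app: bodhi`` and the port name ``metrics``
+are hypothetical and have to match what the pods in the namespace actually define.
+
+::
+
+   apiVersion: monitoring.coreos.com/v1
+   kind: PodMonitor
+   metadata:
+     labels:
+       monitoring-key: cpe
+     name: bodhi-pods          # hypothetical name
+     namespace: bodhi
+   spec:
+     podMetricsEndpoints:
+     - path: /metrics
+       port: metrics           # hypothetical: name of the container port serving metrics
+     selector:
+       matchLabels:
+         app: bodhi            # hypothetical pod label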
+
+To manage alerting, you can create an alerting rule:
+
+::
+
+   apiVersion: monitoring.coreos.com/v1
+   kind: PrometheusRule
+   metadata:
+     labels:
+       monitoring-key: cpe
+     name: bodhi-rules
+   spec:
+     groups:
+     - name: general.rules
+       rules:
+       - alert: DeadMansSwitch
+         annotations:
+           description: >-
+             This is a DeadMansSwitch meant to ensure that the entire Alerting
+             pipeline is functional.
+           summary: Alerting DeadMansSwitch
+         expr: vector(1)
+         labels:
+           severity: none
+
+This would create an alert that always fires, to serve as a check that alerting works.
+You should be able to see it in Alertmanager.
+
+To have an alert that actually does something, you should set expr to something other than vector(1).
+For example, to alert when the rate of 500 responses of a service goes over 5/s in the past 10 minutes::
+
+   sum(rate(pyramid_request_count{job="bodhi-web", status="500"}[10m])) > 5
+
+The alerts themselves would then be routed for further processing and notification according to the rules in Alertmanager;
+these rules cannot be changed from the developers' namespaces.
+
+Alerts can be managed and acknowledged in Alertmanager in a rudimentary fashion.
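+
+As a sketch of how the 500-response expression above could be wrapped into a rule of its own, consider the following.
+The rule name ``bodhi-error-rules``, the group name ``bodhi.rules``, the alert name ``BodhiHighErrorRate``
+and the severity value are assumptions chosen for illustration; they are not part of the POC.
+
+::
+
+   apiVersion: monitoring.coreos.com/v1
+   kind: PrometheusRule
+   metadata:
+     labels:
+       monitoring-key: cpe
+     name: bodhi-error-rules        # hypothetical name
+   spec:
+     groups:
+     - name: bodhi.rules            # hypothetical group name
+       rules:
+       - alert: BodhiHighErrorRate  # hypothetical alert name
+         annotations:
+           summary: bodhi-web is returning more than 5 HTTP 500 responses per second
+         expr: sum(rate(pyramid_request_count{job="bodhi-web", status="500"}[10m])) > 5
+         labels:
+           severity: warning        # assumed value; routing depends on the Alertmanager rules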