Added more prometheus documentation
This commit is contained in:
parent ed508e7b9b
commit 646a390c9a
2 changed files with 124 additions and 1 deletion
@@ -10,6 +10,7 @@ This way, the metrics will be scraped into the configured prometheus and correctly
As an example, let's look at the ServiceMonitor for bodhi:

::

   apiVersion: monitoring.coreos.com/v1
   kind: ServiceMonitor
   metadata:

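For orientation, a complete ServiceMonitor also declares a selector and endpoints; the following is only a minimal sketch, where the port name, selector labels, and scrape interval are illustrative assumptions rather than bodhi's actual values:

::

   apiVersion: monitoring.coreos.com/v1
   kind: ServiceMonitor
   metadata:
     labels:
       monitoring-key: cpe           # same monitoring label used elsewhere in these docs
     name: bodhi-servicemonitor      # hypothetical name
   spec:
     endpoints:
       - interval: 30s               # assumed scrape interval
         port: web                   # assumed name of the Service port exposing /metrics
         path: /metrics
     selector:
       matchLabels:
         app: bodhi                  # assumed Service label
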
@@ -30,6 +31,7 @@ machinery at our disposal, see `Matcher <https://v1-17.docs.kubernetes.io/docs/r
To manage alerting, you can create an alerting rule:

::

   apiVersion: monitoring.coreos.com/v1
   kind: PrometheusRule
   metadata:

@@ -79,3 +81,32 @@ As part of the proof of concept we have instrumented Bodhi application,
to collect data through the prometheus_client python library:
https://github.com/fedora-infra/bodhi/pull/4079

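The PR above does the actual wiring into Bodhi; purely as a minimal, generic sketch of how the prometheus_client library is typically used (the metric name and port here are illustrative, not what the PR registers):

::

   import time
   from prometheus_client import Counter, start_http_server

   # Hypothetical metric; the Bodhi PR registers its own pyramid_* metrics instead.
   REQUESTS = Counter("myapp_requests_total", "Requests handled", ["status"])

   if __name__ == "__main__":
       start_http_server(8000)   # expose /metrics for prometheus to scrape
       while True:
           REQUESTS.labels(status="200").inc()
           time.sleep(1)
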
Notes on alerting
-----------------

To be notified of alerts, you need to be subscribed to receivers that
have been configured in alertmanager.

The configuration of the rules you want to alert on can be done in the namespace of your application.
For example:

::

   apiVersion: monitoring.coreos.com/v1
   kind: PrometheusRule
   metadata:
     labels:
       monitoring-key: cpe
     name: prometheus-application-monitoring-rules
   spec:
     groups:
       - name: general.rules
         rules:
           - alert: AlertBodhi500Status
             annotations:
               summary: Alerting on too many server errors
             expr: (100*sum(rate(pyramid_request_count{namespace="bodhi", path_info_pattern=~".*[^healthz]", status="500"}[20m]))/sum(rate(pyramid_request_count{namespace="bodhi", path_info_pattern=~".*[^healthz]"}[20m])))>1
             labels:
               severity: high

This rule would alert if more than 1% of responses return a 500 status code.

@@ -113,3 +113,95 @@ https://github.com/timescale/promscale
   - url: "http://promscale:9201/write"
   remote_read:
   - url: "http://promscale:9201/read"

Notes on auxiliary services
---------------------------

Prometheus is primarily targeted at collecting metrics from
services that have been instrumented to expose them. If your
service is not instrumented, or it is not a service at all,
i.e. a batch job, you need an adapter to help you with the metrics collection.

There are two services that help with this:

* `blackbox exporter <https://github.com/prometheus/blackbox_exporter>`_ to monitor services that have not been instrumented, by querying their public API
* `push gateway <https://prometheus.io/docs/practices/pushing/#should-i-be-using-the-pushgateway>`_ which helps collect information from batch jobs

Maintaining the push-gateway can be relegated to the application developer,
as it is lightweight, and by collecting metrics from the namespace it is running in,
the data will be correctly labeled.

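As a rough sketch of how a batch job could push its metrics with the same prometheus_client library (the gateway service name, port, and job label are assumptions, not our deployed values):

::

   from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

   registry = CollectorRegistry()
   last_success = Gauge(
       "batch_job_last_success_unixtime",
       "Last time the batch job finished successfully",
       registry=registry,
   )
   last_success.set_to_current_time()

   # "pushgateway:9091" is an assumed in-namespace service name and default port.
   push_to_gateway("pushgateway:9091", job="my-batch-job", registry=registry)
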
With the blackbox exporter, it can be beneficial to have it running as a prometheus side-car,
in a similar fashion to how we configure oauth-proxy, adding this to the containers section
of the prometheus definition:

::

   - name: blackbox-exporter
     volumeMounts:
     - name: configmap-blackbox
       mountPath: /etc/blackbox-config
     - mountPath: /etc/tls/private
       name: secret-prometheus-k8s-tls
     image: quay.io/prometheus/blackbox-exporter:4.4
     args:
     - '--config.file=/etc/blackbox-config/blackbox.yml'
     ports:
     - containerPort: 9115
       name: blackbox

We can then instruct what is to be monitored through the configmap-blackbox; you can find `relevant examples <https://github.com/prometheus/blackbox_exporter/blob/master/example.yml>`_ in the project repo.
Because the blackbox exporter is in the same pod, we need to use the additional-scrape-config to add it in.

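As a rough sketch of what that additional scrape config could contain (the job name, module, and probed URL are illustrative assumptions; the relabeling follows the blackbox exporter's upstream examples):

::

   - job_name: 'blackbox-http'
     metrics_path: /probe
     params:
       module: [http_2xx]
     static_configs:
       - targets:
           - https://bodhi.fedoraproject.org   # hypothetical probe target
     relabel_configs:
       - source_labels: [__address__]
         target_label: __param_target
       - source_labels: [__param_target]
         target_label: instance
       - target_label: __address__
         replacement: 127.0.0.1:9115           # the blackbox exporter side-car in this pod
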
Notes on alerting
-----------------

Prometheus as is can have rules configured that trigger alerts once
a specific query evaluates to true. The definition of the rule is explained in the companion docs
for prometheus for developers and can be created in the namespace of the running application.

Here, we need to focus on what happens with an alert after prometheus decides it should fire,
based on a rule.

In the prometheus CRD definition, there is a section about the alertmanager that is supposed to
manage the forwarding of these alerts.

::

   alerting:
     alertmanagers:
       - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
         name: alertmanager-service
         namespace: application-monitoring
         port: web
         scheme: https
         tlsConfig:
           caFile: /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
           serverName: alertmanager-service.application-monitoring.svc

We already have alertmanager running and configured by the alertmanager-operator.
Alertmanager itself is fairly simplistic, with a simple UI and API that allow for silencing an
alert for a given amount of time.

It is expected that the actual user interaction happens elsewhere,
either through services like OpsGenie, or through e.g. `integration with zabbix <https://devopy.io/setting-up-zabbix-alertmanager-integration/>`_

More of a build-it-yourself solution is to use e.g. https://karma-dashboard.io/,
but we haven't tried any of these as part of our POC.

To be notified of the alert, you need to have the `correct receiver configuration <https://prometheus.io/docs/alerting/latest/configuration/#email_config>`_ in the alertmanager's secret:

::

   global:
     resolve_timeout: 5m
   route:
     group_by: ['job']
     group_wait: 10s
     group_interval: 10s
     repeat_interval: 30m
     receiver: 'email'
   receivers:
   - name: 'email'
     email_configs:
     - to: 'asaleh@redhat.com'
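
If different applications need different receivers, the route tree can be extended with sub-routes; the following is only a sketch of the idea, where the extra receiver name, matcher label, and address are assumptions:

::

   route:
     receiver: 'email'
     routes:
       - match:
           namespace: bodhi              # route this application's alerts separately
         receiver: 'bodhi-team-email'    # hypothetical receiver
   receivers:
     - name: 'email'
       email_configs:
         - to: 'asaleh@redhat.com'
     - name: 'bodhi-team-email'
       email_configs:
         - to: 'bodhi-team@example.com'  # hypothetical address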