Added more prometheus documentation
This commit is contained in:
parent
ed508e7b9b
commit
646a390c9a
2 changed files with 124 additions and 1 deletions
@@ -10,6 +10,7 @@ This way, the metrics will be scraped into the configured Prometheus and correctly

As an example, let's look at the ServiceMonitor for Bodhi:

::

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
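The diff shows only the beginning of the manifest. For orientation, a minimal complete ServiceMonitor might look like the following sketch; the port name and selector label are illustrative assumptions, not taken from the actual Bodhi deployment:

::

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      labels:
        monitoring-key: cpe
      name: bodhi-servicemonitor
    spec:
      endpoints:
      - interval: 30s
        port: web              # assumed name of the Service port exposing /metrics
      selector:
        matchLabels:
          app: bodhi           # assumed label on the Bodhi Service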
@@ -30,6 +31,7 @@ machinery at our disposal, see `Matcher <https://v1-17.docs.kubernetes.io/docs/r

To manage alerting, you can create an alerting rule:

::

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
@@ -79,3 +81,32 @@ As part of the proof of concept we have instrumented the Bodhi application,

to collect data through the prometheus_client Python library:
https://github.com/fedora-infra/bodhi/pull/4079
Notes on alerting
-----------------

To be notified of alerts, you need to be subscribed to receivers that
have been configured in Alertmanager.

The configuration of the rules you want to alert on can be done in the namespace of your application.
For example:

::

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      labels:
        monitoring-key: cpe
      name: prometheus-application-monitoring-rules
    spec:
      groups:
      - name: general.rules
        rules:
        - alert: AlertBodhi500Status
          annotations:
            summary: Alerting on too many server errors
          expr: (100*sum(rate(pyramid_request_count{namespace="bodhi", path_info_pattern!~".*healthz.*", status="500"}[20m]))/sum(rate(pyramid_request_count{namespace="bodhi", path_info_pattern!~".*healthz.*"}[20m])))>1
          labels:
            severity: high

This would alert if more than 1% of responses have a 500 status code.
@@ -112,4 +112,96 @@ https://github.com/timescale/promscale

    remote_write:
    - url: "http://promscale:9201/write"
    remote_read:
    - url: "http://promscale:9201/read"
Notes on auxiliary services
---------------------------

Prometheus is primarily targeted at collecting metrics from services
that have been instrumented to expose them. If your service is not
instrumented, or it is not a service at all, e.g. a batch job,
you need an adapter to help you with the metrics collection.

There are two services that help with this.
* `blackbox exporter <https://github.com/prometheus/blackbox_exporter>`_ to monitor services that have not been instrumented, by probing their public endpoints
* `Pushgateway <https://prometheus.io/docs/practices/pushing/#should-i-be-using-the-pushgateway>`_ to collect information from batch jobs

Maintaining the Pushgateway can be delegated to the application developer,
as it is lightweight, and by collecting metrics from the namespace it is running in,
the data will be correctly labeled.
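As a rough sketch, the Pushgateway can run as a container next to the application; the names and the image tag below are illustrative assumptions, not a tested manifest:

::

    - name: pushgateway
      # assumed image reference; pin a concrete, verified tag in practice
      image: quay.io/prometheus/pushgateway:latest
      ports:
      - containerPort: 9091
        name: pushgateway

Batch jobs in the namespace can then push their metrics to port 9091 over HTTP, and Prometheus scrapes the Pushgateway like any other service.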
With the blackbox exporter, it can be beneficial to have it running as a Prometheus sidecar,
in a similar fashion as we configure oauth-proxy, adding this to the containers section
of the Prometheus definition:

::

    - name: blackbox-exporter
      volumeMounts:
      - name: configmap-blackbox
        mountPath: /etc/blackbox-config
      - mountPath: /etc/tls/private
        name: secret-prometheus-k8s-tls
      image: quay.io/prometheus/blackbox-exporter:4.4
      args:
      - '--config.file=/etc/blackbox-config/blackbox.yml'
      ports:
      - containerPort: 9115
        name: blackbox
We can then specify what is to be monitored through configmap-blackbox; you can find `relevant examples <https://github.com/prometheus/blackbox_exporter/blob/master/example.yml>`_ in the project repo.
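For instance, a minimal blackbox configuration that probes an endpoint over HTTP and expects a 2xx response could look like this (module name and timeout chosen for illustration):

::

    modules:
      http_2xx:
        prober: http
        timeout: 5s
        http:
          preferred_ip_protocol: ip4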
Because the blackbox exporter is in the same pod, we need to use the additional-scrape-config to add it in.
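The additional scrape config points Prometheus at the sidecar on localhost and passes the real target as a parameter. A sketch following the blackbox exporter's documented relabeling pattern, with the probed URL as an illustrative assumption:

::

    - job_name: 'blackbox'
      metrics_path: /probe
      params:
        module: [http_2xx]
      static_configs:
      - targets:
        - https://bodhi.fedoraproject.org   # assumed endpoint to probe
      relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115         # the blackbox sidecar in the Prometheus pod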
Notes on alerting
-----------------

Prometheus, as is, can have rules configured that trigger alerts once
a specific query evaluates to true. The definition of the rule is explained in the companion docs
for Prometheus for developers, and the rule can be created in the namespace of the running application.

Here, we need to focus on what happens with an alert after Prometheus decides it should fire,
based on a rule.

In the Prometheus CRD definition, there is a section about the Alertmanager that is supposed to
manage the forwarding of these alerts.
::

    alerting:
      alertmanagers:
      - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
        name: alertmanager-service
        namespace: application-monitoring
        port: web
        scheme: https
        tlsConfig:
          caFile: /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
          serverName: alertmanager-service.application-monitoring.svc
We already have Alertmanager running, configured by the alertmanager-operator.
Alertmanager itself is fairly simple, with a basic UI and API that allow for silencing an
alert for a given amount of time.

It is expected that the actual user interaction happens elsewhere,
either through services like OpsGenie, or e.g. through an `integration with Zabbix <https://devopy.io/setting-up-zabbix-alertmanager-integration/>`_.

More of a build-it-yourself solution is to use e.g. https://karma-dashboard.io/,
but we haven't tried any of these as part of our POC.

To be notified of an alert, you need to have the `correct receiver configuration <https://prometheus.io/docs/alerting/latest/configuration/#email_config>`_ in the Alertmanager secret:
::

    global:
      resolve_timeout: 5m
    route:
      group_by: ['job']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 30m
      receiver: 'email'
    receivers:
    - name: 'email'
      email_configs:
      - to: 'asaleh@redhat.com'