fix parsing errors and sphinx warnings

Signed-off-by: Ryan Lerch <rlerch@redhat.com>
Ryan Lerch 2023-11-16 08:02:56 +10:00, committed by zlopez
parent 8fb9b2fdf0
commit ba720c3d77
98 changed files with 4799 additions and 4788 deletions

Monitoring / Metrics with Prometheus
====================================

For deployment, we used a combination of the prometheus operator and the
application-monitoring operator.

Beware, most of these deployment notes may become obsolete in a really short time. The
POC was done on OpenShift 3.11, which limited us to an older version of the prometheus
operator, as well as the no longer maintained application-monitoring operator.

In OpenShift 4.x, which we plan to use in the near future, there is a supported way
integrated into the OpenShift deployment:

- https://docs.openshift.com/container-platform/4.7/monitoring/understanding-the-monitoring-stack.html
- https://docs.openshift.com/container-platform/4.7/monitoring/configuring-the-monitoring-stack.html#configuring-the-monitoring-stack
- https://docs.openshift.com/container-platform/4.7/monitoring/enabling-monitoring-for-user-defined-projects.html

The supported stack is more limited, especially w.r.t. adding user-defined pod- and
service-monitors, but even if we wanted to run additional prometheus instances, we
should be able to skip the installation of the necessary operators, as all of them
should already be present.
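
For illustration (the names below are hypothetical, not taken from the POC), a
user-defined service-monitor is a small CRD telling the operator-managed prometheus
which services to scrape:

.. code-block::

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: my-app
    spec:
      selector:
        matchLabels:
          app: my-app
      endpoints:
        - port: metrics
          interval: 30s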

Notes on operator deployment
----------------------------

The operator pattern is often used with kubernetes and openshift for more complex
deployments. Instead of applying all of the configuration to deploy your services, you
deploy a special, smaller service called an operator, which has the necessary
permissions to deploy and configure the complex service. Once the operator is running,
instead of configuring the service itself with service-specific config-maps, you create
operator-specific kubernetes objects, so-called CRDs.

The deployment of the operator in question was done by configuring the CRDs, roles and
rolebindings, and the operator setup. The definitions are as follows:

- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/prometheus-operator-crd
- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/rbac/prometheus-operator-crd
- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/rbac/prometheus-operator

Once the operator is correctly running, you just define a prometheus CRD and it will
create a prometheus deployment for you. The POC lives in
https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/openshift-apps/application-monitoring.yml
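
A minimal sketch of such a prometheus CRD (the values are illustrative, not the exact
POC configuration):

.. code-block::

    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    metadata:
      name: prometheus
      namespace: application-monitoring
    spec:
      replicas: 1
      serviceAccountName: prometheus
      serviceMonitorSelector: {}
      ruleSelector: {}
      retention: 24h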

Notes on application monitoring operator deployment
---------------------------------------------------

The application-monitoring operator was created to solve the integration of Prometheus,
Alertmanager and Grafana. After you configure it, it configures the relevant operators
responsible for these services.

The most interesting difference between configuring this shared operator, compared to
configuring these operators individually, is that it configures some of the
integrations, and it integrates well with OpenShift's auth system through oauth proxy.

The biggest drawback is that the application-monitoring operator is an orphaned
project, but because it mostly configures other operators, it is relatively simple to
just recreate the configuration for both prometheus and alertmanager, and deploy the
prometheus and alertmanager operators without the help of the application-monitoring
operator.

Notes on persistence
--------------------

Prometheus by default expects to have a writable /prometheus folder that can serve as
persistent storage.

For the persistent volume to work for this purpose, it **needs to have a
POSIX-compliant filesystem**, and the NFS we currently have configured is not. This is
discussed in the `operational aspects
<https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects>`_ section of
the Prometheus documentation.

The easiest supported way to have a POSIX-compliant filesystem is to `set up
local-storage
<https://docs.openshift.com/container-platform/3.11/install_config/configuring_local.html>`_
in the cluster. In 4.x versions of OpenShift there is a `local-storage-operator
<https://docs.openshift.com/container-platform/4.7/storage/persistent_storage/persistent-storage-local.html>`_
for this purpose.

This is the simplest way to have working persistence, but it prevents us from having
multiple instances across openshift nodes, as the pod is using the underlying
filesystem on the node.
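
For reference, the local-storage approach boils down to a StorageClass with no dynamic
provisioner plus pre-created local persistent volumes; a minimal sketch of the storage
class (the name is illustrative):

.. code-block::

    kind: StorageClass
    apiVersion: storage.k8s.io/v1
    metadata:
      name: local-storage
    provisioner: kubernetes.io/no-provisioner
    volumeBindingMode: WaitForFirstConsumer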

To ask the operator to create a persisted prometheus, you specify in its configuration
e.g.:

.. code-block::

    storage:
      volumeClaimTemplate:

By default retention is set to 24 hours and can be overridden.
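
For completeness, a fuller version of that storage stanza in the prometheus CRD spec
could look roughly like this (the storage class name and size are illustrative, not the
POC values):

.. code-block::

    spec:
      retention: 24h
      storage:
        volumeClaimTemplate:
          spec:
            storageClassName: local-storage
            resources:
              requests:
                storage: 10Gi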

Notes on long term storage
--------------------------

Usually, prometheus itself is set up to store its metrics for a shorter amount of time,
and it is expected that for long-term storage and analysis there is some other storage
solution, such as influxdb or timescaledb.

We are currently running a POC that synchronizes Prometheus with TimescaleDB (running
on PostgreSQL) through a middleware service called `promscale
<https://github.com/timescale/promscale>`_.

Promscale just needs access to an appropriate postgresql database and can be configured
through PROMSCALE_DB_PASSWORD and PROMSCALE_DB_HOST. By default it will ensure the
database has timescaledb installed and will configure its database automatically.
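
A minimal sketch of how those variables might be wired into the promscale container in
a deployment (the secret name and database host are hypothetical; port 9201 is the
promscale default used below):

.. code-block::

    containers:
      - name: promscale
        image: timescale/promscale:latest
        env:
          - name: PROMSCALE_DB_HOST
            value: postgresql.application-monitoring.svc
          - name: PROMSCALE_DB_PASSWORD
            valueFrom:
              secretKeyRef:
                name: promscale-db
                key: password
        ports:
          - containerPort: 9201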

We set up prometheus with a directive to use the promscale service
(https://github.com/timescale/promscale) as a backend:

.. code-block::

    remote_write:
      - url: "http://promscale:9201/write"

Notes on auxiliary services
---------------------------

As prometheus is primarily targeted at collecting metrics from services that have been
instrumented to expose them, if your service is not instrumented, or it is not a
service, i.e. a batch-job, you need an adapter to help you with the metrics collection.
There are two services that help with this.

- `blackbox exporter <https://github.com/prometheus/blackbox_exporter>`_ to monitor
  services that have not been instrumented, by querying their public API
- `push gateway
  <https://prometheus.io/docs/practices/pushing/#should-i-be-using-the-pushgateway>`_,
  which helps collect information from batch-jobs

Maintaining the push-gateway can be relegated to the application developer, as it is
lightweight, and by collecting metrics from the namespace it is running in, the data
will be correctly labeled.
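
When prometheus scrapes a push gateway, it is usually configured with honor_labels set
to true, so the job and instance labels pushed by the batch-jobs are kept; a sketch of
such a scrape config (the service name is hypothetical, 9091 is the push gateway
default port):

.. code-block::

    scrape_configs:
      - job_name: pushgateway
        honor_labels: true
        static_configs:
          - targets: ['pushgateway:9091']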

With the blackbox exporter, it can be beneficial to have it running as a prometheus
side-car, in a similar fashion to how we configure oauth-proxy, adding this to the
containers section of the prometheus definition:

.. code-block::

    - name: blackbox-exporter
      volumeMounts:
      ...
      - containerPort: 9115
        name: blackbox

We can then instruct what is to be monitored through the configmap-blackbox; you can
find `relevant examples
<https://github.com/prometheus/blackbox_exporter/blob/master/example.yml>`_ in the
project repo. Because the blackbox exporter is in the same pod, we need to use the
additional-scrape-config to add it in.
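
A sketch of what that additional scrape config could contain, following the standard
relabeling pattern from the blackbox_exporter examples (the probe target is
illustrative; 127.0.0.1:9115 assumes the side-car setup above):

.. code-block::

    - job_name: blackbox
      metrics_path: /probe
      params:
        module: [http_2xx]
      static_configs:
        - targets:
            - https://example.fedoraproject.org
      relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - source_labels: [__param_target]
          target_label: instance
        - target_label: __address__
          replacement: 127.0.0.1:9115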

Notes on alerting
-----------------

Prometheus as-is can have rules configured that trigger alerts once a specific query
evaluates to true. The definition of the rules is explained in the companion docs for
prometheus for developers, and the rules can be created in the namespace of the running
application.
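
For illustration, such a rule is created as a PrometheusRule object in that namespace;
the alert name and expression below are hypothetical:

.. code-block::

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: my-app-alerts
      labels:
        role: alert-rules
    spec:
      groups:
        - name: my-app
          rules:
            - alert: MyAppDown
              expr: up{job="my-app"} == 0
              for: 5m
              labels:
                severity: critical
              annotations:
                summary: "my-app has been down for more than 5 minutes"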

Here, we need to focus on what happens with an alert after prometheus realizes it
should fire it, based on a rule.

In the prometheus CRD definition, there is a section about the alert-manager that is
supposed to manage the forwarding of these alerts:

.. code-block::

    alerting:
      alertmanagers:
        - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
          name: alertmanager-service
          namespace: application-monitoring
          port: web
          scheme: https
          tlsConfig:
            caFile: /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
            serverName: alertmanager-service.application-monitoring.svc

We already have alertmanager running and configured by the alertmanager-operator.

Alertmanager itself is really simplistic, with a simple UI and API that allow for
silencing an alert for a given amount of time.

It is expected that the actual user interaction happens elsewhere, either through
services like OpsGenie, or through e.g. `integration with zabbix
<https://devopy.io/setting-up-zabbix-alertmanager-integration/>`_.

More of a build-it-yourself solution is to use e.g. https://karma-dashboard.io/, but we
haven't tried any of these as part of our POC.

To be able to be notified of the alert, you need to have the `correct receiver
configuration <https://prometheus.io/docs/alerting/latest/configuration/#email_config>`_
in the alertmanager's secret:

.. code-block::

    global:
      resolve_timeout: 5m
    route:
      group_by: ['job']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 30m
      receiver: 'email'
    receivers:
      - name: 'email'
        email_configs:
          - to: 'asaleh@redhat.com'