fix parsing errors and sphinx warnings
Signed-off-by: Ryan Lerch <rlerch@redhat.com>
This commit is contained in: parent 8fb9b2fdf0, commit ba720c3d77
98 changed files with 4799 additions and 4788 deletions

Monitoring / Metrics with Prometheus
====================================

For deployment, we used a combination of the prometheus operator and the
application-monitoring operator.

Beware: most of these deployment notes could become obsolete in a really short time. The
POC was done on OpenShift 3.11, which limited us to an older version of the prometheus
operator, as well as to the no longer maintained application-monitoring operator.

In OpenShift 4.x, which we plan to use in the near future, there is a supported way
integrated in the openshift deployment:

- https://docs.openshift.com/container-platform/4.7/monitoring/understanding-the-monitoring-stack.html
- https://docs.openshift.com/container-platform/4.7/monitoring/configuring-the-monitoring-stack.html#configuring-the-monitoring-stack
- https://docs.openshift.com/container-platform/4.7/monitoring/enabling-monitoring-for-user-defined-projects.html

The supported stack is more limited, especially w.r.t. adding user-defined pod- and
service-monitors, but even if we wanted to run additional prometheus instances, we
should be able to skip the installation of the necessary operators, as all of them
should already be present.

Notes on operator deployment
----------------------------

The operator pattern is often used with kubernetes and openshift for more complex
deployments. Instead of applying all of the configuration to deploy your services, you
deploy a special, smaller service called an operator, which has the necessary
permissions to deploy and configure the complex service. Once the operator is running,
instead of configuring the service itself with service-specific config-maps, you create
operator-specific kubernetes objects, so-called CRDs.

The deployment of the operator in question was done by configuring the CRDs, roles,
rolebindings and the operator setup.

The definitions are as follows:

- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/prometheus-operator-crd
- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/rbac/prometheus-operator-crd
- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/rbac/prometheus-operator

Once the operator is correctly running, you just define a prometheus CRD and it will
create a prometheus deployment for you.
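
A minimal sketch of what such a definition could look like (the names and the selector
here are illustrative, not the exact objects from our POC):

.. code-block::

   apiVersion: monitoring.coreos.com/v1
   kind: Prometheus
   metadata:
     name: prometheus                  # hypothetical name
     namespace: application-monitoring
   spec:
     replicas: 1
     serviceAccountName: prometheus    # assumes a service account with the RBAC above
     serviceMonitorSelector:
       matchLabels:
         monitoring-key: middleware    # hypothetical label selecting our servicemonitors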

The POC lives in
https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/openshift-apps/application-monitoring.yml

Notes on application monitoring operator deployment
---------------------------------------------------

The application-monitoring operator was created to solve the integration of Prometheus,
Alertmanager and Grafana. After you configure it, it configures the relevant operators
responsible for these services.

The most interesting difference between configuring this shared operator, compared to
configuring these operators individually, is that it configures some of the
integrations, and it integrates well with openshift's auth system through the oauth
proxy.

The biggest drawback is that the application-monitoring operator is an orphaned project,
but because it mostly configures other operators, it is relatively simple to just
recreate the configuration for both prometheus and alertmanager to be deployed, and
deploy the prometheus and alertmanager operators without the help of the
application-monitoring operator.

Notes on persistence
--------------------

Prometheus by default expects to have a writable /prometheus folder that can serve as
persistent storage.

For the persistent volume to work for this purpose, it **needs to have a
POSIX-compliant filesystem**, and the NFS we currently have configured is not. This is
discussed in the `operational aspects
<https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects>`_ section
of the Prometheus documentation.

The easiest supported way to have a POSIX-compliant filesystem is to `setup
local-storage
<https://docs.openshift.com/container-platform/3.11/install_config/configuring_local.html>`_
in the cluster.

In 4.x versions of OpenShift `there is a local-storage-operator
<https://docs.openshift.com/container-platform/4.7/storage/persistent_storage/persistent-storage-local.html>`_
for this purpose.
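
For reference, the simplest form of such a storage class has no dynamic provisioner at
all; a minimal sketch (this is roughly what the docs linked above walk you through):

.. code-block::

   apiVersion: storage.k8s.io/v1
   kind: StorageClass
   metadata:
     name: local-storage
   provisioner: kubernetes.io/no-provisioner
   volumeBindingMode: WaitForFirstConsumer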

This is the simplest way to have working persistence, but it prevents us from having
multiple instances across openshift nodes, as the pod is using the underlying
filesystem on the node.

To ask the operator to create a persisted prometheus, you specify in its configuration,
e.g.:

.. code-block::

   storage:
     volumeClaimTemplate:
       spec:
         resources:
           requests:
             storage: 10Gi    # illustrative size; adjust to the metrics volume
By default, retention is set to 24 hours and can be overridden.
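
For example, keeping a week of metrics is a one-line change in the prometheus CRD (the
value here is just an illustration):

.. code-block::

   retention: 7d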

Notes on long term storage
--------------------------

Usually, prometheus itself is set up to store its metrics for a shorter amount of time,
and it is expected that for long-term storage and analysis there is some other storage
solution, such as influxdb or timescaledb.

We are currently running a POC that synchronizes Prometheus with Timescaledb (running
on Postgresql) through a middleware service called `promscale
<https://github.com/timescale/promscale>`_.

Promscale just needs access to an appropriate postgresql database, and can be
configured through PROMSCALE_DB_PASSWORD and PROMSCALE_DB_HOST.
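
A minimal sketch of how these variables could be wired into the promscale container
(the secret and service names are hypothetical):

.. code-block::

   env:
     - name: PROMSCALE_DB_HOST
       value: "timescaledb"              # hypothetical database service name
     - name: PROMSCALE_DB_PASSWORD
       valueFrom:
         secretKeyRef:
           name: timescaledb-secret      # hypothetical secret holding the password
           key: password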

By default it will ensure the database has timescaledb installed, and it configures
its database automatically.

We set up prometheus with a directive to use the promscale service as a backend
(https://github.com/timescale/promscale):

.. code-block::

   remote_write:
     - url: "http://promscale:9201/write"

Notes on auxiliary services
---------------------------

As prometheus is primarily targeted at collecting metrics from services that have been
instrumented to expose them, if your service is not instrumented, or it is not a
service, e.g. a batch-job, you need an adapter to help you with the metrics collection.

There are two services that help with this:

- `blackbox exporter <https://github.com/prometheus/blackbox_exporter>`_ to monitor
  services that have not been instrumented, by querying their public API
- `push gateway
  <https://prometheus.io/docs/practices/pushing/#should-i-be-using-the-pushgateway>`_
  that helps collect information from batch-jobs

Maintaining the push-gateway can be relegated to the application developer, as it is
lightweight, and by collecting metrics from the namespace it is running in, the data
will be correctly labeled.
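
A minimal sketch of such a push-gateway deployment in the application's namespace (all
names and the image are illustrative):

.. code-block::

   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: pushgateway                # hypothetical name
   spec:
     replicas: 1
     selector:
       matchLabels:
         app: pushgateway
     template:
       metadata:
         labels:
           app: pushgateway
       spec:
         containers:
           - name: pushgateway
             image: prom/pushgateway  # illustrative image; pin a version in practice
             ports:
               - containerPort: 9091  # default push-gateway port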

With the blackbox exporter, it can be beneficial to have it running as a prometheus
side-car, in a similar fashion as we configure the oauth-proxy, adding this to the
containers section of the prometheus definition:

.. code-block::

   - name: blackbox-exporter
     image: prom/blackbox-exporter    # assumed image; pin the version you deploy
     volumeMounts:
       - name: configmap-blackbox     # assumes the blackbox configmap mounted as a volume
         mountPath: /etc/blackbox
     ports:
       - containerPort: 9115
         name: blackbox

We can then instruct what is to be monitored through the configmap-blackbox; you can
find `relevant examples
<https://github.com/prometheus/blackbox_exporter/blob/master/example.yml>`_ in the
project repo. Because the blackbox exporter is in the same pod, we need to use the
additional-scrape-config to add it in.
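
A sketch of what such an additional scrape config could look like (the job name, module
and target are illustrative; the real probe modules live in configmap-blackbox):

.. code-block::

   - job_name: 'blackbox-http'
     metrics_path: /probe
     params:
       module: [http_2xx]                        # module must exist in the blackbox config
     static_configs:
       - targets:
           - https://example.fedoraproject.org   # hypothetical probed URL
     relabel_configs:
       - source_labels: [__address__]
         target_label: __param_target
       - source_labels: [__param_target]
         target_label: instance
       - target_label: __address__
         replacement: localhost:9115             # the side-car exporter in the same pod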

Notes on alerting
-----------------

Prometheus, as is, can have rules configured that trigger alerts once a specific query
evaluates to true. The definition of the rule is explained in the companion docs for
prometheus for developers, and the rule can be created in the namespace of the running
application.
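
As an illustration, such a rule is a PrometheusRule object; a minimal sketch (the alert
name, namespace and expression are made up for the example):

.. code-block::

   apiVersion: monitoring.coreos.com/v1
   kind: PrometheusRule
   metadata:
     name: example-rules
     namespace: my-application             # hypothetical application namespace
   spec:
     groups:
       - name: example.rules
         rules:
           - alert: ServiceDown
             expr: up{job="my-service"} == 0   # hypothetical job label
             for: 5m
             labels:
               severity: warning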

Here, we need to focus on what happens with an alert after prometheus realizes it
should fire it, based on a rule.

In the prometheus CRD definition, there is a section about the alert-manager that is
supposed to manage the forwarding of these alerts.

.. code-block::

   alerting:
     alertmanagers:
       - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
         name: alertmanager-service
         namespace: application-monitoring
         port: web
         scheme: https
         tlsConfig:
           caFile: /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
           serverName: alertmanager-service.application-monitoring.svc

We already have alertmanager running and configured by the alertmanager-operator.
Alertmanager itself is really simplistic, with a simple UI and API that allow for
silencing an alert for a given amount of time.

It is expected that the actual user-interaction happens elsewhere, either through
services like OpsGenie, or through e.g. `integration with zabbix
<https://devopy.io/setting-up-zabbix-alertmanager-integration/>`_.

More of a build-it-yourself solution is to use e.g. https://karma-dashboard.io/, but we
haven't tried any of these as part of our POC.

To be able to be notified of the alert, you need to have the `correct receiver
configuration <https://prometheus.io/docs/alerting/latest/configuration/#email_config>`_
in the alertmanager's secret:

.. code-block::

   global:
     resolve_timeout: 5m
   route:
     group_by: ['job']
     group_wait: 10s
     group_interval: 10s
     repeat_interval: 30m
     receiver: 'email'
   receivers:
     - name: 'email'
       email_configs:
         - to: 'asaleh@redhat.com'