Notes on Prometheus research.

This commit is contained in:
Adam Saleh 2021-04-14 11:46:27 +02:00
parent 7efca65b49
commit ed508e7b9b
3 changed files with 148 additions and 34 deletions


@ -28,10 +28,26 @@ In process we want to be able to answer the questions posed in the latest mailin
- Can Zabbix handle our number of machines?
- How flexible is the alerting?
Main takeaway
-------------
We managed to create proof-of-concept monitoring solutions with both Prometheus and Zabbix.
The initial configuration has proven to have more pitfalls than expected:
with Prometheus especially in the integration with OpenShift and its other auxiliary services,
and with Zabbix especially in correctly setting up iptables and network permissions,
and in configuring a reasonable setup for user access and user-account management.
Despite these setbacks, we still feel this would be an improvement over our current Nagios-based setup.
To get a basic overview of Prometheus, you can watch this short tech talk by Adam Saleh
(accessible only to Red Hat): https://drive.google.com/file/d/1-uEIkS2jaJ2b8V_4y-AKW1J6sdZzzlc9/view
or read the more in-depth report in the relevant sections of this documentation.

.. toctree::
   :maxdepth: 1

   prometheus
   prometheus_for_ops
   prometheus_for_dev
   faq


@ -1,36 +1,3 @@
Monitoring / Metrics with Prometheus
====================================
For deployment, we used a combination of the prometheus-operator and the application-monitoring operator.
Beware: most of these deployment notes could become obsolete in a very short time.
The POC was done on OpenShift 3.11, which limited us to an older version of the prometheus-operator,
as well as to the no-longer-maintained application-monitoring operator.
In OpenShift 4.x, which we plan to use in the near future, there is a supported way integrated into the OpenShift deployment:
* https://docs.openshift.com/container-platform/4.7/monitoring/understanding-the-monitoring-stack.html
* https://docs.openshift.com/container-platform/4.7/monitoring/configuring-the-monitoring-stack.html#configuring-the-monitoring-stack
* https://docs.openshift.com/container-platform/4.7/monitoring/enabling-monitoring-for-user-defined-projects.html
The supported stack is more limited, especially with respect to adding user-defined pod- and service-monitors, but even if we wanted to
run additional Prometheus instances, we should be able to skip the installation of the necessary operators, as all of them should already be present.
Notes on operator deployment
----------------------------
The deployment in question was done by configuring the CRDs, the roles and role bindings, and the operator itself.
The definitions are as follows:
- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/prometheus-operator-crd
- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/rbac/prometheus-operator-crd
- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/rbac/prometheus-operator
Once the operator is running correctly, you just define a Prometheus custom resource and the operator will create the Prometheus deployment for you.
The POC lives in https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/openshift-apps/application-monitoring.yml
Notes on application monitoring self-service
---------------------------------------------
@ -96,3 +63,19 @@ The alerts themselves would be the routed for further processing and notificatio
these are not available to change from the developers' namespaces.
Managing and acknowledging the alerts can be done in Alertmanager in a rudimentary fashion.
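As a rough, hypothetical sketch of such centrally managed routing (the receiver names and webhook URL are made up for illustration, not the actual Fedora Infra configuration), an Alertmanager routing section looks roughly like this:

::

  route:
    receiver: default
    group_by: ['alertname', 'namespace']
    routes:
      - match:
          severity: critical
        receiver: fedora-infra-notifications
  receivers:
    - name: default
    - name: fedora-infra-notifications
      webhook_configs:
        - url: 'http://example-notifier:9000/alert'
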
Notes on instrumenting the application
--------------------------------------
Prometheus expects the applications it scrapes metrics from
to be services, with a '/metrics' endpoint exposed that serves metrics in the correct
format.
There are libraries that help with this for many different languages,
confusingly called client libraries, even though they usually export metrics as an HTTP server endpoint:
https://prometheus.io/docs/instrumenting/clientlibs/
As part of the proof of concept we have instrumented the Bodhi application
to collect data through the prometheus_client Python library:
https://github.com/fedora-infra/bodhi/pull/4079
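For illustration only, a minimal sketch of such instrumentation with prometheus_client (the metric name and port are made up for the example; Bodhi's real instrumentation is in the pull request above):

::

  from prometheus_client import Counter, start_http_server
  import time

  # The metric name and help text are purely illustrative.
  REQUESTS = Counter("myapp_requests_total", "Total requests handled by the app")

  if __name__ == "__main__":
      # Serves the metrics in the Prometheus text format on http://localhost:8000/metrics
      start_http_server(8000)
      while True:
          REQUESTS.inc()  # count one unit of work
          time.sleep(1)
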


@ -0,0 +1,115 @@
Monitoring / Metrics with Prometheus
====================================
For deployment, we used a combination of the prometheus-operator and the application-monitoring operator.
Beware: most of these deployment notes could become obsolete in a very short time.
The POC was done on OpenShift 3.11, which limited us to an older version of the prometheus-operator,
as well as to the no-longer-maintained application-monitoring operator.
In OpenShift 4.x, which we plan to use in the near future, there is a supported way integrated into the OpenShift deployment:
* https://docs.openshift.com/container-platform/4.7/monitoring/understanding-the-monitoring-stack.html
* https://docs.openshift.com/container-platform/4.7/monitoring/configuring-the-monitoring-stack.html#configuring-the-monitoring-stack
* https://docs.openshift.com/container-platform/4.7/monitoring/enabling-monitoring-for-user-defined-projects.html
The supported stack is more limited, especially with respect to adding user-defined pod- and service-monitors, but even if we wanted to
run additional Prometheus instances, we should be able to skip the installation of the necessary operators, as all of them should already be present.
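As a sketch of what the last linked page describes (check the documentation for the authoritative form for the exact cluster version), enabling monitoring for user-defined projects on OpenShift 4.6+ comes down to a small config map:

::

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: cluster-monitoring-config
    namespace: openshift-monitoring
  data:
    config.yaml: |
      enableUserWorkload: true
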
Notes on operator deployment
----------------------------
The operator pattern is often used with Kubernetes and OpenShift for more complex deployments.
Instead of applying all of the configuration to deploy your services yourself, you deploy a special,
smaller service called an operator, which has the necessary permissions to deploy and configure the complex service.
Once the operator is running, instead of configuring the service itself with service-specific config maps,
you create operator-specific Kubernetes objects, defined by so-called CRDs (custom resource definitions).
The deployment of the operator in question was done by configuring the CRDs, the roles and role bindings, and the operator itself.
The definitions are as follows:
- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/prometheus-operator-crd
- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/rbac/prometheus-operator-crd
- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/rbac/prometheus-operator
Once the operator is running correctly, you just define a Prometheus custom resource and the operator will create the Prometheus deployment for you.
The POC lives in https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/openshift-apps/application-monitoring.yml
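To give an idea of the shape of that custom resource, a minimal sketch could look like the following (the name, namespace, service account and selector label are placeholders, not the exact values used in the POC):

::

  apiVersion: monitoring.coreos.com/v1
  kind: Prometheus
  metadata:
    name: application-monitoring
    namespace: application-monitoring
  spec:
    replicas: 1
    # service account bound to the RBAC from the examples linked above
    serviceAccountName: prometheus
    # scrape every ServiceMonitor carrying this label
    serviceMonitorSelector:
      matchLabels:
        monitoring-key: application
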
Notes on application monitoring operator deployment
---------------------------------------------------
The application-monitoring operator was created to solve the integration of Prometheus, Alertmanager and Grafana.
After you configure it, it configures the relevant operators responsible for these services.
The most interesting difference between configuring this shared operator
and configuring these operators individually is that it sets up some of the integrations for you,
and it integrates well with OpenShift's auth system through the OAuth proxy.
The biggest drawback is that the application-monitoring operator is an orphaned project,
but because it mostly configures other operators, it is relatively simple to just recreate
the configuration for both Prometheus and Alertmanager
and deploy the prometheus and alertmanager operators without the help of the application-monitoring operator.
Notes on persistence
--------------------
By default, Prometheus expects to have a writable /prometheus folder
that can serve as persistent storage.
For a persistent volume to work for this purpose, it
**needs to have a POSIX-compliant filesystem**, and the NFS we currently have configured is not.
This is discussed in the `operational aspects <https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects>`_
section of the Prometheus documentation.
The easiest supported way to have a POSIX-compliant `filesystem is to set up local-storage <https://docs.openshift.com/container-platform/3.11/install_config/configuring_local.html>`_
in the cluster.
In 4.x versions of OpenShift `there is a local-storage-operator <https://docs.openshift.com/container-platform/4.7/storage/persistent_storage/persistent-storage-local.html>`_ for this purpose.
This is the simplest way to have working persistence, but it prevents us from having multiple instances
across OpenShift nodes, as the pod is using the underlying filesystem on the node.
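For illustration, a no-provisioner storage class backing such local volumes might be declared as follows (the class name 'local' is only an assumption, chosen to match the storageClassName used below):

::

  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    name: local
  provisioner: kubernetes.io/no-provisioner
  volumeBindingMode: WaitForFirstConsumer
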
To ask the operator to create a persisted Prometheus, you specify in its configuration, e.g.:

::

  retention: 24h
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: local
        resources:
          requests:
            storage: 10Gi

By default the retention is set to 24 hours and can be overridden.
Notes on long term storage
--------------------------
Usually, Prometheus itself is set up to store its metrics for a shorter amount of time,
and it is expected that for long-term storage and analysis there is some other storage solution,
such as InfluxDB or TimescaleDB.
We are currently running a POC that synchronizes Prometheus with TimescaleDB (running on PostgreSQL)
through a middleware service called `promscale <https://github.com/timescale/promscale>`_.
Promscale just needs access to an appropriate PostgreSQL database
and can be configured through the PROMSCALE_DB_PASSWORD and PROMSCALE_DB_HOST environment variables.
By default it will ensure the database has TimescaleDB installed and configures its database
automatically.
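A minimal sketch of how promscale could be run next to the database (the deployment name, image tag, database service name and secret are assumptions for illustration, not the POC's exact manifest):

::

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: promscale
  spec:
    replicas: 1
    selector:
      matchLabels:
        app: promscale
    template:
      metadata:
        labels:
          app: promscale
      spec:
        containers:
          - name: promscale
            image: timescale/promscale:latest
            ports:
              - containerPort: 9201  # serves the /write and /read endpoints Prometheus uses
            env:
              - name: PROMSCALE_DB_HOST
                value: postgresql
              - name: PROMSCALE_DB_PASSWORD
                valueFrom:
                  secretKeyRef:
                    name: promscale-db
                    key: password
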
We set up Prometheus with a directive to use the promscale service as a remote read/write backend:
::

  remote_write:
    - url: "http://promscale:9201/write"
  remote_read:
    - url: "http://promscale:9201/read"