Notes on Prometheus research.
commit ed508e7b9b (parent 7efca65b49), 3 changed files with 148 additions and 34 deletions
@ -28,10 +28,26 @@ In process we want to be able to answer the questions posed in the latest mailin
- Can Zabbix handle our number of machines?
- How flexible is the alerting?

Main takeaway
-------------

We managed to create proof-of-concept monitoring solutions with both Prometheus and Zabbix.

The initial configuration has proven to have more pitfalls than expected,
with Prometheus especially in the integration with OpenShift and its other auxiliary services,
and with Zabbix especially in correctly setting up the iptables and network permissions
and in configuring a reasonable setup for user-access and user-account management.

Despite these setbacks, we still feel this would be an improvement over our current setup based on Nagios.

To get a basic overview of Prometheus, you can watch this short tech-talk by Adam Saleh
(accessible only to Red Hat): https://drive.google.com/file/d/1-uEIkS2jaJ2b8V_4y-AKW1J6sdZzzlc9/view
or read the more in-depth report in the relevant sections of this documentation.

.. toctree::
   :maxdepth: 1

   prometheus
   prometheus_for_ops
   prometheus_for_dev
   faq

@ -1,36 +1,3 @@

Monitoring / Metrics with Prometheus
====================================

For deployment, we used a combination of the prometheus operator and the application-monitoring operator.

Beware: most of these deployment notes could become obsolete in a really short time.
The POC was done on OpenShift 3.11, which limited us to an older version of the prometheus operator,
as well as the no-longer-maintained application-monitoring operator.

In OpenShift 4.x, which we plan to use in the near future, there is a supported way integrated in the OpenShift deployment:

* https://docs.openshift.com/container-platform/4.7/monitoring/understanding-the-monitoring-stack.html
* https://docs.openshift.com/container-platform/4.7/monitoring/configuring-the-monitoring-stack.html#configuring-the-monitoring-stack
* https://docs.openshift.com/container-platform/4.7/monitoring/enabling-monitoring-for-user-defined-projects.html

The supported stack is more limited, especially w.r.t. adding user-defined pod- and service-monitors, but even if we wanted to
run additional prometheus instances, we should be able to skip the installation of the necessary operators, as all of them should already be present.

Notes on operator deployment
----------------------------

The deployment in question was done by configuring the CRDs, the roles and rolebindings, and the operator setup.

The definitions are as follows:

- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/prometheus-operator-crd
- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/rbac/prometheus-operator-crd
- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/rbac/prometheus-operator

Once the operator is correctly running, you just define a Prometheus CRD and it will create a Prometheus deployment for you.

The POC lives in https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/openshift-apps/application-monitoring.yml

Notes on application monitoring self-service
---------------------------------------------
@ -96,3 +63,19 @@ The alerts themselves would be the routed for further processing and notificatio

these are not available to change from the developers' namespaces.

The managing and acknowledging of the alerts can be done in Alertmanager in a rudimentary fashion.

Notes on instrumenting the application
--------------------------------------

Prometheus expects the applications it scrapes metrics from
to be services, with a '/metrics' endpoint exposing metrics in the correct
format.

There are libraries that help with this for many different languages,
confusingly called client libraries, even though they usually export metrics as an HTTP server endpoint:
https://prometheus.io/docs/instrumenting/clientlibs/

As part of the proof of concept we have instrumented the Bodhi application
to collect data through the prometheus_client Python library:
https://github.com/fedora-infra/bodhi/pull/4079
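
The following is a minimal, self-contained sketch of what the prometheus_client library provides
(illustrative only; the metric name and port are made up, and this is not the Bodhi code from the PR above):

::

    import time

    from prometheus_client import Counter, start_http_server

    # a counter that a real application would increment from its request handlers
    REQUESTS = Counter("myapp_requests_total", "Total requests handled.")

    if __name__ == "__main__":
        # serve the metrics in the Prometheus text format on http://localhost:8000/metrics
        start_http_server(8000)
        while True:
            REQUESTS.inc()   # stand-in for real work
            time.sleep(1)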

115 docs/monitoring_metrics/prometheus_for_ops.rst (new file)

@ -0,0 +1,115 @@

Monitoring / Metrics with Prometheus
====================================

For deployment, we used a combination of the prometheus operator and the application-monitoring operator.

Beware: most of these deployment notes could become obsolete in a really short time.
The POC was done on OpenShift 3.11, which limited us to an older version of the prometheus operator,
as well as the no-longer-maintained application-monitoring operator.

In OpenShift 4.x, which we plan to use in the near future, there is a supported way integrated in the OpenShift deployment:

* https://docs.openshift.com/container-platform/4.7/monitoring/understanding-the-monitoring-stack.html
* https://docs.openshift.com/container-platform/4.7/monitoring/configuring-the-monitoring-stack.html#configuring-the-monitoring-stack
* https://docs.openshift.com/container-platform/4.7/monitoring/enabling-monitoring-for-user-defined-projects.html

The supported stack is more limited, especially w.r.t. adding user-defined pod- and service-monitors, but even if we wanted to
run additional prometheus instances, we should be able to skip the installation of the necessary operators, as all of them should already be present.
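
Per the enabling-monitoring-for-user-defined-projects document linked above, switching on the user-workload
monitoring in OpenShift 4.7 is a single ConfigMap change; a minimal sketch:

::

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-monitoring-config
      namespace: openshift-monitoring
    data:
      config.yaml: |
        # enables the user-workload Prometheus instances managed by the cluster
        enableUserWorkload: true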


Notes on operator deployment
----------------------------

The operator pattern is often used with Kubernetes and OpenShift for more complex deployments.
Instead of applying all of the configuration to deploy your services, you deploy a special,
smaller service called an operator, which has the necessary permissions to deploy and configure the complex service.
Once the operator is running, instead of configuring the service itself with service-specific config-maps,
you create operator-specific Kubernetes objects, defined by so-called CRDs.

The deployment of the operator in question was done by configuring the CRDs, the roles and rolebindings, and the operator setup.

The definitions are as follows:

- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/prometheus-operator-crd
- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/rbac/prometheus-operator-crd
- https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/rbac/prometheus-operator

Once the operator is correctly running, you just define a Prometheus CRD and it will create a Prometheus deployment for you.
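
As an illustration, a minimal Prometheus custom resource could look roughly like this (the name, service account
and selector label are placeholders, not the exact objects used in the POC):

::

    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    metadata:
      name: prometheus
    spec:
      replicas: 1
      # service account with permissions to discover scrape targets (illustrative name)
      serviceAccountName: prometheus
      # pick up ServiceMonitor objects carrying this (illustrative) label
      serviceMonitorSelector:
        matchLabels:
          monitoring-key: application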

The POC lives in https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/openshift-apps/application-monitoring.yml


Notes on application monitoring operator deployment
----------------------------------------------------

The application-monitoring operator was created to solve the integration of Prometheus, Alertmanager and Grafana.
After you configure it, it configures the relevant operators responsible for these services.

The most interesting difference between configuring this shared operator
and configuring these operators individually is that it configures some of the integrations,
and it integrates well with OpenShift's auth system through the oauth proxy.

The biggest drawback is that the application-monitoring operator is an orphaned project,
but because it mostly configures other operators, it is relatively simple to just recreate
the configuration for both Prometheus and Alertmanager to be deployed,
and to deploy the prometheus and alertmanager operators without the help of the application-monitoring operator.

Notes on persistence
--------------------

Prometheus by default expects to have a writable /prometheus folder
that can serve as persistent storage.

For the persistent volume to work for this purpose, it
**needs to have a POSIX-compliant filesystem**, and the NFS we currently have configured is not.
This is discussed in the `operational aspects <https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects>`_
section of the Prometheus documentation.

The easiest supported way to have a POSIX-compliant `filesystem is to setup local-storage <https://docs.openshift.com/container-platform/3.11/install_config/configuring_local.html>`_
in the cluster.

In 4.x versions of OpenShift `there is a local-storage-operator <https://docs.openshift.com/container-platform/4.7/storage/persistent_storage/persistent-storage-local.html>`_ for this purpose.

This is the simplest way to have working persistence, but it prevents us from having multiple instances
across OpenShift nodes, as the pod is using the underlying filesystem on the node.
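
For illustration, a local persistent volume for Prometheus might be declared roughly as follows
(the storage class name, path and node name are placeholders, not values from our cluster):

::

    kind: StorageClass
    apiVersion: storage.k8s.io/v1
    metadata:
      name: local
    provisioner: kubernetes.io/no-provisioner
    volumeBindingMode: WaitForFirstConsumer
    ---
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: prometheus-local-pv
    spec:
      capacity:
        storage: 10Gi
      accessModes:
        - ReadWriteOnce
      persistentVolumeReclaimPolicy: Retain
      storageClassName: local
      local:
        # directory on the node that backs the volume
        path: /mnt/local-storage/prometheus
      nodeAffinity:
        required:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                    - node-1.example.com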

To ask the operator to create a persisted Prometheus, you specify it in the Prometheus configuration, e.g.:

::

    retention: 24h
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: local
          resources:
            requests:
              storage: 10Gi

By default retention is set to 24 hours and can be overridden.


Notes on long term storage
--------------------------

Usually, prometheus itself is set up to store its metrics for a shorter amount of time,
and it is expected that for long-term storage and analysis there is some other storage solution,
such as InfluxDB or TimescaleDB.

We are currently running a POC that synchronizes Prometheus with TimescaleDB (running on PostgreSQL)
through a middleware service called `promscale <https://github.com/timescale/promscale>`_ .

Promscale just needs access to an appropriate PostgreSQL database
and can be configured through PROMSCALE_DB_PASSWORD and PROMSCALE_DB_HOST.

By default it will ensure the database has timescale installed and configures its database
automatically.
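
As an illustration, the promscale container can be pointed at the database through those environment variables;
a rough sketch of the container part of its deployment (image tag, host and secret names are placeholders):

::

    containers:
      - name: promscale
        image: timescale/promscale:latest
        ports:
          - containerPort: 9201
        env:
          - name: PROMSCALE_DB_HOST
            value: "postgresql.example.com"
          - name: PROMSCALE_DB_PASSWORD
            valueFrom:
              secretKeyRef:
                name: promscale-db
                key: password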

We set up prometheus with a directive to use the promscale service as a backend:
https://github.com/timescale/promscale

::

    remote_write:
      - url: "http://promscale:9201/write"
    remote_read:
      - url: "http://promscale:9201/read"