
Monitoring / Metrics
========================

As an ARC team initiative, we want to investigate Prometheus and Zabbix
as our new monitoring and metrics solutions by:

 -  Installing a Zabbix server in a VM and hooking up the staging dist-git to it with an agent
 -  Installing Prometheus in our OpenShift and collecting metrics for a selected project in a self-service fashion

Prior POCs/deployments
----------------------

Fabian Arrotin deployed and uses Zabbix in the CentOS infrastructure.
 - https://github.com/CentOS/ansible-role-zabbix-server

Adam Saleh deployed a POC Prometheus setup for the CoreOS team.
 - https://pagure.io/centos-infra/issue/112

David Kirwan was part of the development team of https://github.com/integr8ly/application-monitoring-operator/ and did some POC work around the Prometheus Pushgateway in CentOS OpenShift.

Investigation
-------------

In the process, we want to be able to answer the questions posed in the latest mailing list thread and, by the end, have a setup that can lead directly into migrating us away from Nagios. The questions (mostly from Kevin):

 -  How can we provision both of them automatically from Ansible?
 -  Can we get Zabbix to pull from Prometheus?
 -  Can Zabbix handle our number of machines?
 -  How flexible is the alerting?

Main takeaway
-------------

We managed to create proof-of-concept monitoring solutions with both Prometheus and Zabbix.

The initial configuration proved to have more pitfalls than expected:
with Prometheus, especially in the integration with OpenShift and its other auxiliary services,
and with Zabbix, especially in correctly setting up iptables and network permissions
and in configuring a reasonable setup for user access and user account management.

Despite these setbacks, we still feel this would be an improvement over our current Nagios-based setup.

To get a basic overview of Prometheus, you can watch this short tech talk by Adam Saleh
(accessible only within Red Hat): https://drive.google.com/file/d/1-uEIkS2jaJ2b8V_4y-AKW1J6sdZzzlc9/view
or read the more in-depth report in the relevant sections of this documentation.

.. toctree::
    :maxdepth: 1

    prometheus_for_ops
    prometheus_for_dev
    faq