= Fedora Infrastructure Nagios

== Contact Information

Owner::
  sysadmin-main, sysadmin-noc
Contact::
  #fedora-admin, #fedora-noc
Location::
  Anywhere
Servers::
  noc01, noc02, noc01.stg, batcave01
Purpose::
  This SOP describes the Nagios configuration

== Configuration

The Fedora Project runs two Nagios instances: nagios (noc01) at
https://admin.fedoraproject.org/nagios and nagios-external (noc02) at
https://nagios-external.fedoraproject.org/nagios. You must be in the
'sysadmin' group to access them.

In addition to the two production instances, we currently run a
staging instance for testing purposes, available through SSH at
noc01.stg.

nagios (noc01)::
  The nagios configuration on noc01 should only monitor general host
  statistics: ansible status, uptime, apache status (up/down), SSH, etc.
  +
  The configurations are found in nagios ansible roles:
  * https://pagure.io/fedora-infra/ansible/blob/main/f/roles/nagios_client[ansible/roles/nagios_client]
  * https://pagure.io/fedora-infra/ansible/blob/main/f/roles/nagios_server[ansible/roles/nagios_server]
nagios-external (noc02)::
  The nagios instance on noc02 is located outside of our main
  datacenter and should monitor our user-facing websites/applications
  (fedoraproject.org, FAS, PackageDB, Bodhi/Updates).
  +
  The configurations are found in nagios ansible roles:
  * https://pagure.io/fedora-infra/ansible/blob/main/f/roles/nagios_client[ansible/roles/nagios_client]
  * https://pagure.io/fedora-infra/ansible/blob/main/f/roles/nagios_server[ansible/roles/nagios_server]

[NOTE]
====
Accessing the production and staging instances through SSH: please make
sure you are in the 'sysadmin' and 'sysadmin-noc' FAS groups before
trying to access these hosts.

See xref:sshaccess.adoc[SSH Access SOP]
====

=== NRPE

We currently use NRPE to execute Nagios plugins remotely on any host
in our network.

A good guide to NRPE and its usage, with diagrams of its structure,
can be found at:
https://assets.nagios.com/downloads/nagioscore/docs/nrpe/NRPE.pdf
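
As a rough illustration of how a remote check is invoked, the sketch
below builds and runs a `check_nrpe` command line from Python. The plugin
path, the host name and the remote command name are assumptions for the
example and are not taken from our configuration. Only the standard `-H`
(host), `-c` (command) and `-t` (timeout) options are used.

```python
import subprocess

# Assumed plugin location; it differs between distributions and architectures.
CHECK_NRPE = "/usr/lib64/nagios/plugins/check_nrpe"

# Standard Nagios plugin exit codes.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def build_check_nrpe_argv(host, command, timeout=10):
    """Build the argument list asking the NRPE daemon on `host` to run `command`."""
    return [CHECK_NRPE, "-H", host, "-c", command, "-t", str(timeout)]


def run_check(host, command):
    """Run the remote check and return (state name, plugin output)."""
    proc = subprocess.run(
        build_check_nrpe_argv(host, command),
        capture_output=True,
        text=True,
    )
    return STATES.get(proc.returncode, "UNKNOWN"), proc.stdout.strip()
```

For example, `run_check("proxy01.example.org", "check_disk_/")` would ask
the NRPE daemon on that (hypothetical) host to run whatever command is
registered there under the name `check_disk_/`.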

== Understanding the Messages

=== General

Nagios notifications are generally easy to read, and follow this
consistent format:

....
** PROBLEM/ACKNOWLEDGEMENT/RECOVERY alert - hostname/Check is WARNING/CRITICAL/OK **
** HOST DOWN/UP alert - hostname **
....

Reading the message will provide extra information on what is wrong.
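
If notifications need to be filtered or sorted by a script, the two
subject formats above can be parsed with a pair of regular expressions.
This is a minimal sketch that assumes subjects match the templates
exactly. Real notifications may carry other types or states (for
example UNKNOWN) that are not handled here, and the host name in the
usage note is hypothetical.

```python
import re

# Service alerts: "** <kind> alert - <host>/<check> is <state> **"
SERVICE_RE = re.compile(
    r"^\*\* (?P<kind>PROBLEM|ACKNOWLEDGEMENT|RECOVERY) alert - "
    r"(?P<host>[^/\s]+)/(?P<check>.+) is (?P<state>WARNING|CRITICAL|OK) \*\*$"
)
# Host alerts: "** HOST <state> alert - <host> **"
HOST_RE = re.compile(r"^\*\* HOST (?P<state>DOWN|UP) alert - (?P<host>\S+) \*\*$")


def parse_notification(subject):
    """Return a dict describing the alert, or None if the subject does not match."""
    match = SERVICE_RE.match(subject)
    if match:
        return {"scope": "service", **match.groupdict()}
    match = HOST_RE.match(subject)
    if match:
        return {"scope": "host", **match.groupdict()}
    return None
```

Calling `parse_notification("** HOST DOWN alert - proxy01.example.org **")`
returns a dict with `scope` set to `host`, `state` set to `DOWN` and the
host name under `host`.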

=== Disk Space Warning/Critical

Disk space warnings normally include the following information:

....
DISK WARNING/CRITICAL/OK - free space: mountpoint freespace(MB) (freespace(%) inode=freeinodes(%)):
....

A message stating "(1% inode=99%)" means that 1% of the disk space and
99% of the inodes are free. Disk space is critical, not inode usage, and
this is a sign that more disk space is required.
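
The same reading can be automated. The sketch below assumes the message
follows the format shown above, with the free space reported in MB (or
MiB, as some plugin versions print), and reports whether disk space or
inodes are the scarcer resource. The function name and the sample
messages are illustrative only.

```python
import re

DISK_RE = re.compile(
    r"DISK (?P<state>WARNING|CRITICAL|OK) - free space: "
    r"(?P<mount>\S+) (?P<free_mb>\d+) ?Mi?B "
    r"\((?P<free_pct>\d+(?:\.\d+)?)% inode=(?P<inode_pct>\d+)%\)"
)


def scarce_resource(message):
    """Return (mountpoint, 'disk space' or 'inodes'), or None if unparseable."""
    match = DISK_RE.search(message)
    if match is None:
        return None
    free_pct = float(match.group("free_pct"))
    inode_pct = float(match.group("inode_pct"))
    # Both percentages are *free* amounts, so the lower one is the problem.
    resource = "disk space" if free_pct <= inode_pct else "inodes"
    return match.group("mount"), resource
```

For the example above, `scarce_resource("DISK CRITICAL - free space: / 120 MB (1% inode=99%):")`
reports disk space on `/` as the resource that is running out.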

=== Oncall Handling

Anyone who is currently oncall should be able to acknowledge alerts and
hosts in Nagios. Therefore, their username should be added to these lines
in `roles/nagios_server/templates/nagios/configs/cgi.cfg.j2`:

* `authorized_for_system_commands`
* `authorized_for_all_service_commands`
* `authorized_for_all_host_commands`

It is fine for past oncalls to keep these permissions, so no additional
change is needed at the end of their oncall week.
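
The edit itself is small and is normally done by hand. The sketch below
only illustrates the shape of the change. It assumes each of the three
directives is a plain `name=user1,user2` line, as in a stock Nagios
`cgi.cfg`. If the template builds these values with Jinja2 expressions,
this approach does not apply. The function name and user names are
hypothetical.

```python
DIRECTIVES = (
    "authorized_for_system_commands",
    "authorized_for_all_service_commands",
    "authorized_for_all_host_commands",
)


def add_oncall_user(cfg_text, username):
    """Return cfg_text with username appended to each authorization directive."""
    out = []
    for line in cfg_text.splitlines():
        key, sep, value = line.partition("=")
        # Commented-out lines start with '#', so their key never matches.
        if sep and key.strip() in DIRECTIVES:
            users = [u.strip() for u in value.split(",") if u.strip()]
            if username not in users:
                users.append(username)
            line = f"{key.strip()}={','.join(users)}"
        out.append(line)
    return "\n".join(out) + "\n"
```

Running the function twice with the same user name leaves the text
unchanged, so it is safe to apply for someone who has been oncall before.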

=== Further Reading

* xref:ansible.adoc[Ansible SOP]
* xref:outage.adoc[Outages SOP]