= Fedora Infrastructure Nagios

== Contact Information

Owner::
sysadmin-main, sysadmin-noc
Contact::
#fedora-admin, #fedora-noc
Location::
Anywhere
Servers::
noc01, noc02, noc01.stg, batcave01
Purpose::
This SOP describes the Nagios configurations.

== Configuration

The Fedora Project runs two Nagios instances: nagios (noc01) at
https://admin.fedoraproject.org/nagios and nagios-external (noc02) at
https://nagios-external.fedoraproject.org/nagios. You must be in the
'sysadmin' group to access them.

Apart from the two production instances, we currently run a staging
instance for testing purposes, available through SSH at noc01.stg.

nagios (noc01)::
The Nagios configuration on noc01 should only monitor general host
statistics: Ansible status, uptime, Apache status (up/down), SSH, etc.
+
The configurations are found in the Nagios Ansible roles:
* https://pagure.io/fedora-infra/ansible/blob/main/f/roles/nagios_client[ansible/roles/nagios_client]
* https://pagure.io/fedora-infra/ansible/blob/main/f/roles/nagios_server[ansible/roles/nagios_server]
nagios-external (noc02)::
The nagios-external instance on noc02 is located outside of our main
datacenter and should monitor our user-facing websites and applications
(fedoraproject.org, FAS, PackageDB, Bodhi/Updates).
+
The configurations are found in the Nagios Ansible roles:
* https://pagure.io/fedora-infra/ansible/blob/main/f/roles/nagios_client[ansible/roles/nagios_client]
* https://pagure.io/fedora-infra/ansible/blob/main/f/roles/nagios_server[ansible/roles/nagios_server]

[NOTE]
====
To reach the production and staging instances through SSH, make sure you
are in the 'sysadmin' and 'sysadmin-noc' FAS groups before trying to
access these hosts.

See xref:sshaccess.adoc[SSH Access SOP]
====

=== NRPE

We currently use NRPE to execute remote Nagios plugins on any host in
our network.

A guide to NRPE and its usage, with diagrams of its structure, can be
found at:
https://assets.nagios.com/downloads/nagioscore/docs/nrpe/NRPE.pdf

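As a rough illustration of what an NRPE check looks like from the monitoring side, the sketch below builds a `check_nrpe` command line and maps the plugin exit code to a Nagios state. The plugin path, the 10-second timeout, and the helper names `build_command` and `run_check` are assumptions for illustration and are not taken from our Ansible roles; the exit-code mapping is the standard Nagios plugin convention.

```python
import subprocess

# Standard Nagios plugin exit codes.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

# Assumed install location; adjust to wherever check_nrpe lives on the host.
CHECK_NRPE = "/usr/lib64/nagios/plugins/check_nrpe"


def build_command(host, check, timeout=10):
    """Build the check_nrpe argument list for one remote check."""
    return [CHECK_NRPE, "-H", host, "-c", check, "-t", str(timeout)]


def run_check(host, check):
    """Run the remote check and return (state name, plugin output)."""
    result = subprocess.run(
        build_command(host, check), capture_output=True, text=True
    )
    return STATES.get(result.returncode, "UNKNOWN"), result.stdout.strip()
```

Calling `run_check("noc01.stg", "check_load")` would only work from a host that the remote NRPE daemon allows, and `check_load` here is a hypothetical command name.
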
== Understanding the Messages

=== General

Nagios notifications are generally easy to read and follow this
consistent format:

....
** PROBLEM/ACKNOWLEDGEMENT/RECOVERY alert - hostname/Check is WARNING/CRITICAL/OK **
** HOST DOWN/UP alert - hostname **
....

Reading the full message will provide extra information on what is wrong.

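If you want to filter or count notifications, the format above can be matched with two regular expressions. This is a minimal sketch that assumes the subject line follows the format exactly as shown; the function name `parse_subject` and the sample hostnames in the usage note are illustrative only.

```python
import re

# Service notifications: "** PROBLEM alert - host/Check is CRITICAL **"
SERVICE_RE = re.compile(
    r"^\*\* (?P<type>PROBLEM|ACKNOWLEDGEMENT|RECOVERY) alert - "
    r"(?P<host>[^/\s]+)/(?P<check>.+) is (?P<state>WARNING|CRITICAL|OK) \*\*$"
)
# Host notifications: "** HOST DOWN alert - host **"
HOST_RE = re.compile(r"^\*\* HOST (?P<state>DOWN|UP) alert - (?P<host>\S+) \*\*$")


def parse_subject(subject):
    """Return the fields of a notification subject, or None if it does not match."""
    subject = subject.strip()
    match = SERVICE_RE.match(subject)
    if match:
        return {"kind": "service", **match.groupdict()}
    match = HOST_RE.match(subject)
    if match:
        return {"kind": "host", **match.groupdict()}
    return None
```

For example, `parse_subject("** HOST DOWN alert - noc02 **")` yields a dictionary with `kind` set to `host`, `state` set to `DOWN`, and `host` set to `noc02`.
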
=== Disk Space Warning/Critical

Disk space warnings normally include the following information:

....
DISK WARNING/CRITICAL/OK - free space: mountpoint freespace(MB) (freespace(%) inode=freeinodes(%)):
....

A message stating "(1% inode=99%)" means that free disk space, not inode
usage, is critical: only 1% of the space is free while 99% of the inodes
are. This is a sign that more disk space is required.

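The same reading can be sketched in code: compare the two percentages and report whichever resource has less free. This assumes the message matches the format shown above, with whole-number values; the function name `scarce_resource` and the sample messages are hypothetical.

```python
import re

DISK_RE = re.compile(
    r"DISK (?P<state>WARNING|CRITICAL|OK) - free space: "
    r"(?P<mount>\S+) (?P<free_mb>\d+) ?MB "
    r"\((?P<free_pct>\d+)% inode=(?P<inode_pct>\d+)%\)"
)


def scarce_resource(message):
    """Return "disk space" or "inodes", whichever has the lower free percentage.

    Returns None when the message does not match the assumed format.
    """
    match = DISK_RE.search(message)
    if match is None:
        return None
    free_pct = int(match.group("free_pct"))
    inode_pct = int(match.group("inode_pct"))
    return "disk space" if free_pct <= inode_pct else "inodes"
```

With a message such as `DISK CRITICAL - free space: / 120 MB (1% inode=99%):`, the function reports `disk space`, which matches the interpretation given above.
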
=== Oncall Handling

Anyone who is currently oncall should be able to acknowledge alerts and
hosts in Nagios. Therefore, their username should be added to these lines
in `roles/nagios_server/templates/nagios/configs/cgi.cfg.j2`:

* `authorized_for_system_commands`
* `authorized_for_all_service_commands`
* `authorized_for_all_host_commands`

It is fine for past oncalls to keep these permissions, so no additional
change is needed at the end of their oncall week.

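The edit itself is small enough to do by hand, but as a sketch of what changes: each of the three directives holds a comma-separated list of usernames, and the new oncall's username is appended if it is not already present. The code below assumes plain `key=user1,user2` lines with no Jinja2 expressions on them, which may not hold for the real template; `add_oncall_user` and the usernames are hypothetical.

```python
DIRECTIVES = (
    "authorized_for_system_commands",
    "authorized_for_all_service_commands",
    "authorized_for_all_host_commands",
)


def add_oncall_user(config_text, username):
    """Append username to each directive's comma-separated list if it is missing."""
    updated_lines = []
    for line in config_text.splitlines():
        key, sep, value = line.partition("=")
        # Only touch the three authorization directives; leave other lines as-is.
        if sep and key.strip() in DIRECTIVES:
            users = [user.strip() for user in value.split(",") if user.strip()]
            if username not in users:
                users.append(username)
            line = f"{key.strip()}={','.join(users)}"
        updated_lines.append(line)
    return "\n".join(updated_lines) + "\n"
```

Because the function skips usernames that are already listed, running it again for the same person leaves the file unchanged.
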
=== Further Reading

* xref:ansible.adoc[Ansible SOP]
* xref:outage.adoc[Outages SOP]