Added the infra SOPs ported to asciidoc.

2021-07-26 10:39:47 +02:00 · a0301e30f1
commit a0301e30f1
parent 8a7f111a12
148 changed files with 18575 additions and 17 deletions
--- a/modules/sysadmin_guide/pages/nagios.adoc
+++ b/modules/sysadmin_guide/pages/nagios.adoc
@@ -0,0 +1,87 @@
+= Fedora Infrastructure Nagios
+
+== Contact Information
+
+Owner::
+  sysadmin-main, sysadmin-noc
+Contact::
+  #fedora-admin, #fedora-noc
+Location::
+  Anywhere
+Servers::
+  noc01, noc02, noc01.stg, batcave01
+Purpose::
+  This SOP describes the Nagios configuration
+
+== Configuration
+
+The Fedora Project runs two nagios instances: nagios (noc01) at
+https://admin.fedoraproject.org/nagios and nagios-external (noc02) at
+https://nagios-external.fedoraproject.org/nagios. You must be in the
+'sysadmin' group to access them.
+
+Apart from the two production instances, we are currently running a
+staging instance for testing purposes, available through SSH at
+noc01.stg.
+
+nagios (noc01)::
+  The nagios configuration on noc01 should only monitor general host
+  statistics: ansible status, uptime, apache status (up/down), SSH, etc.
+  +
+  The configuration is found in the nagios ansible role:
+  ansible/roles/nagios
+nagios-external (noc02)::
+  The nagios instance on noc02 is located outside of our main
+  datacenter and should monitor our user-facing websites/applications
+  (fedoraproject.org, FAS, PackageDB, Bodhi/Updates).
+  +
+  The configuration is found in the nagios ansible role: roles/nagios
+
+[NOTE]
+.Note
+====
+To access the production and staging instances through SSH, make sure
+you are in the 'sysadmin' and 'sysadmin-noc' FAS groups before trying
+to connect to these hosts.
+
+See the SSH Access SOP.
+====
+
+=== NRPE
+
+We currently use NRPE to execute Nagios plugins remotely on any host
+in our network.
+
+A guide to NRPE and its usage, with diagrams of its structure,
+can be found at:
+https://assets.nagios.com/downloads/nagioscore/docs/nrpe/NRPE.pdf
+
+== Understanding the Messages
+
+=== General
+
+Nagios notifications are generally easy to read, and follow this
+consistent format:
+
+....
+** PROBLEM/ACKNOWLEDGEMENT/RECOVERY alert - hostname/Check is WARNING/CRITICAL/OK **
+** HOST DOWN/UP alert - hostname **
+....
+
+Reading the message will provide extra information on what is wrong.
+
+=== Disk Space Warning/Critical
+
+Disk space warnings normally include the following information:
+
+....
+DISK WARNING/CRITICAL/OK - free space: mountpoint freespace(MB) (freespace(%) inode=freeinodes(%)):
+....
+
+A message stating "(1% inode=99%)" means that disk space, not inode usage,
+is critical (1% of space free, 99% of inodes free); more disk space is required.
+
+== Further Reading
+
+* Ansible SOP
+* Outages SOP