OpenShift app monitoring with Nagios

mizdebsk commented

2019-02-22 13:46:36 +00:00

I would like to implement monitoring for OpenShift apps using Nagios. I know there are plans to replace Nagios with something else, but that hasn't happened yet and Nagios is already there. For me this is a blocker for moving Koschei to OpenShift: I don't feel comfortable running Koschei in production without monitoring that is integrated with our existing alert system (email/IRC notifications).

I would like to start with monitoring the number of pods. Nagios would check the number of pods matching a configured selector and compare it with a configured range of expected numbers. The result of the check would be defined as follows:

- if the number of pods is within the expected range: OK
- if the number of pods is equal to zero: CRITICAL
- otherwise: WARNING

Example:

- configured selector: pods in namespace "koschei" with label "service: frontend", in state "running"
- configured expected number of pods: range from 2 to 3
- 0 matching pods -> CRITICAL
- 1 matching pod -> WARNING
- 2 to 3 matching pods -> OK
- 4 or more matching pods -> WARNING
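
The rule above can be sketched as a small Python function. This is only an illustration, not part of the proposal: the function name and message wording are hypothetical, and the numeric values are the standard Nagios plugin exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL). The range check comes first, so a range that includes zero would still report OK for zero pods.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def evaluate_pod_count(count, expected_min, expected_max):
    """Map a pod count to a Nagios (exit_code, message) pair.

    Hypothetical helper; the rule order follows the proposal above.
    """
    # Within the expected range: everything is fine.
    if expected_min <= count <= expected_max:
        return OK, f"OK - {count} matching pods"
    # Outside the range and no pods at all: the service is down.
    if count == 0:
        return CRITICAL, "CRITICAL - no matching pods"
    # Too few or too many pods, but at least one exists.
    return WARNING, (
        f"WARNING - {count} matching pods, "
        f"expected {expected_min} to {expected_max}"
    )
```

A plugin built around this would print the message and exit with the returned code, for example `code, message = evaluate_pod_count(1, 2, 3)` followed by `print(message)` and `sys.exit(code)`.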

Implementation: Nagios plugin, non-NRPE. A service account would be created for Nagios. The account would have minimal privileges: it would be allowed to list pods, but nothing else. Credentials for the account would be stored on noc01 and noc02. The Nagios plugin would use the Kubernetes REST API to communicate with OpenShift. noc01 would talk directly to each of the masters using internal addresses/names. noc02 would talk to OpenShift over the public interface.
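
As a rough sketch of the REST side, the pod list for a namespace can be requested from the Kubernetes API path `/api/v1/namespaces/<namespace>/pods` with a `labelSelector` query parameter, authenticating with the service account's bearer token. The example below uses Python's `urllib` rather than any particular tool, and all function names, the host name, and the CA file handling are assumptions made for illustration only. Note that a pod in phase `Running` is not necessarily ready to serve requests.

```python
import json
import ssl
import urllib.parse
import urllib.request


def build_pods_url(api_base, namespace, label_selector):
    """Build the pod-list URL for one namespace, filtered by label selector."""
    query = urllib.parse.urlencode({"labelSelector": label_selector})
    return f"{api_base.rstrip('/')}/api/v1/namespaces/{namespace}/pods?{query}"


def count_running_pods(pod_list):
    """Count items in a PodList document whose status.phase is 'Running'."""
    return sum(
        1
        for pod in pod_list.get("items", [])
        if pod.get("status", {}).get("phase") == "Running"
    )


def fetch_pod_list(url, token, ca_file=None):
    """GET the pod list, authenticating with the service account token."""
    request = urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/json",
        },
    )
    # ca_file is a hypothetical path to the cluster CA bundle; None falls
    # back to the system trust store.
    context = ssl.create_default_context(cafile=ca_file)
    with urllib.request.urlopen(request, context=context, timeout=10) as resp:
        return json.load(resp)
```

With a placeholder host such as `https://openshift.example.org:8443`, `build_pods_url(host, "koschei", "service=frontend")` gives the URL to pass to `fetch_pod_list`, and the result of `count_running_pods` is the number to compare against the expected range.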

What do you think about this idea?

smooge commented

2019-02-22 18:10:50 +00:00

This sounds like a good idea. One plugin I looked at was:

https://github.com/appuio/nagios-plugins-openshift

Another example was

https://github.com/jmferrer/nagios-openshift

kevin commented

2019-02-23 20:16:36 +00:00

Sounds good to me. Either a basic script or leveraging one of those plugins...

mizdebsk commented

2019-02-25 19:02:48 +00:00

Author

Metadata Update from @mizdebsk:

Issue assigned to mizdebsk

mizdebsk commented

2019-02-25 19:53:29 +00:00

Author

> This sounds like a good idea. One plugin I looked at was: https://github.com/appuio/nagios-plugins-openshift
> Another example was https://github.com/jmferrer/nagios-openshift

Of the two plugins above, I like nagios-plugins-openshift better. Its approach is almost the same as mine. One difference is that it uses the `oc` command to communicate with OpenShift, while I would use `curl`. If we want to use this plugin, I can try to package it and build it for epel7-infra (I don't want to maintain this package in EPEL 7 myself). Alternatively, I can write my own plugin and put it in ansible.git. We can talk about this during one of the upcoming meetings.

mizdebsk commented

2019-02-28 15:29:18 +00:00

Author

Metadata Update from @mizdebsk:

Issue priority set to: Waiting on Assignee (was: Next Meeting)

mizdebsk commented

2019-03-09 22:13:49 +00:00

Author

Nagios is frozen. I'll try to work on this ticket after the final freeze (F30 GA).

mizdebsk commented

2019-03-09 22:13:50 +00:00

Author

Metadata Update from @mizdebsk:

Issue tagged with: unfreeze

mizdebsk commented

2019-05-08 06:57:57 +00:00

Author

Update: the freeze is over now, and I am planning to work on this issue some time next week.

mizdebsk commented

2019-05-08 06:57:57 +00:00

Author

Metadata Update from @mizdebsk:

Issue untagged with: unfreeze

mizdebsk commented

2019-06-18 14:56:41 +00:00

Author

Currently I don't have time to work on this due to other priorities and an upcoming vacation. Lack of monitoring is still blocking Koschei from moving to OpenShift, so I would still like this feature to be implemented, but it will need to wait a few months unless someone else wants to work on it.