OpenShift app monitoring with Nagios #7588

Closed
opened 2019-02-22 13:46:36 +00:00 by mizdebsk · 14 comments

I would like to implement monitoring for OpenShift apps using Nagios. I know there are plans to replace Nagios with something else, but that hasn't happened yet and Nagios is already in place. For me this is a blocker for moving Koschei to OpenShift: I'm not comfortable running Koschei in production without monitoring that is integrated with our existing alert system (email/IRC notifications).

I would like to start by monitoring the number of pods. Nagios would count the pods matching a configured selector and compare that count with a configured range of expected values. The result of the check would be defined as follows:

  • if the number of pods is within the expected range: OK
  • if the number of pods is equal to zero: CRITICAL
  • otherwise: WARNING

Example:

  • configured selector: pods in namespace "koschei" with label "service: frontend", in state "running"
  • configured expected number of pods: range from 2 to 3
  • 0 matching pods -> CRITICAL
  • 1 matching pod -> WARNING
  • 2 to 3 matching pods -> OK
  • 4 or more matching pods -> WARNING
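
The rules above map directly onto the standard Nagios plugin exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL). A minimal sketch of that mapping in Python follows; the function name `pod_count_status` and its parameters are hypothetical, not taken from any existing plugin, and the rules are assumed to apply in the order listed (so a range that includes zero would report OK for zero pods).

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def pod_count_status(count, minimum, maximum):
    """Map a pod count to a Nagios status (hypothetical helper).

    Rules are applied in the order given in the proposal:
    within the expected range -> OK, zero pods -> CRITICAL,
    anything else -> WARNING.
    """
    if minimum <= count <= maximum:
        return OK
    if count == 0:
        return CRITICAL
    return WARNING


if __name__ == "__main__":
    # The example from the proposal: expected range from 2 to 3.
    names = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}
    for count in range(5):
        print(count, names[pod_count_status(count, 2, 3)])
```

A real plugin would print a one-line status message and exit with the returned code, for example via `sys.exit(pod_count_status(count, 2, 3))`.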

Implementation: a Nagios plugin, non-NRPE. A service account would be created for Nagios, with minimal privileges that allow it to list pods but nothing else. Credentials for the account would be stored on noc01 and noc02. The plugin would use the Kubernetes REST API to communicate with OpenShift. noc01 would talk directly to each of the masters using internal addresses/names; noc02 would talk to OpenShift over the public interface.
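
As a rough illustration of the REST call such a plugin could make, here is a sketch using only the Python standard library. It assumes the pod-list endpoint of the Kubernetes core API (`/api/v1/namespaces/<namespace>/pods`) with `labelSelector` and `fieldSelector` query parameters, and a bearer token for the service account. The function names, the 10-second timeout, and any host name passed in are hypothetical and not part of the proposal.

```python
import json
import ssl
import urllib.parse
import urllib.request


def build_pods_url(api_base, namespace, label_selector):
    """Build the pod-list URL for one namespace, restricted to running pods."""
    query = urllib.parse.urlencode({
        "labelSelector": label_selector,          # e.g. "service=frontend"
        "fieldSelector": "status.phase=Running",  # only pods in phase Running
    })
    return "{}/api/v1/namespaces/{}/pods?{}".format(
        api_base.rstrip("/"), urllib.parse.quote(namespace, safe=""), query)


def count_pods(pod_list):
    """Count the entries in a decoded PodList response."""
    return len(pod_list.get("items", []))


def fetch_pod_count(api_base, namespace, label_selector, token, ca_file=None):
    """Query the API with the service account token and return the pod count.

    api_base, token and ca_file are supplied by the caller; nothing here
    is specific to the Fedora infrastructure.
    """
    request = urllib.request.Request(
        build_pods_url(api_base, namespace, label_selector),
        headers={
            "Authorization": "Bearer " + token,
            "Accept": "application/json",
        },
    )
    context = ssl.create_default_context(cafile=ca_file)
    with urllib.request.urlopen(request, timeout=10, context=context) as resp:
        return count_pods(json.load(resp))
```

Note that a pod in phase Running is not necessarily ready to serve traffic, so a stricter check might inspect pod conditions instead. A real plugin would also need to catch network and HTTP errors and report them as UNKNOWN (exit code 3).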

What do you think about this idea?


This sounds like a good idea. One plugin I looked at was:

https://github.com/appuio/nagios-plugins-openshift

Another example was

https://github.com/jmferrer/nagios-openshift


Sounds good to me. Either a basic script or leveraging one of those plugins...

Author

Metadata Update from @mizdebsk:

  • Issue assigned to mizdebsk
Author

> This sounds like a good idea. The plugins I looked at was: https://github.com/appuio/nagios-plugins-openshift
> Another example was https://github.com/jmferrer/nagios-openshift

Of the two plugins above, I prefer nagios-plugins-openshift. Its approach is almost the same as mine; one difference is that it uses the oc command to communicate with OpenShift, while I would use curl. If we want to use this plugin, I can try to package it and build it for epel7-infra (I don't want to maintain this package in EPEL 7 myself). Alternatively, I can write my own plugin and put it in ansible.git. We can talk about this during one of the future meetings.

Author

Metadata Update from @mizdebsk:

  • Issue priority set to: Waiting on Assignee (was: Next Meeting)
Author

Nagios is frozen. I'll try to work on this ticket after final freeze (F30 GA).

Author

Metadata Update from @mizdebsk:

  • Issue tagged with: unfreeze
Author

Update: the freeze is over now, I am planning to work on this issue some time next week.

Author

Metadata Update from @mizdebsk:

  • Issue untagged with: unfreeze
Author

Currently I don't have time to work on this due to other priorities and an upcoming vacation. The lack of monitoring is still blocking Koschei from moving to OpenShift, so I would still like this feature to be implemented, but it will need to wait a few months unless someone else wants to work on it.


@mizdebsk I believe you have done this for Koschei. Is there a short how-to for doing the same for other applications?


Metadata Update from @cverna:

  • Assignee reset

Going to close this, as we aren't moving on it and it should be rolled into the monitoring initiative.


Metadata Update from @smooge:

  • Issue close_status updated to: Initiative Worthy
  • Issue status updated to: Closed (was: Open)
Reference: Infrastructure/fedora-infrastructure#7588