OpenShift app monitoring with Nagios #7588
Reference: Infrastructure/fedora-infrastructure#7588
I would like to implement monitoring for OpenShift apps using Nagios. I know there are plans to replace Nagios with something else, but that hasn't happened yet, and Nagios is already in place. For me this is a blocker for moving Koschei to OpenShift: I am not comfortable running Koschei in production without monitoring that is integrated with our existing alert system (email/IRC notifications).
I would like to start with monitoring the number of pods. Nagios would check the number of pods matching a configured selector and compare it with a configured range of expected values. The result of the check would be defined as follows:
Example:
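One plausible reading of the proposed check semantics, sketched in Python; the exit codes are the standard Nagios convention, and the mapping of out-of-range counts to states (zero pods as CRITICAL, other deviations as WARNING) is an assumption for illustration, not the ticket's actual definition:

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

def classify(pod_count, expected_min, expected_max):
    """Map a pod count to a Nagios state (assumed semantics): OK when the
    count is inside the configured range, CRITICAL when no pods are running
    at all, WARNING for any other out-of-range count."""
    if expected_min <= pod_count <= expected_max:
        return OK
    if pod_count == 0:
        return CRITICAL
    return WARNING
```

For instance, with an expected range of 2 to 4 pods, a count of 3 would be OK, a count of 6 would be WARNING, and a count of 0 would be CRITICAL.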
Implementation: a Nagios plugin, non-NRPE. A service account would be created for Nagios with minimal privileges that allow it to list pods, but nothing else. Credentials for the account would be stored on noc01 and noc02. The Nagios plugin would use the Kubernetes REST API to communicate with OpenShift: noc01 would talk directly to each of the masters using internal addresses/names, while noc02 would talk to OpenShift over the public interface.
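The REST API side of the plugin could be sketched as below; the API URL, bearer-token handling, namespace, and selector shown are placeholder assumptions, not the real infrastructure configuration:

```python
import json
import urllib.request
from urllib.parse import quote

def pods_url(api_url, namespace, selector):
    """Build the Kubernetes list-pods endpoint for a label selector."""
    return (f"{api_url}/api/v1/namespaces/{namespace}/pods"
            f"?labelSelector={quote(selector)}")

def count_pods(api_url, token, namespace, selector):
    """Count pods matching the selector via the Kubernetes REST API,
    authenticating with the service account's bearer token."""
    req = urllib.request.Request(
        pods_url(api_url, namespace, selector),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return len(json.load(resp)["items"])
```

The plugin would then compare the returned count against the configured range and exit with the corresponding Nagios status code.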
What do you think about this idea?
This sounds like a good idea. The plugins I looked at were:
https://github.com/appuio/nagios-plugins-openshift
Another example was
https://github.com/jmferrer/nagios-openshift
Sounds good to me. Either a basic script or leveraging one of those plugins...
Metadata Update from @mizdebsk:
Of the two plugins above I like nagios-plugins-openshift better. The approach it uses is almost the same as mine; one difference is that it uses the oc command to communicate with OpenShift, while I would use curl. If we want to use this plugin, I can try to package it and build it for epel7-infra (I don't want to maintain this package in EPEL 7 myself). Or I can write my own plugin and put it in ansible.git. We can talk about this during one of the future meetings.
Metadata Update from @mizdebsk:
Nagios is frozen. I'll try to work on this ticket after final freeze (F30 GA).
Metadata Update from @mizdebsk:
Update: the freeze is over now; I am planning to work on this issue some time next week.
Metadata Update from @mizdebsk:
Currently I don't have time to work on this due to other priorities and an upcoming vacation. The lack of monitoring is still blocking Koschei from moving to OpenShift, and I would still like this feature to be implemented, but it will need to wait a few months unless someone else wants to work on it.
@mizdebsk I believe you have done this for Koschei; is there a small "how to" for doing the same for other applications?
Metadata Update from @cverna:
Going to close this, as we aren't moving on it and it should be rolled into the monitoring initiative.
Metadata Update from @smooge: