Improve monitoring for container registry #9231
Reference: Infrastructure/fedora-infrastructure#9231
The container registry is listed as having an Important SLE, yet one of our registries was down for about 11 hours (see #9230 for details) and we did not get any Nagios notification about the issue.
Monitoring should be improved so that we are notified about this kind of issue sooner.
I would like to work on it.
Metadata Update from @smooge:
@mizdebsk I think we can get a notification when a systemd-monitored service enters the failed state if we add OnFailure= to the unit. For more details about this option, see https://www.freedesktop.org/software/systemd/man/systemd.unit.html#Specifiers
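As a rough illustration of the OnFailure= idea, the sketch below builds the text of a unit drop-in that would start a handler unit whenever the service fails. The handler name notify-failure@.service and the use of a drop-in file are assumptions made for this example and do not come from the thread; %n is the systemd specifier for the full name of the unit that failed.

```python
def render_onfailure_dropin(handler_template="notify-failure@%n.service"):
    """Return the text of a drop-in that starts a handler unit on failure.

    The default handler name is hypothetical; substitute whatever unit
    actually sends the notification.
    """
    # %n expands to the full name of the failing unit, so one templated
    # handler unit can serve many services.
    return "[Unit]\nOnFailure=" + handler_template + "\n"


# Show the drop-in text that would be written for the monitored service.
print(render_onfailure_dropin(), end="")
```

The handler unit itself still has to exist and do the notifying; this fragment only wires the failure event to it.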
We have an existing Nagios setup that would be trivial to configure to cover the OCI registry, for example by adding checks for:
Hint: the relevant file in ansible.git is roles/nagios_server/templates/nagios/services/websites.cfg.j2
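As one possible kind of check (not necessarily among those the comment had in mind), a Nagios-style probe could request the registry's /v2/ API base endpoint and map the HTTP status to a plugin exit code. This is a minimal sketch under stated assumptions: the function names, the choice of endpoint, and treating 401 as healthy are illustrative, and the real setup may simply reuse an existing HTTP check defined in the template above.

```python
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes used in this sketch.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def classify(status):
    """Map the HTTP status of GET /v2/ to (exit_code, message)."""
    # 200 means the registry API answered; 401 means it answered but
    # requires authentication. Both show the service is up (assumption).
    if status in (200, 401):
        return OK, f"OK - registry API answered with HTTP {status}"
    return CRITICAL, f"CRITICAL - registry API answered with HTTP {status}"


def check_registry(base_url, timeout=10.0):
    """Probe <base_url>/v2/ and return (exit_code, message)."""
    url = base_url.rstrip("/") + "/v2/"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status)
    except urllib.error.HTTPError as exc:
        # urlopen raises on 4xx/5xx, but the status code is still usable.
        return classify(exc.code)
    except OSError as exc:
        # Covers URLError, refused connections, DNS failures and timeouts.
        return CRITICAL, f"CRITICAL - cannot reach {url}: {exc}"


def main(argv):
    """Print a one-line status and return the Nagios exit code."""
    if len(argv) != 2:
        print("UNKNOWN - usage: check_registry.py BASE_URL")
        return UNKNOWN
    code, message = check_registry(argv[1])
    print(message)
    return code
```

To use it as a plugin, the script would end with raise SystemExit(main(sys.argv)) after importing sys, and Nagios would call it with the registry's base URL as the only argument.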
Thank you very much @mizdebsk and @seddik for the pointers. I will work on it after work today.
Take your time. We are in beta freeze, so changes to monitoring will need to wait until the freeze ends, or follow the FBR SOP.
Hi,
Could you give an update?
I can work on this if needed.
Monitoring for the container registry is still needed, and patches are welcome. Please let me know if you need any help with implementing this.
Related PR fedora-infra/ansible#321 has been merged.
The change has been deployed and can be seen e.g. here and here.
Thank you for your contribution. This issue is resolved.
Metadata Update from @mizdebsk: