Expected resultsdb notification not found #8011
Reference: Infrastructure/fedora-infrastructure#8011
Good Morning,
I've added a test.yml and a gating.yaml to fedora-gather-easyfix: https://src.fedoraproject.org/rpms/fedora-gather-easyfix/tree/master
I've built a new version with them and created the corresponding bodhi update: https://bodhi.fedoraproject.org/updates/FEDORA-2019-4ffedfe629
Bodhi does see two failed results, one of which is required.
However, looking at datagrepper (https://apps.fedoraproject.org/datagrepper/raw?category=greenwave), the last message is 4 days old.
Do we have any info that the message was sent?
I'm reporting this here in case more people can help debug this.
Sounds like a good test-case to figure out how to track "lost" messages in fedora-messaging :)
It seems like there was something wrong with resultsdb, there were no messages yesterday from 09:22:56 am until 18:02:38:
https://apps.fedoraproject.org/datagrepper/raw?category=resultsdb&delta=127800
The results for that item were all created around 15:20:
https://taskotron.fedoraproject.org/resultsdb_api/api/v2.0/results?item=fedora-gather-easyfix-0.1.1-17.fc30
Checking the Greenwave logs, I also see that Greenwave didn't get any messages from resultsdb during that period.
I would suggest checking whether there was a resultsdb outage or a UMB outage.
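For anyone repeating this check, here is a minimal sketch of how such a silent window could be spotted automatically. It assumes you have already collected the message timestamps as epoch seconds (for instance from the `timestamp` field of each entry in the datagrepper response); the function names and the threshold are hypothetical, not part of any existing tool.

```python
from datetime import datetime, timezone


def find_gaps(timestamps, threshold_seconds):
    """Return (start, end) pairs of consecutive message timestamps
    that are further apart than threshold_seconds."""
    ordered = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > threshold_seconds:
            gaps.append((earlier, later))
    return gaps


def describe(gap):
    """Format one gap as a human-readable UTC time range."""
    start, end = (datetime.fromtimestamp(t, tz=timezone.utc) for t in gap)
    return f"no messages from {start:%H:%M:%S} until {end:%H:%M:%S} UTC"


if __name__ == "__main__":
    # Illustrative timestamps only, not real resultsdb data.
    sample = [0, 10, 20, 4000, 4010]
    for gap in find_gaps(sample, threshold_seconds=3600):
        print(describe(gap))
```

The threshold would need tuning against how often resultsdb normally publishes, since a quiet night could otherwise look like an outage.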
So I did some more debugging of this.
I've created another :
So it seems something is off with resultsdb.
Rephrasing the title now that we've narrowed down the issue.
Metadata Update from @pingou:
I have restarted apache on resultsdb01, then rdbsync, and I already see a message from CI in datagrepper: https://apps.fedoraproject.org/datagrepper/id?id=2019-54bb2f78-f2fa-41e9-8816-2f9a3d289b1a&is_raw=true&size=extra-large
Restarted the process:
So I think this is fixed :)
Metadata Update from @pingou:
Awesome!
Metadata Update from @cverna:
I think that we should not close this until we have added the proper monitoring for resultsdb. If we don't do it now we will never do it 😄
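As a starting point for that monitoring, a rough sketch of a Nagios-style freshness check might look like the following. It only decides a status from the timestamp of the most recent resultsdb message; how that timestamp is obtained (datagrepper, logs, something else) is left open, and the thresholds and names are assumptions picked for illustration.

```python
import time

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_freshness(last_message_ts, now=None,
                    warn_after=3600, crit_after=4 * 3600):
    """Return (status, text) based on the age of the latest message.

    last_message_ts and now are epoch seconds; the default thresholds
    are hypothetical and would need tuning.
    """
    now = time.time() if now is None else now
    age = now - last_message_ts
    if age >= crit_after:
        return CRITICAL, f"CRITICAL: last resultsdb message is {age:.0f}s old"
    if age >= warn_after:
        return WARNING, f"WARNING: last resultsdb message is {age:.0f}s old"
    return OK, f"OK: last resultsdb message is {age:.0f}s old"


if __name__ == "__main__":
    # Illustrative only: pretend the last message arrived two hours ago.
    status, text = check_freshness(time.time() - 2 * 3600)
    print(text)
    raise SystemExit(status)
```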
I ran my debug script today and the issue happened again (no messages from resultsdb). I've tried to add some more debugging to figure out what is going on, and something is definitely wrong there.
Ok, finally found the reason for the error:
ERROR - reasons: Channel is closed.
We'll need some help to fix this
Ok, so there is something weird going on here:
This is what I see in the logs:
However, there is in datagrepper a message about that result: https://apps.fedoraproject.org/datagrepper/id?id=2019-5109396d-0196-4b3a-9230-7363df36b4ac&is_raw=true&size=extra-large
@jcline if you have some time to help us figure out what's going on here, it would be nice :)
And to be complete, that result id is only listed twice in the logs:
It's the ID 30961752 that failed - I cannot find any datagrepper entry for it.
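This kind of cross-check (which result IDs were created but never showed up on the bus) can be sketched as a simple set difference. The function name is hypothetical, and extracting the IDs from the resultsdb API and from the datagrepper messages is assumed to have been done beforehand.

```python
def missing_on_bus(result_ids, message_result_ids):
    """Return the sorted result IDs that have no matching bus message."""
    seen = set(message_result_ids)
    return sorted(rid for rid in set(result_ids) if rid not in seen)


if __name__ == "__main__":
    # Placeholder IDs for illustration, not real resultsdb entries.
    created = [101, 102, 103]
    announced = [101, 103]
    print(missing_on_bus(created, announced))
```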
Oh, shoot you're right!
So the error doesn't trigger all the time, could it be that it's sending too many messages? Some are going through and others are not :(
Metadata Update from @pingou:
Just saw a new stacktrace in the logs:
@abompard @jcline does this ring a bell to you?
Metadata Update from @pingou:
Note that this stacktrace doesn't happen all the time; most often all that is in the logs is:
Channel is closed.
Metadata Update from @pingou:
Yes, this does ring a bell. Which version of fedora-messaging are you running?
I've upgraded to 1.7.1, let's see if this happens again :)
Yes this should fix it, see https://github.com/fedora-infra/fedora-messaging/issues/175
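To answer the "which version are you running?" question on a given host, a small sketch along these lines could help. The 1.7.1 minimum is taken only from this thread (the error went away after upgrading to it), the helper names are hypothetical, and the parser deliberately handles plain dotted numeric versions only.

```python
from importlib import metadata


def parse_version(text):
    """Turn '1.7.1' into (1, 7, 1). Suffixes such as 'rc1' are not handled."""
    return tuple(int(part) for part in text.split("."))


def has_fix(installed, minimum="1.7.1"):
    """True if the installed version is at least the assumed minimum."""
    return parse_version(installed) >= parse_version(minimum)


def installed_version(dist="fedora-messaging"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("fedora-messaging is not installed")
    else:
        print(version, "ok" if has_fix(version) else "older than 1.7.1")
```

Comparing tuples rather than strings matters here: as strings, "1.10.0" would sort before "1.7.1".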
Thanks for the explanation. On the server side, I no longer see the "Channel is closed" error.
Let's consider this fixed then, thanks! :)
Metadata Update from @pingou: