Add monitoring for httpd service on resultsdb #8494

Closed
opened 2020-01-02 18:34:28 +00:00 by mizdebsk · 6 comments

httpd.service on the resultsdb01.qa.fedoraproject.org host crashed and was down for more than 37 hours, yet we didn't get any alert about it. Monitoring of the service should be added to prevent such a long outage from happening in the future.

Jan 01 01:43:23 resultsdb01.qa.fedoraproject.org systemd[1]: httpd.service: A process of this unit has been killed by the OOM killer.
Jan 01 01:43:58 resultsdb01.qa.fedoraproject.org systemd[1]: httpd.service: Failed with result 'oom-kill'.
Jan 01 01:43:58 resultsdb01.qa.fedoraproject.org systemd[1]: httpd.service: Consumed 1h 24min 33.687s CPU time.
Jan 02 14:51:45 resultsdb01.qa.fedoraproject.org systemd[1]: Starting The Apache HTTP Server...
Jan 02 14:51:45 resultsdb01.qa.fedoraproject.org httpd[812580]: [Thu Jan 02 14:51:45.939277 2020] [env:warn] [pid 812580:tid 140307113316672] AH01506: PassEnv variable HOSTNAME was undefined
Jan 02 14:51:46 resultsdb01.qa.fedoraproject.org httpd[812580]: Server configured, listening on: port 80
Jan 02 14:51:46 resultsdb01.qa.fedoraproject.org systemd[1]: Started The Apache HTTP Server.
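
For illustration only, here is a minimal sketch of the kind of probe that would have flagged this outage. It is not the patch attached to this ticket and not how the Fedora nagios setup is actually configured; the function name `check_http`, the timeout default, and the choice to map HTTP error responses to WARNING are all assumptions. It follows the usual Nagios plugin convention of exit codes 0 (OK), 1 (WARNING), and 2 (CRITICAL).

```python
import urllib.error
import urllib.request

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_http(url, timeout=10.0):
    """Probe `url` once and return (exit_code, message).

    Hypothetical helper: a connection failure (e.g. httpd killed by the
    OOM killer and never restarted) is reported as CRITICAL.
    """
    # Assumption: probe the host directly, ignoring proxy environment variables.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return OK, f"HTTP OK: status {resp.status} from {url}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        return WARNING, f"HTTP WARNING: status {exc.code} from {url}"
    except OSError as exc:
        # Covers URLError, refused connections, and timeouts.
        return CRITICAL, f"HTTP CRITICAL: {url} unreachable ({exc})"
```

A wrapper script would call something like `check_http("http://resultsdb01.qa.fedoraproject.org/")`, print the message, and exit with the returned code so that the monitoring system can raise an alert.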

Would something akin to this be sufficient? I’m going by dedf9486721d28637c77f9bf27bd59470c8ebeca.

0001-nagios-Add-httpd-monitoring-for-resultsdb01.patch

Was the issue reviewed?

Oops. I totally missed the update on this one...

That looks like it should be ok, but the hostname has changed, and so much of the nagios configuration has changed that the patch won't apply.

Can someone rebase it and use the new name (resultsdb01.iad2.fedoraproject.org)?

0002-nagios-Add-httpd-monitoring-for-resultsdb01.patch
I'm not yet familiar with the naming scheme here, so I hope it's ok.
I've removed the ssl bit as it seems this service is http only right now, is that correct?

Yep, that looks great. :)

Metadata Update from @kevin:

  • Issue close_status updated to: Fixed
  • Issue status updated to: Closed (was: Open)
Reference: Infrastructure/fedora-infrastructure#8494