openqa/worker: kill stuck qemu processes daily
This is an awful hack to deal with https://github.com/os-autoinst/os-autoinst/issues/2549 while we try and fix it properly. It finds stuck qemu processes by parsing the workers' journal messages and kills them. Workers stuck in the broken state should then recover on the next check-in with the server.

I tested this manually on all the worker hosts and it...seemed to work, mostly. I'll keep an eye on things after deploying it.

Signed-off-by: Adam Williamson <awilliam@redhat.com>
parent b288e65343
commit cb026b4120
2 changed files with 10 additions and 0 deletions
roles/openqa/worker/files/kill-stuck-qemu.sh
Executable file
@@ -0,0 +1,7 @@
#!/bin/bash

# this is a hideous hack to find and kill qemu processes stuck as a
# result of https://github.com/os-autoinst/os-autoinst/issues/2549
# which cause workers to be stuck in broken state. affected workers
# should recover some minutes after this script runs
for i in {1..35}; do journalctl -u openqa-worker-plain@$i.service -n 5 | grep "is still running" | grep -o "PID: [0-9]\+" | cut -d" " -f2 | sort -u | xargs kill 2> /dev/null; done
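For review purposes, the PID-extraction half of the one-liner can be pulled out and checked without killing anything. The sketch below is not part of the commit: the function name `extract_stuck_pids` and the sample log lines are hypothetical, and the real os-autoinst messages may be worded differently. The only things taken from the script are the two grep patterns, the `cut` and the `sort -u`.

```shell
#!/bin/bash
# Sketch only: same filter chain as kill-stuck-qemu.sh, minus the kill.

# Reads journal text on stdin and prints each matching PID once, one per line.
# (extract_stuck_pids is a hypothetical name, not something in the commit.)
extract_stuck_pids() {
    grep "is still running" \
        | grep -o "PID: [0-9]\+" \
        | cut -d" " -f2 \
        | sort -u
}

# Hypothetical sample lines; they only need to contain the two patterns
# the script greps for.
sample='worker[1]: QEMU is still running, PID: 1234
worker[1]: unrelated message
worker[1]: QEMU is still running, PID: 1234
worker[1]: QEMU is still running, PID: 987'

# Prints the de-duplicated PIDs from the sample.
printf '%s\n' "$sample" | extract_stuck_pids
```

On a worker host, a dry run for one instance would be `journalctl -u openqa-worker-plain@1.service -n 5 | extract_stuck_pids`. When nothing matches, the function prints nothing; in the committed script that case makes `xargs` run `kill` with no arguments, and the resulting error is what `2> /dev/null` hides. With GNU xargs, `-r` would skip the `kill` entirely.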
@@ -167,6 +167,9 @@
  service: name=rngd enabled=yes state=started
  when: "openqa_rngd is defined and openqa_rngd"

- name: Install cron job to kill stuck qemu processes
  copy: src=kill-stuck-qemu.sh dest=/etc/cron.daily/kill-stuck-qemu owner=root group=root mode=0755

- include_tasks: nfs-client.yml
  when: openqa_nfs_worker|bool