openqa/worker: kill stuck qemu processes daily

This is an awful hack to deal with
https://github.com/os-autoinst/os-autoinst/issues/2549 while we
try to fix it properly. It finds stuck qemu processes by
parsing the workers' journal messages and kills them.
Workers stuck in the broken state should then recover on their
next check-in with the server. I tested this manually on all the
worker hosts and it...seemed to work, mostly. I'll keep an eye on
things after deploying it.

Signed-off-by: Adam Williamson <awilliam@redhat.com>
Adam Williamson 2024-10-15 13:13:42 -07:00
parent b288e65343
commit cb026b4120
2 changed files with 10 additions and 0 deletions

@@ -0,0 +1,7 @@
#!/bin/bash
# this is a hideous hack to find and kill qemu processes stuck as a
# result of https://github.com/os-autoinst/os-autoinst/issues/2549
# which cause workers to be stuck in broken state. affected workers
# should recover some minutes after this script runs
for i in {1..35}; do journalctl -u openqa-worker-plain@$i.service -n 5 | grep "is still running" | grep -o "PID: [0-9]\+" | cut -d" " -f2 | sort -u | xargs kill 2> /dev/null; done
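
The one-liner above can be hard to check before letting it loose on a worker host, so here is a minimal sketch of the same pipeline split into two functions with a dry-run switch. The function names `extract_stuck_pids` and `kill_stuck_qemu` and the `DRY_RUN` variable are hypothetical and not part of this commit; the grep patterns, the unit name, the `-n 5` window and the 1..35 instance range are taken from the script as committed. It also avoids invoking `kill` with no arguments when nothing matches, which GNU `xargs` does unless given `-r`.

```shell
#!/bin/bash
# Sketch only: a dry-run-capable variant of the committed one-liner.
# extract_stuck_pids, kill_stuck_qemu and DRY_RUN are hypothetical names.

# Read journal text on stdin and print each PID that is reported as
# still running, one per line, without duplicates.
extract_stuck_pids() {
    grep "is still running" | grep -o "PID: [0-9]\+" | cut -d" " -f2 | sort -u
}

# Check the last 5 journal lines of worker instances 1..$1 (default 35,
# as in the committed loop). With DRY_RUN set, print the PIDs instead of
# killing them.
kill_stuck_qemu() {
    local max="${1:-35}" i pid
    for ((i = 1; i <= max; i++)); do
        journalctl -u "openqa-worker-plain@${i}.service" -n 5 2>/dev/null |
            extract_stuck_pids |
            while read -r pid; do
                if [[ -n "${DRY_RUN:-}" ]]; then
                    echo "would kill ${pid} (worker instance ${i})"
                else
                    # The process may already be gone; ignore that case.
                    kill "${pid}" 2>/dev/null || true
                fi
            done
    done
    return 0
}
```

Running `DRY_RUN=1 kill_stuck_qemu` on a worker host should list what the cron job would kill without touching anything; calling `kill_stuck_qemu` with `DRY_RUN` unset behaves like the committed script.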

@@ -167,6 +167,9 @@
  service: name=rngd enabled=yes state=started
  when: "openqa_rngd is defined and openqa_rngd"
- name: Install cron job to kill stuck qemu processes
  copy: src=kill-stuck-qemu.sh dest=/etc/cron.daily/kill-stuck-qemu owner=root group=root mode=0755
- include_tasks: nfs-client.yml
  when: openqa_nfs_worker|bool