From cb026b41202f455db56e9f442beae543b258426b Mon Sep 17 00:00:00 2001
From: Adam Williamson
Date: Tue, 15 Oct 2024 13:13:42 -0700
Subject: [PATCH] openqa/worker: kill stuck qemu processes daily

This is an awful hack to deal with
https://github.com/os-autoinst/os-autoinst/issues/2549 while we
try and fix it properly. This finds stuck qemu processes by
parsing the journal messages of the workers, and kills them.
workers stuck in the broken state should then recover on the
next checkin with the server.

I tested this manually on all the worker hosts and it...seemed
to work, mostly. I'll keep an eye on things after deploying it.

Signed-off-by: Adam Williamson
---
 roles/openqa/worker/files/kill-stuck-qemu.sh | 7 +++++++
 roles/openqa/worker/tasks/main.yml           | 3 +++
 2 files changed, 10 insertions(+)
 create mode 100755 roles/openqa/worker/files/kill-stuck-qemu.sh

diff --git a/roles/openqa/worker/files/kill-stuck-qemu.sh b/roles/openqa/worker/files/kill-stuck-qemu.sh
new file mode 100755
index 0000000000..05e0a0c61b
--- /dev/null
+++ b/roles/openqa/worker/files/kill-stuck-qemu.sh
@@ -0,0 +1,7 @@
+#!/bin/bash
+
+# this is a hideous hack to find and kill qemu processes stuck as a
+# result of https://github.com/os-autoinst/os-autoinst/issues/2549
+# which cause workers to be stuck in broken state. affected workers
+# should recover some minutes after this script runs
+for i in {1..35}; do journalctl -u openqa-worker-plain@$i.service -n 5 | grep "is still running" | grep -o "PID: [0-9]\+" | cut -d" " -f2 | sort -u | xargs kill 2> /dev/null; done
diff --git a/roles/openqa/worker/tasks/main.yml b/roles/openqa/worker/tasks/main.yml
index ad5f4088d1..713d0b1af3 100644
--- a/roles/openqa/worker/tasks/main.yml
+++ b/roles/openqa/worker/tasks/main.yml
@@ -167,6 +167,9 @@
   service: name=rngd enabled=yes state=started
   when: "openqa_rngd is defined and openqa_rngd"
 
+- name: Install cron job to kill stuck qemu processes
+  copy: src=kill-stuck-qemu.sh dest=/etc/cron.daily/kill-stuck-qemu owner=root group=root mode=0755
+
 - include_tasks: nfs-client.yml
   when: openqa_nfs_worker|bool
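
Note (not part of the patch): the one-liner in kill-stuck-qemu.sh packs the journal parsing and the kill into a single pipeline. The sketch below splits the same steps into two functions so the parsing can be checked on its own. The names `extract_stuck_pids` and `kill_stuck_qemu` and the `--dry-run` switch are hypothetical and do not exist in the patch; the grep patterns, the unit name and the 1..35 range are copied from the script as committed. Treat it as an illustration under those assumptions, not as a proposed replacement.

```shell
#!/bin/bash
# Sketch only: the same steps as kill-stuck-qemu.sh, split up for readability.

# Read worker journal lines on stdin and print the unique PIDs found in
# "is still running" messages, one per line (same filters as the patch).
extract_stuck_pids() {
    grep "is still running" | grep -o "PID: [0-9]\+" | cut -d" " -f2 | sort -u
}

# Hypothetical wrapper: loop over the worker instances the patch covers.
# With --dry-run it prints what it would kill instead of killing anything.
kill_stuck_qemu() {
    local dry_run="${1:-}" i pids
    for i in {1..35}; do
        pids=$(journalctl -u "openqa-worker-plain@$i.service" -n 5 2>/dev/null | extract_stuck_pids)
        # Nothing reported for this worker: move on without calling kill.
        [ -z "$pids" ] && continue
        if [ "$dry_run" = "--dry-run" ]; then
            echo "worker $i: would kill" $pids
        else
            # Unquoted on purpose: each PID becomes its own argument.
            kill $pids 2>/dev/null
        fi
    done
}

# Usage (run as root on a worker host):
#   kill_stuck_qemu --dry-run   # list candidates only
#   kill_stuck_qemu             # send SIGTERM, as the patch does
```

Skipping workers with no matches avoids invoking kill with no arguments, which the one-liner tolerates only because stderr is discarded. The parsing still relies on the last five journal lines per worker, so a PID named there is not guaranteed to still be the stuck qemu by the time the job runs.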