Remove the "Restarting Koji" and OOM sections, update the disk space section, and add a new section about checking the builders

Signed-off-by: Pedro Moura <pmoura@redhat.com>
2023-01-11 21:31:00 -03:00 · c3bdce539a
commit c3bdce539a
parent c91ad6257f
1 changed files with 11 additions and 114 deletions
--- a/modules/sysadmin_guide/pages/koji.adoc
+++ b/modules/sysadmin_guide/pages/koji.adoc
@@ -17,11 +17,8 @@ machines to do their work.
 * <<_troubleshooting_and_resolution>>
 ** <<_restarting_koji>>
 ** <<_kojid_wont_start_or_some_builders_wont_connect>>
-** <<_oom_out_of_memory_issues>>
-*** <<_increase_memory>>
-*** <<_decrease_weight>>
 ** <<_disk_space_issues>>
-
+** <<_checking_builders_are_all_checking_in_correctly>>
 == Contact Information

 Owner::
@@ -64,26 +61,6 @@ koji tag-pkg dist-$release-override <package_nvr>

 == Troubleshooting and Resolution

-=== Restarting Koji
-
-If for some reason koji needs to be restarted, make sure to restart the
-koji master first, then the builders. If the koji master has been down
-for a short enough time the builders do not need to be restarted.:
-
-....
-service httpd restart
-service kojira restart
-service kojid restart
-....
-
-[IMPORTANT]
-====
-If postgres becomes interrupted in some way, koji will need to be
-restarted. As long as the koji master daemon gets restarted the builders
-should reconnect automatically. If the db server has been restarted and
-the builders don't seem to be building, restart their daemons as well.
-====
-
 === kojid won't start or some builders won't connect

 In the event that some items are able to connect to koji while some are
@@ -92,115 +69,35 @@ This is common if koji crashes and the db connections aren't properly
 cleared. Upon restart many of the connections are full so koji cannot
 reconnect. Clearing old connections is easy, guess about how long it the
 new koji has been up and pick a number of minutes larger then that and
-kill those queries. From _db3_ as _postgres_ run:
+kill those queries. From _db-koji01_ as _postgres_ run:

 ....
 echo "select procpid from pg_stat_activity where usename='koji' and now() - query_start \
 >= '00:40:00' order by query_start;" | psql koji | grep "^  " | xargs kill
 ....

-=== OOM (Out of Memory) Issues
-
-Out of memory issues occur from time to time on the build machines.
-There are a couple of options for correction. The first fix is to just
-restart the machine and hope it was a one time thing. If the problem
-continues please choose from one of the following options.
-
-==== Increase Memory
-
-The xen machines can have memory increased on their corresponding xen
-hosts. At present this is the table:
-
-[width="34%",cols="44%,56%",]
-|===
-|xen3 |xenbuilder1
-|xen4 |xenbuilder2
-|disabled |xenbuilder3
-|xen8 |xenbuilder4
-|===
-
-Edit `/etc/xen/xenbuilder[1-4]` and add more memory.
-
-==== Decrease weight
-
-Each builder has a weight as to how much work can be given to it.
-Presently the only way to alter weight is actually changing the database
-on _db3_:
-
-....
-$ sudo su - postgres
-bash-2.05b$ psql koji
-koji=# select * from host limit 1;
-id | user_id |          name          |  arches   | task_load | capacity | ready | enabled
---+---------+------------------------+-----------+-----------+----------+-------+---------
-6  |     130 | ppc3.fedora.redhat.com | ppc ppc64 |       1.5 |        4 | t     | t
-(1 row)
-koji=# update host set capacity=2 where name='ppc3.fedora.redhat.com';
-....
-
-Simply update capacity to a lower number.
-
 === Disk Space Issues

-The builders use a lot of temporary storage. Failed builds also get left
-on the builders, most should get cleaned but plague does not. The
-easiest thing to do is remove some older cache dirs.
-
-Step one is to turn off both koji and plague:
+The builders use a lot of temporary storage. Failed builds and old mock buildroots should be cleaned up automatically. If they are not, remove `/var/lib/mock/*` and restart kojid:

 ....
-/etc/init.d/plague-builder stop
-/etc/init.d/kojid stop
-....
-
-Next check to see what file system is full:
-
-....
-df -h
+/etc/init.d/kojid restart
 ....

 [IMPORTANT]
 ====
-If any one of the following directories is full, send an outage
-notification as outlined in: Infrastructure/OutageTemplate to the
-fedora-infrastructure-list and fedora-devel-list, then contact Mike
-McGrath
-
-* /mnt/koji
-* /mnt/ntap-fedora1/scratch
-* /pub/epel
-* /pub/fedora
+On the aarch64 buildhw machines, a lot of space is taken up in /var/log/libvirt/qemu/nvram/. Everything in that directory can be deleted to free up space.
 ====

-Typically just / will be full. The next thing to do is determine if
-we have any extremely large builds left on the builder. Typical
-locations include /var/lib/mock and /mnt/build (/mnt/build actually is
-on the local filesystem):
+=== Checking builders are all checking in correctly
+
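+To check the builders, list all enabled builders and filter out those whose time of last update matches the current time. If a builder is checking in correctly, its time of last update should be close to your current date/time, so any builder left in the output has not checked in recently. Use a command like the following example, replacing the timestamp with the current one: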

 ....
-du -sh /var/lib/mock/* /mnt/build/*
+koji list-hosts --enabled  | grep -v '04 Dec 2022 12:1'
 ....

-`/var/lib/mock/dist-f8-build-10443-1503`::
-  classic koji build
-`/var/lib/mock/fedora-6-ppc-core-57cd31505683ef1afa533197e91608c5a2c52864`::
-  classic plague build
-
-If nothing jumps out immediately, just start deleting files older than
-one week. Once enough space has been freed start koji and plague back
-up:
-
-....
-/etc/init.d/plague-builder start
-/etc/init.d/kojid start
-....
-
-=== Unmounting
-
-[WARNING]
+[IMPORTANT]
 ====
-Should there be mention of being sure filesystems in chroots are
-unmounted before you delete the chroots?
-
-Res ipsa loquitur.
+The kojira process should only run on koji02, never on koji01.
 ====
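
The builder check added in this change hard-codes a timestamp prefix (`'04 Dec 2022 12:1'`), which has to be retyped each time. As a rough sketch only, the prefix could be derived from the clock instead. The helper names `checkin_pattern` and `stale_builders` are hypothetical, and the sketch assumes GNU `date`, that the last-update column uses the `DD Mon YYYY HH:MM` layout shown in the example, and that those times are in UTC:

```shell
#!/bin/sh
# Hypothetical helpers, not part of koji or of the guide being changed.

# Print a grep pattern covering the ten-minute window that contains the
# given time (default: now), e.g. "04 Dec 2022 12:1".
# Assumes GNU date (-d) and UTC timestamps in the listing.
checkin_pattern() {
    LC_ALL=C date -u -d "${1:-now}" '+%d %b %Y %H:%M' | cut -c1-16
}

# Read a host listing on stdin and print only the lines that do NOT
# contain the pattern, i.e. hosts that have not checked in recently.
# "|| true" keeps the exit status at 0 when every host matches.
stale_builders() {
    grep -vF -- "$1" || true
}

# Example invocation (needs a configured koji client):
#   koji list-hosts --enabled | stale_builders "$(checkin_pattern)"
```

Because the pattern covers a fixed ten-minute window, healthy builders can appear in the output for a short while just after the window rolls over (for example at 12:20), and any header lines that koji prints also pass through the filter.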