= Copr

Copr is a build system for third-party packages.

Frontend::
* http://copr.fedorainfracloud.org/

Backend::
* http://copr-be.cloud.fedoraproject.org/

Package signer::
* copr-keygen.cloud.fedoraproject.org

Dist-git::
* copr-dist-git.fedorainfracloud.org

Devel instances (no need to care about them, just those above)::
* http://copr-fe-dev.cloud.fedoraproject.org/
* http://copr-be-dev.cloud.fedoraproject.org/
* copr-keygen-dev.cloud.fedoraproject.org
* copr-dist-git-dev.fedorainfracloud.org

== Contact Information

Owner::
msuchy (mirek)

Contact::
#fedora-admin, #fedora-buildsys

Location::
Fedora Cloud

Purpose::
Build system

== This document

This document provides condensed information for keeping Copr alive and
working. For more sophisticated business processes, please see
https://docs.pagure.org/copr.copr/maintenance_documentation.html

== TROUBLESHOOTING

Almost every problem with Copr is caused either by problems with
spawning builder VMs, or by the processing of the action queue on the
backend.

=== VM spawning/termination problems

Try to restart the copr-backend service:

....
$ ssh root@copr-be.cloud.fedoraproject.org
$ systemctl restart copr-backend
....

If this doesn't solve the problem, try to follow the logs for clues:

....
$ tail -f /var/log/copr-backend/{vmm,spawner,terminator}.log
....

As a last-resort option, you can terminate all builders and let
copr-backend throw away all information about them. This will obviously
interrupt all running builds and reschedule them:

....
$ ssh root@copr-be.cloud.fedoraproject.org
$ systemctl stop copr-backend
$ cleanup_vm_nova.py
$ redis-cli
> FLUSHALL
$ systemctl start copr-backend
....

Sometimes OpenStack cannot handle spawning too many VMs at the same
time, so it is safer to edit, on _copr-be.cloud.fedoraproject.org_:

....
vi /etc/copr/copr-be.conf
....

and change:

....
group0_max_workers=12
....

to "6". Start the copr-backend service and some time later increase the
value back to the original. Copr automatically detects the change in the
configuration and increases the number of workers.
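The edit above can also be scripted. A minimal sketch, using a temporary
copy of the config so the commands can be tried safely anywhere; on the
real backend the file is /etc/copr/copr-be.conf:

```shell
# Sketch: lower the worker limit, then restore it. A temporary copy
# stands in for /etc/copr/copr-be.conf so this is safe to try locally.
conf=$(mktemp)
printf 'group0_max_workers=12\n' > "$conf"

# Halve the limit while OpenStack catches up.
sed -i 's/^group0_max_workers=.*/group0_max_workers=6/' "$conf"
grep '^group0_max_workers' "$conf"    # group0_max_workers=6

# Later, restore the original value; copr-backend picks the change up
# automatically, no restart needed.
sed -i 's/^group0_max_workers=.*/group0_max_workers=12/' "$conf"
grep '^group0_max_workers' "$conf"    # group0_max_workers=12
rm -f "$conf"
```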

The set of aarch64 VMs isn't maintained by OpenStack, but by Copr's
backend itself. Steps to diagnose:

....
$ ssh root@copr-be.cloud.fedoraproject.org
[root@copr-be ~][PROD]# systemctl status resalloc
● resalloc.service - Resource allocator server
...

[root@copr-be ~][PROD]# less /var/log/resallocserver/main.log

[root@copr-be ~][PROD]# su - resalloc

[resalloc@copr-be ~][PROD]$ resalloc-maint resource-list
13569 - aarch64_01_prod_00013569_20190613_151319 pool=aarch64_01_prod tags=aarch64 status=UP
13597 - aarch64_01_prod_00013597_20190614_083418 pool=aarch64_01_prod tags=aarch64 status=UP
13594 - aarch64_02_prod_00013594_20190614_082303 pool=aarch64_02_prod tags=aarch64 status=STARTING
...

[resalloc@copr-be ~][PROD]$ resalloc-maint ticket-list
879 - state=OPEN tags=aarch64 resource=aarch64_01_prod_00013569_20190613_151319
918 - state=OPEN tags=aarch64 resource=aarch64_01_prod_00013608_20190614_135536
904 - state=OPEN tags=aarch64 resource=aarch64_02_prod_00013594_20190614_082303
919 - state=OPEN tags=aarch64
...
....

Be careful when there's some resource in the `STARTING` state. If so,
check
`/usr/bin/tail -F -n +0 /var/log/resallocserver/hooks/013594_alloc`.
Copr takes tickets from the resalloc server; if the resources fail to
spawn, tickets are not assigned an appropriately tagged resource for a
long time.
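The stuck tickets are exactly those listed without a `resource=` field.
A small sketch that filters them out; the sample input mirrors the
`resalloc-maint ticket-list` output shown above:

```shell
# Print IDs of OPEN tickets that have no resource assigned; these are
# the tickets waiting on a failed spawn. The sample lines are copied
# from a `resalloc-maint ticket-list` run like the one above.
ticket_list='879 - state=OPEN tags=aarch64 resource=aarch64_01_prod_00013569_20190613_151319
918 - state=OPEN tags=aarch64 resource=aarch64_01_prod_00013608_20190614_135536
919 - state=OPEN tags=aarch64'

printf '%s\n' "$ticket_list" | awk '/state=OPEN/ && !/resource=/ {print $1}'
# -> 919
```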

If that happens (it shouldn't) and there's some inconsistency between
resalloc's database and the actual status on the aarch64 hypervisors
(`ssh copr@virthost-aarch64-os0{1,2}.fedorainfracloud.org`; use `virsh`
there to introspect their status), use `resalloc-maint resource-delete`,
`resalloc ticket-close` or `psql` commands to fix up resalloc's DB.

=== Backend Troubleshooting

To see the status of the Copr backend services:

....
systemctl status copr-backend*.service
....

Utilization of workers:

....
ps axf
....

Worker processes change their process title ($0) to show which task they
are working on and on which builder.

To list which VM builders are tracked by the copr-vmm service:

....
/usr/bin/copr_get_vm_info.py
....

=== Appstream builder troubleshooting

The appstream builder is painfully slow when running on a repository
with a huge number of packages. See
https://github.com/hughsie/appstream-glib/issues/301 . You might need to
disable it for some projects:

....
$ ssh root@copr-be.cloud.fedoraproject.org
$ cd /var/lib/copr/public_html/results/<owner>/<project>/
$ touch .disable-appstream
# You should probably also delete the existing appstream data because
# it might be obsolete
$ rm -rf ./appdata
....
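When several projects are affected, the same two steps can be looped. A
sketch against a temporary directory standing in for
/var/lib/copr/public_html/results; the owner/project names are made up:

```shell
# Disable appstream for several projects in one go. The temp directory
# stands in for /var/lib/copr/public_html/results, and the
# owner/project names below are hypothetical.
results=$(mktemp -d)
mkdir -p "$results/someowner/someproject/appdata"

for proj in someowner/someproject; do
    touch "$results/$proj/.disable-appstream"
    rm -rf "$results/$proj/appdata"    # drop the now-obsolete appstream data
done

ls -A "$results/someowner/someproject"    # .disable-appstream
rm -rf "$results"
```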

=== Backend action queue issues

First check the _number of not-yet-processed actions_. If that number
isn't zero and is not decreasing reasonably fast (say, a single action
takes longer than 30 seconds), there might be a problem. Logs for the
action dispatcher can be found in:

....
/var/log/copr-backend/action_dispatcher.log
....

Check that there is no stuck process under the `Action dispatch` parent
process in `pstree -a copr` output.
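A quick way to eyeball that subtree is to grep the pstree output. The
sample below is made up to stand in for real `pstree -a copr` output; on
the backend, pipe the real command instead:

```shell
# Show the action dispatcher and whatever hangs under it. The pstree
# text here is a fabricated sample; on copr-be run
#   pstree -a copr | grep -A1 'Action dispatch'
# instead.
pstree_sample='copr-backend
  |-copr-backend-act "Action dispatch"
  |   `-copr-backend-act "Action 123456"
  `-copr-backend-bui "Build dispatch"'

printf '%s\n' "$pstree_sample" | grep -A1 'Action dispatch'
```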

== Deploy information

Using playbooks and rbac:

....
$ sudo rbac-playbook groups/copr-backend.yml
$ sudo rbac-playbook groups/copr-frontend-cloud.yml
$ sudo rbac-playbook groups/copr-keygen.yml
$ sudo rbac-playbook groups/copr-dist-git.yml
....

The
https://pagure.io/copr/copr/blob/main/f/copr-setup.txt[copr-setup.txt]
manual is severely outdated, but there is no up-to-date alternative. We
should extract the useful information from it, put it here in the SOP or
into
https://docs.pagure.org/copr.copr/maintenance_documentation.html, and
then throw _copr-setup.txt_ away.

The copr-backend service (which spawns several processes) should be
running on the backend. The backend spawns VMs in Fedora Cloud. You
cannot log in to those machines directly; you have to:

....
$ ssh root@copr-be.cloud.fedoraproject.org
$ su - copr
$ copr_get_vm_info.py
# find the IP address of the VM that you want
$ ssh root@172.16.3.3
....

Instances can be easily terminated in
https://fedorainfracloud.org/dashboard

=== Order of start up

When reprovisioning, start the copr-keygen and copr-dist-git machines
first (in any order). Then you can start copr-be. Well, you can start it
sooner, but make sure the copr-* services are stopped.

The copr-fe machine is completely independent and can be started at any
time. If the backend is stopped, it will just queue jobs.

== Logs

=== Backend

* /var/log/copr-backend/action_dispatcher.log
* /var/log/copr-backend/actions.log
* /var/log/copr-backend/backend.log
* /var/log/copr-backend/build_dispatcher.log
* /var/log/copr-backend/logger.log
* /var/log/copr-backend/spawner.log
* /var/log/copr-backend/terminator.log
* /var/log/copr-backend/vmm.log
* /var/log/copr-backend/worker.log

There are also several logs for non-essential features, such as
copr_prune_results.log, hitcounter.log and cleanup_vms.log, that you
shouldn't need to worry about.

=== Frontend

* /var/log/copr-frontend/frontend.log
* /var/log/httpd/access_log
* /var/log/httpd/error_log

=== Keygen

* /var/log/copr-keygen/main.log

=== Dist-git

* /var/log/copr-dist-git/main.log
* /var/log/httpd/access_log
* /var/log/httpd/error_log
== Services
|
|
|
|
|
|
|
|
=== Backend
|
|
|
|
|
|
|
|
* copr-backend
|
|
|
|
** copr-backend-action
|
|
|
|
** copr-backend-build
|
|
|
|
** copr-backend-log
|
|
|
|
** copr-backend-vmm
|
|
|
|
* redis
|
|
|
|
* lighttpd
|
|
|
|
|
2021-08-18 12:49:35 +02:00
|
|
|
All the _copr-backend-*.service_ are configured to be a part
|
|
|
|
of the _copr-backend.service_ so e.g. in case of restarting
|
|
|
|
all of them, just restart the _copr-backend.service_.

=== Frontend

* httpd
* postgresql

=== Keygen

* signd

=== Dist-git

* httpd
* copr-dist-git

== PPC64LE Builders

Builders for PPC64LE are located at rh-power2.fit.vutbr.cz, and anyone
with access to the buildsys ssh key can get there using that key, e.g.
as msuchy@rh-power2.fit.vutbr.cz.

These commands are available:

....
$ ls bin/
destroy-all.sh reinit-vm26.sh
reinit-vm28.sh virsh-destroy-vm26.sh virsh-destroy-vm28.sh
virsh-start-vm26.sh virsh-start-vm28.sh get-one-vm.sh reinit-vm27.sh
reinit-vm29.sh virsh-destroy-vm27.sh virsh-destroy-vm29.sh
virsh-start-vm27.sh virsh-start-vm29.sh
....

`destroy-all.sh` destroys all VMs and reinits them.

`reinit-vmXX.sh` copies the VM image from a template.

`virsh-destroy-vmXX.sh` destroys the VM.

`virsh-start-vmXX.sh` starts the VM.

`get-one-vm.sh` starts one VM and returns its IP; this is used in Copr
playbooks.

In case of a big queue of PPC64LE tasks, simply call
`bin/destroy-all.sh`; it will destroy the stuck VMs and the Copr backend
will spawn new ones.

== Ports opened for public

Frontend:

[width="86%",cols="13%,17%,16%,54%",options="header",]
|===
|Port |Protocol |Service |Reason
|22 |TCP |ssh |Remote control
|80 |TCP |http |Serving Copr frontend website
|443 |TCP |https |^^
|===

Backend:

[width="86%",cols="13%,17%,16%,54%",options="header",]
|===
|Port |Protocol |Service |Reason
|22 |TCP |ssh |Remote control
|80 |TCP |http |Serving build results and repos
|443 |TCP |https |^^
|===

Distgit:

[width="86%",cols="13%,17%,16%,54%",options="header",]
|===
|Port |Protocol |Service |Reason
|22 |TCP |ssh |Remote control
|80 |TCP |http |Serving cgit interface
|443 |TCP |https |^^
|===

Keygen:

[width="86%",cols="13%,17%,16%,54%",options="header",]
|===
|Port |Protocol |Service |Reason
|22 |TCP |ssh |Remote control
|===

== Resources justification

Copr currently uses the following resources.

=== Frontend

* RAM: 2G (out of 4G) and some swap
* CPU: 2 cores (3400 MHz) with load 0.92, 0.68, 0.65

Most of the memory is eaten by PostgreSQL, followed by Apache. The CPU
is likewise consumed mainly by those two services, but in the reverse
order.

We cannot settle for any instance that provides less than 2G RAM,
obviously; ideally, we need 3G+. A 2-core CPU is good enough.

* Disk space: 17G for the system and 8G for the _pgsqldb_ directory

If needed, we are able to clean up the database directory of old dumps
and backups and get down to around 4G of disk space.

=== Backend

* RAM: 5G (out of 16G)
* CPU: 8 cores (3400 MHz) with load 4.09, 4.55, 4.24

The backend takes care of spinning up builders, running Ansible
playbooks on them, running _createrepo_c_ (on big repositories) and so
on. Copr utilizes two queues: one for builds, which are delegated to
OpenStack builders, and one for actions. Actions, however, are processed
directly by the backend, so they can spike our load up. We would ideally
like to keep the same computing power that we have now. Maybe we can go
lower than 16G RAM, possibly down to 12G.

* Disk space: 30G for the system, 5.6T (out of 6.8T) for build results

Currently, we have 1.3T of backup data that is going to be deleted soon,
but nevertheless we cannot go any lower on storage. Disk space is a
long-term issue for us, and we need to make a lot of compromises just to
survive our daily increase (around 10G of new data). Many features are
blocked by not having enough storage. We cannot go any lower, and we
also cannot go on much longer with the current storage.

=== Distgit

* RAM: ~270M (out of 4G), but climbs to ~1G when busy
* CPU: 2 cores (3400 MHz) with load 1.35, 1.00, 0.53

Personally, I wouldn't downgrade the machine too much. Possibly we can
live with 3G RAM, but I wouldn't go any lower.

* Disk space: 7G for the system, 1.3T for dist-git data

We currently employ a lot of aggressive cleaning strategies on our
dist-git data, so we can't go any lower than what we have.

=== Keygen

* RAM: ~150M (out of 2G)
* CPU: 1 core (3400 MHz) with load 0.10, 0.31, 0.25

We are basically running just _signd_ and _httpd_ here, both with
minimal resource requirements. The memory usage is topped by
_systemd-journald_.

* Disk space: 7G for the system and ~500M (out of ~700M) for GPG keys

We are slowly pushing the GPG key storage to its limit, so in the case
of migrating copr-keygen somewhere, we would like to scale it up to at
least 1G.