= Copr
Copr is a build system for third-party packages.
Frontend:::
* http://copr.fedorainfracloud.org/
Backend:::
* http://copr-be.cloud.fedoraproject.org/
Package signer:::
* copr-keygen.cloud.fedoraproject.org
Dist-git:::
* copr-dist-git.fedorainfracloud.org
Devel instances (NO NEED TO CARE ABOUT THEM, JUST THOSE ABOVE):::
* http://copr-fe-dev.cloud.fedoraproject.org/
* http://copr-be-dev.cloud.fedoraproject.org/
* copr-keygen-dev.cloud.fedoraproject.org
* copr-dist-git-dev.fedorainfracloud.org
== Contact Information
Owner::
msuchy (mirek)
Contact::
#fedora-admin, #fedora-buildsys
Location::
Fedora Cloud
Purpose::
Build system
== This document
This document provides condensed information allowing you to keep Copr
alive and working. For more sophisticated business processes, please see
https://docs.pagure.org/copr.copr/maintenance_documentation.html
== TROUBLESHOOTING
Almost every problem with Copr is caused by a problem with spawning builder VMs,
or with processing the action queue on the backend.
=== VM spawning/termination problems
Try to restart the copr-backend service:
....
$ ssh root@copr-be.cloud.fedoraproject.org
$ systemctl restart copr-backend
....
If this doesn't solve the problem, follow the logs for clues:
....
$ tail -f /var/log/copr-backend/{vmm,spawner,terminator}.log
....
As the last resort, you can terminate all builders and let
copr-backend throw away all information about them. This will
obviously interrupt all running builds and reschedule them:
....
$ ssh root@copr-be.cloud.fedoraproject.org
$ systemctl stop copr-backend
$ cleanup_vm_nova.py
$ redis-cli
> FLUSHALL
$ systemctl start copr-backend
....
Sometimes OpenStack cannot handle spawning too many VMs at the same
time, so it is safer to edit the following on _copr-be.cloud.fedoraproject.org_:
....
vi /etc/copr/copr-be.conf
....
and change:
....
group0_max_workers=12
....
to "6". Start copr-backend service and some time later increase it to
original value. Copr automaticaly detect change in script and increase
number of workers.
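A condensed sketch of the procedure (the `sed` invocation is only illustrative;
editing the file by hand works just as well):
....
$ ssh root@copr-be.cloud.fedoraproject.org
# temporarily lower the number of workers
$ sed -i 's/^group0_max_workers=12$/group0_max_workers=6/' /etc/copr/copr-be.conf
$ systemctl start copr-backend
# some time later, raise the value back; the running backend picks the change up automatically
$ sed -i 's/^group0_max_workers=6$/group0_max_workers=12/' /etc/copr/copr-be.conf
....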
The set of aarch64 VMs isn't maintained by OpenStack, but by Copr's
backend itself. Steps to diagnose:
....
$ ssh root@copr-be.cloud.fedoraproject.org
[root@copr-be ~][PROD]# systemctl status resalloc
● resalloc.service - Resource allocator server
...
[root@copr-be ~][PROD]# less /var/log/resallocserver/main.log
[root@copr-be ~][PROD]# su - resalloc
[resalloc@copr-be ~][PROD]$ resalloc-maint resource-list
13569 - aarch64_01_prod_00013569_20190613_151319 pool=aarch64_01_prod tags=aarch64 status=UP
13597 - aarch64_01_prod_00013597_20190614_083418 pool=aarch64_01_prod tags=aarch64 status=UP
13594 - aarch64_02_prod_00013594_20190614_082303 pool=aarch64_02_prod tags=aarch64 status=STARTING
...
[resalloc@copr-be ~][PROD]$ resalloc-maint ticket-list
879 - state=OPEN tags=aarch64 resource=aarch64_01_prod_00013569_20190613_151319
918 - state=OPEN tags=aarch64 resource=aarch64_01_prod_00013608_20190614_135536
904 - state=OPEN tags=aarch64 resource=aarch64_02_prod_00013594_20190614_082303
919 - state=OPEN tags=aarch64
...
....
Be careful when some resource is in the `STARTING` state. If that's the case,
check
`/usr/bin/tail -F -n +0 /var/log/resallocserver/hooks/013594_alloc`.
Copr takes tickets from the resalloc server; if the resources fail to
spawn, the ticket numbers are not assigned an appropriately tagged
resource for a long time.
If that happens (it shouldn't) and there's some inconsistency between
resalloc's database and the actual status on the aarch64 hypervisors
(`ssh copr@virthost-aarch64-os0{1,2}.fedorainfracloud.org`, and use
`virsh` there to inspect their statuses), use the
`resalloc-maint resource-delete`, `resalloc ticket-close` or `psql`
commands to fix up resalloc's DB.
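For example, assuming both commands accept the numeric IDs printed by the
listing commands above (the IDs here are only illustrative):
....
[resalloc@copr-be ~][PROD]$ resalloc-maint resource-delete 13594   # resource ID from 'resource-list'
[resalloc@copr-be ~][PROD]$ resalloc ticket-close 919              # ticket ID from 'ticket-list'
....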
=== Backend Troubleshooting
Information about the status of the Copr backend services:
....
systemctl status copr-backend*.service
....
Utilization of workers:
....
ps axf
....
Worker processes change their $0 (the process title shown by `ps`) to show
which task they are working on and on which builder.
To list which VM builders are tracked by copr-vmm service:
....
/usr/bin/copr_get_vm_info.py
....
=== Appstream builder troubleshooting
The appstream builder is painfully slow when running on a repository with a
huge number of packages. See
https://github.com/hughsie/appstream-glib/issues/301 . You might need to
disable it for some projects:
....
$ ssh root@copr-be.cloud.fedoraproject.org
$ cd /var/lib/copr/public_html/results/<owner>/<project>/
$ touch .disable-appstream
# You should probably also delete existing appstream data because
# they might be obsolete
$ rm -rf ./appdata
....
=== Backend action queue issues
First check the _number of not-yet-processed actions_. If that
number isn't zero, and is not decreasing relatively fast (say a
single action takes longer than 30s), there might be some problem.
Logs for the action dispatcher can be found in:
....
/var/log/copr-backend/action_dispatcher.log
....
Check that there's no stuck process under the `Action dispatch` parent
process in the `pstree -a copr` output.
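A quick check, using the log path and the `pstree` command mentioned above,
might look like:
....
$ ssh root@copr-be.cloud.fedoraproject.org
$ tail -n 50 /var/log/copr-backend/action_dispatcher.log
$ pstree -a copr    # look for a stuck child under the 'Action dispatch' process
....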
== Deploy information
Using playbooks and rbac:
....
$ sudo rbac-playbook groups/copr-backend.yml
$ sudo rbac-playbook groups/copr-frontend-cloud.yml
$ sudo rbac-playbook groups/copr-keygen.yml
$ sudo rbac-playbook groups/copr-dist-git.yml
....
The
https://pagure.io/copr/copr/blob/main/f/copr-setup.txt[copr-setup.txt]
manual is severely outdated, but there is
no up-to-date alternative. We should extract useful information from it
and put it here in the SOP or into
https://docs.pagure.org/copr.copr/maintenance_documentation.html and
then throw the _copr-setup.txt_ away.
The copr-backend service (which spawns several processes) should run on the
backend. The backend spawns VMs in Fedora Cloud. You cannot log in to
those machines directly; you have to:
....
$ ssh root@copr-be.cloud.fedoraproject.org
$ su - copr
$ copr_get_vm_info.py
# find IP address of the VM that you want
$ ssh root@172.16.3.3
....
Instances can be easily terminated in
https://fedorainfracloud.org/dashboard
=== Order of start up
When reprovisioning, you should start the copr-keygen and copr-dist-git
machines first (in any order). Then you can start copr-be. You can start
it sooner, but make sure that the copr-* services are stopped.
The copr-fe machine is completely independent and can be started at any time.
If the backend is stopped, it will just queue jobs.
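A hedged sketch of that ordering (hostnames are taken from the top of this
document; the `systemctl is-active` checks are only a convenience):
....
# 1) make sure keygen and dist-git are up first
$ ssh root@copr-keygen.cloud.fedoraproject.org systemctl is-active signd
$ ssh root@copr-dist-git.fedorainfracloud.org systemctl is-active copr-dist-git httpd
# 2) only then start the backend services
$ ssh root@copr-be.cloud.fedoraproject.org
$ systemctl start copr-backend
....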
== Logs
=== Backend
* /var/log/copr-backend/action_dispatcher.log
* /var/log/copr-backend/actions.log
* /var/log/copr-backend/backend.log
* /var/log/copr-backend/build_dispatcher.log
* /var/log/copr-backend/logger.log
* /var/log/copr-backend/spawner.log
* /var/log/copr-backend/terminator.log
* /var/log/copr-backend/vmm.log
* /var/log/copr-backend/worker.log
And several logs for non-essential features, such as
copr_prune_results.log, hitcounter.log, and cleanup_vms.log, that you
don't need to worry about.
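When debugging, it is usually enough to follow a few of the most relevant
ones, for example:
....
$ ssh root@copr-be.cloud.fedoraproject.org
$ tail -F /var/log/copr-backend/{backend,build_dispatcher,action_dispatcher,worker}.log
....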
=== Frontend
* /var/log/copr-frontend/frontend.log
* /var/log/httpd/access_log
* /var/log/httpd/error_log
=== Keygen
* /var/log/copr-keygen/main.log
=== Dist-git
* /var/log/copr-dist-git/main.log
* /var/log/httpd/access_log
* /var/log/httpd/error_log
== Services
=== Backend
* copr-backend
** copr-backend-action
** copr-backend-build
** copr-backend-log
** copr-backend-vmm
* redis
* lighttpd
All the _copr-backend-*_ services are configured to be a part
of _copr-backend.service_, so e.g. to restart
all of them, just restart _copr-backend.service_.
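For example:
....
$ ssh root@copr-be.cloud.fedoraproject.org
$ systemctl restart copr-backend   # also restarts copr-backend-{action,build,log,vmm}
....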
=== Frontend
* httpd
* postgresql
=== Keygen
* signd
=== Dist-git
* httpd
* copr-dist-git
== PPC64LE Builders
Builders for PPC64LE are located at rh-power2.fit.vutbr.cz, and anyone with
access to the buildsys ssh key can get there as::
msuchy@rh-power2.fit.vutbr.cz
The following commands are available:
....
$ ls bin/
destroy-all.sh         get-one-vm.sh
reinit-vm26.sh         reinit-vm27.sh         reinit-vm28.sh         reinit-vm29.sh
virsh-destroy-vm26.sh  virsh-destroy-vm27.sh  virsh-destroy-vm28.sh  virsh-destroy-vm29.sh
virsh-start-vm26.sh    virsh-start-vm27.sh    virsh-start-vm28.sh    virsh-start-vm29.sh
....
`destroy-all.sh`:: destroys all VMs and re-initializes them
`reinit-vmXX.sh`:: copies the VM image from the template
`virsh-destroy-vmXX.sh`:: destroys the VM
`virsh-start-vmXX.sh`:: starts the VM
`get-one-vm.sh`:: starts one VM and returns its IP; this is used in the Copr playbooks
In case of a big queue of PPC64LE tasks, simply call `bin/destroy-all.sh`;
it will destroy the stuck VMs and the Copr backend will spawn new ones, as
sketched below.
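A minimal sketch (the account is the example from above; use whichever
account holds the buildsys key):
....
$ ssh msuchy@rh-power2.fit.vutbr.cz   # or any account with the buildsys key
$ bin/destroy-all.sh                  # destroys and re-initializes all PPC64LE VMs
....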
== Ports opened for public
Frontend:
[width="86%",cols="13%,17%,16%,54%",options="header",]
|===
|Port |Protocol |Service |Reason
|22 |TCP |ssh |Remote control
|80 |TCP |http |Serving Copr frontend website
|443 |TCP |https |^^
|===
Backend:
[width="86%",cols="13%,17%,16%,54%",options="header",]
|===
|Port |Protocol |Service |Reason
|22 |TCP |ssh |Remote control
|80 |TCP |http |Serving build results and repos
|443 |TCP |https |^^
|===
Distgit:
[width="86%",cols="13%,17%,16%,54%",options="header",]
|===
|Port |Protocol |Service |Reason
|22 |TCP |ssh |Remote control
|80 |TCP |http |Serving cgit interface
|443 |TCP |https |^^
|===
Keygen:
[width="86%",cols="13%,17%,16%,54%",options="header",]
|===
|Port |Protocol |Service |Reason
|22 |TCP |ssh |Remote control
|===
== Resources justification
Copr currently uses the following resources.
=== Frontend
* RAM: 2G (out of 4G) and some swap
* CPU: 2 cores (3400 MHz) with load 0.92, 0.68, 0.65
Most of the memory is eaten by PostgreSQL, followed by Apache. The CPU
is also used mainly by those two services, but in the reverse order.
I don't think we can settle for any instance that provides less than
2G RAM, but ideally we need 3G+. A 2-core CPU is good enough.
* Disk space: 17G for system and 8G for _pgsqldb_ directory
If needed, we can clean up old dumps and backups in the database directory
and get down to around 4G of disk space.
=== Backend
* RAM: 5G (out of 16G)
* CPU: 8 cores (3400MHz) with load 4.09, 4.55, 4.24
The backend takes care of spinning up builders, running ansible playbooks
on them, running _createrepo_c_ (on big repositories), and so
on. Copr uses two queues, one for builds, which are delegated to
OpenStack builders, and one for actions. Actions, however, are processed
directly by the backend, so they can spike our load. We would ideally
like to keep the computing power that we have now. Maybe we can go
lower than 16G RAM, possibly down to 12G.
* Disk space: 30G for the system, 5.6T (out of 6.8T) for build results
Currently, we have 1.3T of backup data that is going to be deleted
soon, but we still cannot go any lower on storage. Disk space is
a long-term issue for us and we have to make a lot of compromises
just to survive our daily increase (which is around 10G of
new data). Many features are blocked by not having enough storage. We
cannot go any lower, and we also cannot go much longer with the current
storage.
=== Distgit
* RAM: ~270M (out of 4G), but climbs to ~1G when busy
* CPU: 2 cores (3400MHz) with load 1.35, 1.00, 0.53
Personally, I wouldn't downgrade the machine too much. Possibly we can
live with 3G RAM, but I wouldn't go any lower.
* Disk space: 7G for system, 1.3T dist-git data
We currently employ a lot of aggressive cleaning strategies on our
distgit data, so we can't go any lower than what we have.
=== Keygen
* RAM: ~150M (out of 2G)
* CPU: 1 core (3400MHz) with load 0.10, 0.31, 0.25
We are basically running just _signd_ and
_httpd_ here, both with minimal resource requirements. The
memory usage is topped by _systemd-journald_.
* Disk space: 7G for system and ~500M (out of ~700M) for GPG keys
We are slowly pushing the GPG key storage to its limit, so in case
of migrating copr-keygen somewhere, we would like to scale it up to at
least 1G.