2021-07-26 10:39:47 +02:00
= Fedora Debuginfod Service - SOP

Debuginfod is the software that lies behind the services at
https://debuginfod.fedoraproject.org/ and
https://debuginfod.stg.fedoraproject.org/. Each service runs on one VM,
in the prod and stg infrastructure respectively, at RDU3.

== Contact Information

Owner:::
RH perftools team + Fedora Infrastructure Team
Contact:::
@fche in #fedora-noc
Servers:::
VMs
Purpose:::
Serve elf/dwarf/source-code debuginfo for supported releases to
debugger-like tools in Fedora.
Repository:::
https://sourceware.org/elfutils/Debuginfod.html
https://fedoraproject.org/wiki/Debuginfod

== How it works

One virtual machine in prod NFS-mounts the koji build system's RPM
repository, read-only. The production VM has a virtual twin in the
staging environment. They each run elfutils debuginfod to index
designated RPMs into a large local sqlite database. They answer HTTP
queries received from users on the Internet via reverse proxies at the
https://debuginfod.fedoraproject.org/ URL. The reverse proxies apply
gzip compression to the data and redirect the root `/` location only
to the Fedora wiki.

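As a rough illustration of the HTTP queries described above, the following Python sketch builds request URLs for the debuginfod webapi. The `/buildid/<hex>/debuginfo` form matches the request visible in this service's logs; the `executable` and `source` artifact types come from the upstream elfutils webapi. The helper name `debuginfod_url` is hypothetical, and real clients normally go through `debuginfod-find` or libdebuginfod rather than hand-built URLs.

```python
import re

BASE_URL = "https://debuginfod.fedoraproject.org"


def debuginfod_url(buildid, artifact="debuginfo", source_path=None, base=BASE_URL):
    """Return the webapi URL for one artifact of a given hex build-id.

    Hypothetical helper for illustration; not part of elfutils.
    """
    if not re.fullmatch(r"[0-9a-f]+", buildid):
        raise ValueError("build-id must be lowercase hex")
    if artifact not in ("debuginfo", "executable", "source"):
        raise ValueError("unknown artifact type: " + artifact)
    url = f"{base.rstrip('/')}/buildid/{buildid}/{artifact}"
    if artifact == "source":
        # Source requests append the absolute source file path.
        if not source_path or not source_path.startswith("/"):
            raise ValueError("source requests need an absolute path")
        url += source_path
    return url


# Build-id taken from the log excerpt in this document.
print(debuginfod_url("90910c1963bbcf700c0c0c06ee3bf4c5cc831d3a"))
```

Requesting the root `/` location instead of a `/buildid/...` path only yields the redirect to the wiki page.
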
Normally, the service is autonomous and needs no maintenance. It should
recover cleanly from many kinds of outage. The software is based on
elfutils in Fedora, but may occasionally track a custom COPR build with
backported patches from future elfutils versions.

== Configuration

The daemon uses systemd and `/etc/sysconfig/debuginfod` to set basic
parameters. These have been tuned from the distro defaults via
experimental hand-editing or ansible. Key parameters are:

* The -I/-X include/exclude regexes. These tell debuginfod which Fedora
  versions to include RPMs for. If index disk space starts to run low, one
  can eliminate some older Fedora releases from the index to free up space
  (after the next groom cycle).
* The --fdcache related parameters. These tell debuginfod how much data
  to cache from RPMs. (Some debuginfo files - kernel, llvm, gtkweb, ... -
  are huge and worth retaining instead of repeatedly extracting.) This is
  a straight disk space vs. time tradeoff.
* The -t (scan interval) parameter. Scanning lets an index get bigger,
  as new RPMs in koji are examined and their contents indexed. Each pass
  takes many hours to traverse the entire koji NFS directory structure
  and fstat() everything for newness or change. A smaller scan interval
  lets debuginfod react more quickly to new koji builds, but increases
  load on the NFS server. A higher -c (scan thread concurrency) value may
  help the indexing process go faster, if the networking fabric & NFS
  server are underloaded.
* The -g (groom interval) parameter. Grooming lets an index get smaller,
  as files removed from koji are forgotten. It can be run very
  infrequently - weekly or less - since it takes many hours and cannot
  run concurrently with scanning.

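To make the -I/-X selection concrete, here is a minimal Python sketch of how an include/exclude pair decides which RPM paths get indexed. It assumes the upstream debuginfod behaviour: the regexes are unanchored, they are tested against the full file path, and a path matching both is excluded. Python's `re` stands in for POSIX extended regexes, which is close enough for simple patterns like these. The regex values and the fc34 path are hypothetical; they are not the production settings.

```python
import re


def is_indexed(path, include=".*", exclude="^$"):
    """Approximate debuginfod's -I/-X test on one full file path."""
    if re.search(include, path) is None:
        return False
    # A path matching both regexes is excluded.
    return re.search(exclude, path) is None


KOJI = "/mnt/fedora_koji_prod/koji/packages"
paths = [
    # Path taken from the log excerpt in this document.
    KOJI + "/valgrind/3.17.0/3.fc35/x86_64/valgrind-3.17.0-3.fc35.x86_64.rpm",
    # Hypothetical older build, for illustration only.
    KOJI + "/valgrind/3.16.1/20.fc34/x86_64/valgrind-3.16.1-20.fc34.x86_64.rpm",
]

# Hypothetical settings: index two releases, then exclude the older one.
include = r"\.fc(34|35)\."
before = [p for p in paths if is_indexed(p, include=include)]
after = [p for p in paths if is_indexed(p, include=include, exclude=r"\.fc34\.")]
print(len(before), len(after))
```

Trying candidate regexes against a handful of real koji paths in this way is cheaper than discovering a mistake after a multi-hour scan or groom pass.
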
A quick restart activates the new settings:

....
systemctl restart debuginfod
....

In case of some drastic failure like database corruption or signs of
penetration/abuse, one can shut down the server with systemd, and/or
stop traffic at the incoming proxy configuration level. The index sqlite
database under `/var/cache/debuginfod` may be deleted, if necessary, but
keep in mind that it takes days to reindex the relevant parts of koji.
Alternatively, with the services stopped, the 150GB+ sqlite database
files may be freely copied between the staging and production servers,
if that helps during disaster recovery.

== Monitoring

=== Prometheus

The debuginfod daemons answer the standard /metrics URL endpoint to
serve a variety of operational metrics in Prometheus format. Important
metrics include:

* filesys_free_ratio - free space on the filesystems. (These are also
  monitored via fedora-infra nagios.) If the free space on the database or
  tmp partition runs low, further indexing or even service may be
  impacted. Add more disk space if possible, or start eliding older Fedora
  versions from the database via the -I/-X daemon options.
* thread_busy - number of busy threads. During indexing, 1-6 threads may
  be busy for minutes or even days, intermittently. User requests show up
  as "buildid" (real request) or "buildid-after-you" (deferred duplicate
  request) labels. If there are more than a handful of "buildid" ones,
  there may be an overload/abuse underway, in which case it's time to
  identify the excessive traffic via the logs and put a temporary iptables
  block in place. Or perhaps there is an outage or slowdown of the koji
  NFS storage system, in which case there's not much to do.
* error_count - error counters. These should be zero or near zero all
  the time.

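The checks above can be sketched in Python as a small parser for the Prometheus text format plus a few threshold tests. This is only an illustration: the sample text, the label names (`purpose`, `role`, `libc`) and the thresholds are hypothetical, so compare them against a real /metrics response before relying on any of them.

```python
import re

# Hypothetical excerpt of a /metrics response; real label names may differ.
SAMPLE = '''\
# hypothetical excerpt of a /metrics response
filesys_free_ratio{purpose="database"} 0.42
filesys_free_ratio{purpose="tmpdir"} 0.08
thread_busy{role="scan"} 4
thread_busy{role="buildid"} 2
error_count{libc="No such file or directory"} 1
'''

LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/HELP/TYPE lines
        m = LINE_RE.match(line)
        if m is None:
            continue  # ignore anything this simple parser cannot read
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        samples.append((m.group("name"), labels, float(m.group("value"))))
    return samples


def find_problems(samples, min_free=0.10, max_buildid_threads=5):
    """Return human-readable warnings; the thresholds are illustrative."""
    problems = []
    for name, labels, value in samples:
        if name == "filesys_free_ratio" and value < min_free:
            problems.append(f"low free space {value:.0%} on {labels}")
        elif (name == "thread_busy" and labels.get("role") == "buildid"
              and value > max_buildid_threads):
            problems.append(f"{value:.0f} busy buildid threads")
        elif name == "error_count" and value > 0:
            problems.append(f"{value:.0f} errors for {labels}")
    return problems


for problem in find_problems(parse_metrics(SAMPLE)):
    print(problem)
```

In practice the text would come from the daemon's /metrics endpoint rather than an inline string, and alerting would normally live in the existing monitoring stack, not in an ad hoc script.
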
=== Logs

The debuginfod daemons produce voluminous logs into the local systemd
journal, whence the traffic moves to the usual fedora-infra log01
server, `/var/log/hosts/debuginfod*/YYYY/MM/DD/messages.log`. The lines
related to HTTP GET identify the main webapi traffic, with originating
IP addresses in the XFF: field, and response size and elapsed service
time in the last columns. These can be useful in tracking down possible
abuse:

....
Jun 28 22:36:43 debuginfod01 debuginfod[381551]: [Mon 28 Jun 2021 10:36:43 PM GMT] (381551/2413727): 10.16.163.75:43776 UA:elfutils/0.185,Linux/x86_64,fedora/35 XFF:*elided* GET /buildid/90910c1963bbcf700c0c0c06ee3bf4c5cc831d3a/debuginfo 200 335440 0+0ms
....

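When hunting for abusive traffic, it can help to total the response bytes per originating address. The Python sketch below parses GET lines of the shape shown above and sums the size column per XFF value. The regex is inferred from that single sample line, so treat it as an assumption; lines in any other shape are simply skipped. The function names are hypothetical.

```python
import re
from collections import Counter

# Field layout inferred from the sample log line in this document.
GET_RE = re.compile(
    r"\): (?P<peer>\S+) UA:(?P<ua>.*?) XFF:(?P<xff>.*?) "
    r"GET (?P<path>\S+) (?P<status>\d{3}) (?P<size>\d+) (?P<elapsed>\S+)$"
)


def parse_get(line):
    """Return a dict of fields for a webapi GET line, or None otherwise."""
    m = GET_RE.search(line.rstrip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["status"] = int(rec["status"])
    rec["size"] = int(rec["size"])
    return rec


def bytes_by_origin(lines):
    """Sum response sizes per XFF value across an iterable of log lines."""
    totals = Counter()
    for line in lines:
        rec = parse_get(line)
        if rec is not None:
            totals[rec["xff"]] += rec["size"]
    return totals


SAMPLE = (
    "Jun 28 22:36:43 debuginfod01 debuginfod[381551]: "
    "[Mon 28 Jun 2021 10:36:43 PM GMT] (381551/2413727): "
    "10.16.163.75:43776 UA:elfutils/0.185,Linux/x86_64,fedora/35 "
    "XFF:*elided* GET "
    "/buildid/90910c1963bbcf700c0c0c06ee3bf4c5cc831d3a/debuginfo "
    "200 335440 0+0ms"
)

for origin, total in bytes_by_origin([SAMPLE]).most_common(10):
    print(origin, total)
```

Feeding it the lines of a day's `messages.log` gives a quick list of the heaviest origins, which can then be checked by hand before any iptables block is applied.
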
The lines related to Prometheus /metrics scrapes are routine and can
usually be ignored.

The log also includes info about errors and indexing progress. Lines
like the following may be of interest:

....
Jun 28 22:36:43 debuginfod01 debuginfod[381551]: [Mon 28 Jun 2021 10:36:43 PM GMT] (381551/2413727): serving fdcache archive /mnt/fedora_koji_prod/koji/packages/valgrind/3.17.0/3.fc35/x86_64/valgrind-3.17.0-3.fc35.x86_64.rpm file /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so
....

These lines identify the file names derived from requests (that is,
which RPMs the buildids map to). They can provide some indirect distro
telemetry: what packages and binaries are being debugged, and for which
architectures?

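A minimal sketch of extracting that telemetry in Python follows. It assumes the koji path layout seen in the sample line (`.../packages/<name>/<version>/<release>/<arch>/<file>.rpm`) and reads the package, release tag and architecture from the directory components. Note that the directory name is the koji package name, which may differ from the binary RPM name. The function name and the returned keys are hypothetical.

```python
import re
from collections import Counter

SERVE_RE = re.compile(r"serving fdcache archive (?P<rpm>\S+) file (?P<file>.+)$")


def parse_serving(line):
    """Extract package, dist tag, arch and file from a 'serving' log line."""
    m = SERVE_RE.search(line.rstrip())
    if m is None:
        return None
    parts = m.group("rpm").split("/")
    if len(parts) < 6:
        return None  # path does not look like the assumed koji layout
    package, _version, release, arch = parts[-5:-1]
    dist = re.search(r"fc\d+", release)
    return {
        "package": package,
        "dist": dist.group(0) if dist else None,
        "arch": arch,
        "file": m.group("file"),
    }


SAMPLE = (
    "Jun 28 22:36:43 debuginfod01 debuginfod[381551]: "
    "[Mon 28 Jun 2021 10:36:43 PM GMT] (381551/2413727): "
    "serving fdcache archive /mnt/fedora_koji_prod/koji/packages/valgrind/"
    "3.17.0/3.fc35/x86_64/valgrind-3.17.0-3.fc35.x86_64.rpm "
    "file /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so"
)

# Count requests per (package, dist, arch) over an iterable of log lines.
counts = Counter()
for line in [SAMPLE]:
    rec = parse_serving(line)
    if rec is not None:
        counts[(rec["package"], rec["dist"], rec["arch"])] += 1
print(counts.most_common(5))
```
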