= Fedora Debuginfod Service - SOP

Debuginfod is the software behind the services at
https://debuginfod.fedoraproject.org/ and
https://debuginfod.stg.fedoraproject.org/. These services run on one VM
each in the staging and production infrastructure at RDU3.

== Contact Information

Owner:::
  RH perftools team + Fedora Infrastructure Team
Contact:::
  @fche in #fedora-noc
Servers:::
  VMs
Purpose:::
  Serve elf/dwarf/source-code debuginfo for supported releases to
  debugger-like tools in Fedora.
Repository:::
  https://sourceware.org/elfutils/Debuginfod.html
  https://fedoraproject.org/wiki/Debuginfod

== How it works

One virtual machine in prod NFS-mounts the koji build system's RPM
repository, read-only. The production VM has a virtual twin in the
staging environment. They each run elfutils debuginfod to index
designated RPMs into a large local sqlite database. They answer HTTP
queries received from users on the Internet via reverse proxies at the
https://debuginfod.fedoraproject.org/ URL. The reverse proxies apply
gzip compression to the data and redirect only the root `/` location
to the Fedora wiki.

Normally, it is autonomous and needs no maintenance. It should come back
nicely after many kinds of outage. The software is based on elfutils in
Fedora, but may occasionally track a custom COPR build with backported
patches from future elfutils versions.

== Configuration

The daemon uses systemd and `/etc/sysconfig/debuginfod` to set basic
parameters. These have been tuned from the distro defaults via
experimental hand-editing or ansible. Key parameters are:

* The -I/-X include/exclude regexes. These tell debuginfod which Fedora
versions to include RPMs for. If index disk space starts to run low, one
can drop some older Fedora releases from the index to free up space
(after the next groom cycle).
* The --fdcache related parameters. These tell debuginfod how much data
to cache from RPMs. (Some debuginfo files - kernel, llvm, gtkweb, ... -
are huge and worth retaining instead of repeatedly extracting.) This is
a straight disk space vs. time tradeoff.
* The -t (scan interval) parameter. Scanning lets an index get bigger,
as new RPMs in koji are examined and their contents indexed. Each pass
takes a bunch of hours to traverse the entire koji NFS directory
structure to fstat() everything for newness or change. A smaller scan
interval lets debuginfod react more quickly to new koji builds, but
increases load on the NFS server. More -n (scan threads) may help the
indexing process go faster, if the networking fabric & NFS server are
underloaded.
* The -g (groom interval) parameter. Grooming lets an index get smaller,
as files removed from koji will be forgotten about. It can be run very
intermittently - weekly or less - since it takes many hours and cannot
run concurrently with scanning.
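
As a rough illustration of how these options fit together, the fragment
below sketches a possible `/etc/sysconfig/debuginfod`. The variable name
`DEBUGINFOD_PATHS` is assumed from the file shipped with the distro
package and should be checked against the installed copy. Every value
shown (release regex, cache size, intervals) is a made-up placeholder,
not the production setting.

```shell
# /etc/sysconfig/debuginfod (illustrative values only, not the production settings)
# -R             index RPM archives found under the given path
# -I             include regex on file names (hypothetical: only fc41 and fc42 RPMs)
# -t / -g        scan and groom intervals in seconds (86400 = 1 day, 604800 = 7 days)
# --fdcache-mbs  disk budget, in MB, for files extracted from RPMs
DEBUGINFOD_PATHS="-R -I '[.]fc4[12][.]' -t 86400 -g 604800 --fdcache-mbs=100000 /mnt/fedora_koji_prod/koji/packages"
```

Widening or narrowing the `-I` regex is the knob that trades index disk
space against the set of releases served.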

A quick:

....
systemctl restart debuginfod
....

activates the new settings.
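
To confirm that a restart actually took effect, something like the
following sketch may help. The function name `restart_and_check` is made
up for illustration. It uses only standard `systemctl` and `journalctl`
invocations and assumes the unit is named `debuginfod`.

```shell
# Hypothetical helper: restart debuginfod, then show whether it came back up.
restart_and_check() {
    systemctl restart debuginfod || return 1
    # Give the daemon a moment to exit if it rejects one of its options.
    sleep 2
    # is-active prints the unit state and exits non-zero if it is not running.
    systemctl is-active debuginfod || return 1
    # Show the 20 most recent journal lines for the unit.
    journalctl -u debuginfod -n 20 --no-pager
}
```

Run `restart_and_check` as root on the debuginfod VM after editing the
sysconfig file.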

In case of some drastic failure like database corruption or signs of
penetration/abuse, one can shut down the server with systemd, and/or
stop traffic at the incoming proxy configuration level. The index sqlite
database under `/var/cache/debuginfod` may be deleted, if necessary, but
keep in mind that it takes days to reindex the relevant parts of koji.
Alternatively, with the services stopped, the 150GB+ sqlite database files
may be freely copied between the staging and production servers, if that
helps during disaster recovery.

== Monitoring

=== Prometheus

The debuginfod daemons answer the standard /metrics URL endpoint to
serve a variety of operational metrics in Prometheus format. Important
metrics include:

* filesys_free_ratio - free space on the filesystems. (These are also
monitored via fedora-infra nagios.) If the free space on the database or
tmp partition falls low, further indexing or even service may be
impacted. Add more disk space if possible, or start eliding older Fedora
versions from the database via the -I/-X daemon options.
* thread_busy - number of busy threads. During indexing, 1-6 threads may
be busy for minutes or even days, intermittently. User requests show up
as "buildid" (real request) or "buildid-after-you" (deferred duplicate
request) labels. If there are more than a handful of "buildid" ones,
there may be an overload/abuse underway, in which case it's time to
identify the excessive traffic via the logs and get a temporary iptables
block going. Or perhaps there is an outage or slowdown of the koji NFS
storage system, in which case there's not much to do.
* error_count. These should be zero or near zero all the time.
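
A quick way to eyeball these metrics from a shell is sketched below. The
function names are invented for illustration. Port 8002 is debuginfod's
default and may differ on the actual hosts, and the exact label text on
`thread_busy` should be checked against real output.

```shell
# Keep only the three metric families discussed above (Prometheus text on stdin).
key_metrics() {
    grep -E '^(filesys_free_ratio|thread_busy|error_count)[{ ]'
}

# Sum thread_busy samples whose label value ends in "buildid", which
# deliberately skips the "buildid-after-you" deferred duplicates.
busy_buildid_threads() {
    awk '/^thread_busy[{][^}]*buildid"/ { n += $NF } END { print n + 0 }'
}

# Typical use on the debuginfod VM:
#   curl -s http://localhost:8002/metrics | key_metrics
#   curl -s http://localhost:8002/metrics | busy_buildid_threads
```

A `busy_buildid_threads` result that stays well above a handful is the
cue to look at the logs for excessive traffic.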

=== Logs

The debuginfod daemons produce voluminous logs into the local systemd
journal, whence the traffic moves to the usual fedora-infra log01
server, `/var/log/hosts/debuginfod*/YYYY/MM/DD/messages.log`. The lines
related to HTTP GET identify the main webapi traffic, with originating
IP addresses in the XFF: field, and response size and elapsed service
time in the last columns. These can be useful in tracking down possible
abuse:

....
Jun 28 22:36:43 debuginfod01 debuginfod[381551]: [Mon 28 Jun 2021 10:36:43 PM GMT] (381551/2413727): 10.16.163.75:43776 UA:elfutils/0.185,Linux/x86_64,fedora/35 XFF:*elided* GET /buildid/90910c1963bbcf700c0c0c06ee3bf4c5cc831d3a/debuginfo 200 335440 0+0ms
....
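
When hunting for abusive clients, a per-client summary of such lines can
be a useful first step. The sketch below assumes the line layout shown
above, with the XFF value as a single whitespace-free token and the
response size as the second-to-last field. `xff_summary` is a made-up
name.

```shell
# Print "requests total_bytes xff_value" per client, busiest first.
# Reads log lines on stdin; only /buildid/ GET requests are counted.
xff_summary() {
    awk '/ GET \/buildid\// {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^XFF:/) {
                xff = substr($i, 5)        # drop the "XFF:" prefix
                hits[xff]++
                bytes[xff] += $(NF - 1)    # response size column
            }
    }
    END { for (x in hits) print hits[x], bytes[x], x }' | sort -rn
}

# Typical use on log01 (host and date here are only an example):
#   xff_summary < /var/log/hosts/debuginfod01/2021/06/28/messages.log | head
```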

The lines related to prometheus /metrics are usually no big deal.

The log also includes info about errors and indexing progress.
Lines like the following may be of interest:

....
Jun 28 22:36:43 debuginfod01 debuginfod[381551]: [Mon 28 Jun 2021 10:36:43 PM GMT] (381551/2413727): serving fdcache archive /mnt/fedora_koji_prod/koji/packages/valgrind/3.17.0/3.fc35/x86_64/valgrind-3.17.0-3.fc35.x86_64.rpm file /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so
....

which identify the file names derived from requests (which RPMs the
buildids map to). These can provide some indirect distro telemetry: what
packages and binaries are being debugged and for which architectures?
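
One possible way to turn those lines into a rough popularity list is
sketched below. It assumes the "serving fdcache archive PATH file NAME"
layout shown above. `served_rpms` is a hypothetical name.

```shell
# Print "count rpm_file_name" for each RPM named in
# "serving fdcache archive" lines on stdin, most frequent first.
served_rpms() {
    awk '/serving fdcache archive / {
        for (i = 1; i < NF; i++)
            if ($i == "archive") {
                n = split($(i + 1), parts, "/")   # path follows the word "archive"
                count[parts[n]]++                 # keep only the RPM file name
                break
            }
    }
    END { for (r in count) print count[r], r }' | sort -rn
}

# Typical use on log01 (host and date here are only an example):
#   served_rpms < /var/log/hosts/debuginfod01/2021/06/28/messages.log | head -20
```

The architecture is part of the RPM file name, so the same output also
hints at which architectures are being debugged.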