Added the infra SOPs ported to asciidoc.
parent
8a7f111a12
commit
a0301e30f1
148 changed files with 18575 additions and 17 deletions
133
modules/sysadmin_guide/pages/debuginfod.adoc
Normal file
@@ -0,0 +1,133 @@
= Fedora Debuginfod Service - SOP

Debuginfod is the software behind the services at
https://debuginfod.fedoraproject.org/ (prod) and
https://debuginfod.stg.fedoraproject.org/ (stg). Each service runs on one VM
in the corresponding infrastructure at IAD2.

== Contact Information

Owner:::
RH perftools team + Fedora Infrastructure Team
Contact:::
@fche in #fedora-noc
Servers:::
VMs
Purpose:::
Serve ELF/DWARF/source-code debuginfo for supported releases to
debugger-like tools in Fedora.
Repository:::
https://sourceware.org/elfutils/Debuginfod.html
https://fedoraproject.org/wiki/Debuginfod

== How it works

One virtual machine in prod NFS-mounts the koji build system's RPM
repository, read-only. The production VM has a virtual twin in the
staging environment. They each run elfutils debuginfod to index
designated RPMs into a large local sqlite database. They answer HTTP
queries received from users on the Internet via reverse proxies at the
https://debuginfod.fedoraproject.org/ URL. The reverse proxies apply
gzip compression to the data and redirect the root `/` location, and
only that location, to the Fedora wiki.

Normally, the service is autonomous and needs no maintenance. It should
come back cleanly after many kinds of outage. The software is based on
elfutils in Fedora, but may occasionally track a custom COPR build with
backported patches from future elfutils versions.

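To make the HTTP query shape concrete, here is a minimal Python sketch that only builds request URLs; it does not contact the service. The `/buildid/BUILDID/debuginfo` form matches the GET line shown in the Logs section of this SOP. The `executable` and `source` forms are taken from the elfutils debuginfod webapi and should be checked against its documentation. The function name `buildid_url` is hypothetical.

```python
from urllib.parse import quote

BASE_URL = "https://debuginfod.fedoraproject.org"  # production URL from this SOP


def buildid_url(buildid, artifact="debuginfo", source_path=None, base=BASE_URL):
    """Return the webapi URL for one artifact of one build-id (no network I/O)."""
    if artifact not in ("debuginfo", "executable", "source"):
        raise ValueError(f"unknown artifact type: {artifact}")
    buildid = buildid.lower()
    if not buildid or any(c not in "0123456789abcdef" for c in buildid):
        raise ValueError("build-id must be a hex string")
    url = f"{base.rstrip('/')}/buildid/{buildid}/{artifact}"
    if artifact == "source":
        # Source requests carry the absolute source file path after /source.
        if not source_path or not source_path.startswith("/"):
            raise ValueError("source requests need an absolute source path")
        url += quote(source_path)  # quote() leaves '/' unescaped by default
    return url


# Build-id copied from the sample GET line in the Logs section.
print(buildid_url("90910c1963bbcf700c0c0c06ee3bf4c5cc831d3a"))
```

Pointing `base` at the staging URL gives the equivalent staging request.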
== Configuration

The daemon uses systemd and `/etc/sysconfig/debuginfod` to set basic
parameters. These have been tuned from the distro defaults via
experimental hand-editing or ansible. Key parameters are:

[arabic]
. The -I/-X include/exclude regexes. These tell debuginfod which Fedora
versions to include RPMs for. If index disk space starts to run low, one
can drop some older Fedora releases from the index to free up space
(the space is reclaimed after the next groom cycle).
. The --fdcache related parameters. These tell debuginfod how much data
to cache from RPMs. (Some debuginfo files, such as kernel, llvm, gtkweb, ...,
are huge and worth retaining instead of repeatedly extracting.) This is
a straight disk space vs. time tradeoff.
. The -t (scan interval) parameter. Scanning lets an index get bigger,
as new RPMs in koji are examined and their contents indexed. Each pass
takes several hours to traverse the entire koji NFS directory
structure and fstat() everything for newness or change. A smaller scan
interval lets debuginfod react more quickly to new koji builds,
but increases load on the NFS server. More -c (scan concurrency) threads
may help the indexing process go faster, if the networking fabric and NFS
server are underloaded.
. The -g (groom interval) parameter. Grooming lets an index get smaller,
as files removed from koji are forgotten. It can be run very
infrequently - weekly or less - since it takes many hours and cannot
run concurrently with scanning.

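The effect of the -I/-X regexes on a given RPM can be sketched as follows. This is an illustration, not the daemon's code. The regex shape `\.fc(NN|NN)\.`, the release numbers, and the helper names are assumptions; the values actually deployed in `/etc/sysconfig/debuginfod` may differ. Python's `re` is used as a stand-in for the POSIX extended regexes that debuginfod evaluates, unanchored, against each file's full path.

```python
import re


def include_regex(releases):
    """Build an unanchored regex matching RPM file names of the given Fedora releases."""
    alternatives = "|".join(str(r) for r in sorted(releases))
    return rf"\.fc({alternatives})\."


def is_indexed(path, include, exclude=None):
    """Approximate the include/exclude decision: include must match, and a
    file matching both include and exclude is excluded."""
    if re.search(include, path) is None:
        return False
    return exclude is None or re.search(exclude, path) is None


# RPM path taken from the fdcache log sample in this SOP.
rpm = ("/mnt/fedora_koji_prod/koji/packages/valgrind/3.17.0/3.fc35/"
       "x86_64/valgrind-3.17.0-3.fc35.x86_64.rpm")

print(is_indexed(rpm, include_regex([34, 35])))  # fc35 is in the list
print(is_indexed(rpm, include_regex([36, 37])))  # fc35 has been dropped
```

Dropping a release from the list passed to `include_regex` mirrors eliding an older Fedora from the index: its RPMs stop matching and are no longer candidates for scanning.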
After changing settings, a quick:

....
systemctl restart debuginfod
....

activates the new settings.

In case of some drastic failure like database corruption or signs of
penetration/abuse, one can shut down the server with systemd, and/or
stop traffic at the incoming proxy configuration level. The index sqlite
database under `/var/cache/debuginfod` may be deleted, if necessary, but
keep in mind that it takes days to reindex the relevant parts of koji.
Alternatively, with the services stopped, the 150GB+ sqlite database files
may be freely copied between the staging and production servers, if that
helps during disaster recovery.

== Monitoring

=== Prometheus

The debuginfod daemons answer the standard /metrics URL endpoint to
serve a variety of operational metrics in Prometheus format. Important
metrics include:

[arabic]
. filesys_free_ratio - free space on the filesystems. (These are also
monitored via fedora-infra nagios.) If the free space on the database or
tmp partition runs low, further indexing or even service may be
impacted. Add more disk space if possible, or start eliding older Fedora
versions from the database via the -I/-X daemon options.
. thread_busy - number of busy threads. During indexing, 1-6 threads may
be busy for minutes or even days, intermittently. User requests show up
under the "buildid" (real request) or "buildid-after-you" (deferred duplicate
request) labels. If more than a handful of "buildid" threads are busy,
an overload or abuse may be underway, in which case it's time to
identify the excessive traffic via the logs and put a temporary iptables
block in place. Alternatively, there may be an outage or slowdown of the
koji NFS storage system, in which case there's not much to do.
. error_count. These should be zero or near zero all the time.

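The checks described in the list above can be sketched as a small parser for the Prometheus text format. This is a rough sketch under stated assumptions: the label names (`purpose`, `role`, `libc`), the sample values, and both thresholds are illustrative, not taken from the live service. The parser is deliberately simple and ignores timestamps and label values containing `}`.

```python
import re

SAMPLE_LINE = re.compile(
    r"^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)"   # metric name
    r"(?:\{(?P<labels>[^}]*)\})?"            # optional {label="value",...}
    r"\s+(?P<value>\S+)$"                    # sample value
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from /metrics text."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE_LINE.match(line)
        if m is None:
            continue
        labels = dict(LABEL.findall(m.group("labels") or ""))
        samples.append((m.group("name"), labels, float(m.group("value"))))
    return samples


def warnings_for(samples, min_free=0.10, max_buildid_threads=5):
    """Apply the three checks from this SOP; both thresholds are hypothetical."""
    out = []
    for name, labels, value in samples:
        tag = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
        is_request = any(v.endswith("buildid") for v in labels.values())
        if name == "filesys_free_ratio" and value < min_free:
            out.append(f"low free space ({tag}): {value:.2f}")
        elif name == "thread_busy" and is_request and value > max_buildid_threads:
            out.append(f"many busy request threads ({tag}): {value:.0f}")
        elif name == "error_count" and value > 0:
            out.append(f"errors reported ({tag}): {value:.0f}")
    return out


# Invented sample input, for illustration only.
SAMPLE = """\
# comment lines are skipped
filesys_free_ratio{purpose="database"} 0.04
filesys_free_ratio{purpose="tmpdir"} 0.60
thread_busy{role="http-buildid"} 9
thread_busy{role="http-buildid-after-you"} 2
thread_busy{role="scan"} 4
error_count{libc="No such file or directory"} 0
"""

for warning in warnings_for(parse_metrics(SAMPLE)):
    print(warning)
```

In practice the text would come from the daemon's /metrics endpoint rather than an inline string, and alerting would normally live in the existing monitoring stack rather than a script like this.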
=== Logs

The debuginfod daemons produce voluminous logs into the local systemd
journal, from where they are forwarded to the usual fedora-infra log01
server, under `/var/log/hosts/debuginfod*/YYYY/MM/DD/messages.log`. The lines
related to HTTP GET identify the main webapi traffic, with originating
IP addresses in the XFF: field, and response size and elapsed service
time in the last columns. These can be useful in tracking down possible
abuse:

....
Jun 28 22:36:43 debuginfod01 debuginfod[381551]: [Mon 28 Jun 2021 10:36:43 PM GMT] (381551/2413727): 10.3.163.75:43776 UA:elfutils/0.185,Linux/x86_64,fedora/35 XFF:*elided* GET /buildid/90910c1963bbcf700c0c0c06ee3bf4c5cc831d3a/debuginfo 200 335440 0+0ms
....

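When hunting for abusive traffic, GET lines like the one above can be summarized per originating address. The following is a minimal sketch, assuming every GET line follows the layout of the sample (UA:, XFF:, request path, status, size, elapsed time as the trailing fields). The helper names are hypothetical.

```python
import re
from collections import Counter

GET_LINE = re.compile(
    r"UA:(?P<ua>.*?) XFF:(?P<xff>.*?) GET (?P<path>\S+) "
    r"(?P<status>\d{3}) (?P<size>\d+) (?P<elapsed>\S+)$"
)


def parse_get(line):
    """Return a dict of fields for a webapi GET line, or None for other lines."""
    m = GET_LINE.search(line.rstrip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["status"] = int(rec["status"])
    rec["size"] = int(rec["size"])
    return rec


def top_clients(lines, n=10):
    """Count GET requests per XFF value and return the n busiest."""
    counts = Counter()
    for line in lines:
        rec = parse_get(line)
        if rec is not None:
            counts[rec["xff"]] += 1
    return counts.most_common(n)


# The sample line from this SOP (XFF value already elided in the source).
SAMPLE = (
    "Jun 28 22:36:43 debuginfod01 debuginfod[381551]: "
    "[Mon 28 Jun 2021 10:36:43 PM GMT] (381551/2413727): 10.3.163.75:43776 "
    "UA:elfutils/0.185,Linux/x86_64,fedora/35 XFF:*elided* "
    "GET /buildid/90910c1963bbcf700c0c0c06ee3bf4c5cc831d3a/debuginfo "
    "200 335440 0+0ms"
)

print(parse_get(SAMPLE))
print(top_clients([SAMPLE, "unrelated log line"]))
```

Feeding `top_clients` the lines of a day's `messages.log` gives a quick ranking of the heaviest requesters, which can then be cross-checked before any iptables block is applied.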
The lines related to Prometheus /metrics scrapes are routine and can usually be ignored.

The log also includes information about errors and indexing progress.
Lines like the following may be of interest:

....
Jun 28 22:36:43 debuginfod01 debuginfod[381551]: [Mon 28 Jun 2021 10:36:43 PM GMT] (381551/2413727): serving fdcache archive /mnt/fedora_koji_prod/koji/packages/valgrind/3.17.0/3.fc35/x86_64/valgrind-3.17.0-3.fc35.x86_64.rpm file /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so
....

These lines identify the file names derived from requests (that is, which
RPMs the buildids map to). They can provide some indirect distro telemetry:
what packages and binaries are being debugged, and for which architectures?
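As an illustration of that telemetry, the package and architecture can be pulled out of a "serving fdcache archive" line. This sketch assumes the koji path layout visible in the sample, `.../packages/NAME/VERSION/RELEASE/ARCH/FILE.rpm`, and that neither path contains spaces. The function names are hypothetical.

```python
import re
from collections import Counter

FDCACHE_LINE = re.compile(
    r"serving fdcache archive (?P<rpm>\S+) file (?P<file>\S+)$"
)


def parse_fdcache(line):
    """Return package/version/release/arch/file for an fdcache line, else None."""
    m = FDCACHE_LINE.search(line.rstrip())
    if m is None:
        return None
    parts = m.group("rpm").split("/")
    if "packages" not in parts:
        return None
    i = parts.index("packages")
    if len(parts) < i + 6:  # need NAME/VERSION/RELEASE/ARCH/FILE after it
        return None
    name, version, release, arch = parts[i + 1:i + 5]
    return {"package": name, "version": version, "release": release,
            "arch": arch, "file": m.group("file")}


def debugged_packages(lines):
    """Count served files per (package, arch) pair."""
    counts = Counter()
    for line in lines:
        rec = parse_fdcache(line)
        if rec is not None:
            counts[(rec["package"], rec["arch"])] += 1
    return counts


# The sample line from this SOP.
SAMPLE = (
    "Jun 28 22:36:43 debuginfod01 debuginfod[381551]: "
    "[Mon 28 Jun 2021 10:36:43 PM GMT] (381551/2413727): "
    "serving fdcache archive /mnt/fedora_koji_prod/koji/packages/valgrind/"
    "3.17.0/3.fc35/x86_64/valgrind-3.17.0-3.fc35.x86_64.rpm "
    "file /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so"
)

print(parse_fdcache(SAMPLE))
print(debugged_packages([SAMPLE]))
```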