Added the infra SOPs ported to asciidoc.

2021-07-26 10:39:47 +02:00 · 2021-07-26 10:39:47 +02:00 · a0301e30f1
commit a0301e30f1
parent 8a7f111a12
148 changed files with 18575 additions and 17 deletions
--- a/modules/sysadmin_guide/pages/debuginfod.adoc
+++ b/modules/sysadmin_guide/pages/debuginfod.adoc
@@ -0,0 +1,133 @@
+= Fedora Debuginfod Service - SOP
+
+Debuginfod is the software that lies behind the service at
+https://debuginfod.fedoraproject.org/ and
+https://debuginfod.stg.fedoraproject.org/ . Each of these services runs
+on its own VM, one in the stg and one in the prod infrastructure at IAD2.
+
+== Contact Information
+
+Owner:::
+  RH perftools team + Fedora Infrastructure Team
+Contact:::
+  @fche in #fedora-noc
+Servers:::
+  VMs
+Purpose:::
+  Serve ELF/DWARF/source-code debuginfo for supported releases to
+  debugger-like tools in Fedora.
+Repository:::
+  https://sourceware.org/elfutils/Debuginfod.html
+  https://fedoraproject.org/wiki/Debuginfod
+
+== How it works
+
+One virtual machine in prod NFS-mounts the koji build system's RPM
+repository, read-only. The production VM has a virtual twin in the
+staging environment. They each run elfutils debuginfod to index
+designated RPMs into a large local sqlite database. They answer HTTP
+queries received from users on the Internet via reverse proxies at the
+https://debuginfod.fedoraproject.org/ URL. The reverse proxies apply
+gzip compression to the data and redirect the root `/` location only
+to the Fedora wiki.
+
+Normally, the service is autonomous and needs no maintenance. It should
+come back cleanly after many kinds of outage. The software is based on
+elfutils in Fedora, but may occasionally track a custom COPR build with
+backported patches from future elfutils versions.
+
+== Configuration
+
+The daemon uses systemd and `/etc/sysconfig/debuginfod` to set basic
+parameters. These have been tuned from the distro defaults via
+experimental hand-editing or ansible. Key parameters are:
+
+[arabic]
+. The -I/-X include/exclude regexes. These tell debuginfod which Fedora
+versions to include RPMs for. If index disk space starts to run low, one
+can eliminate some older Fedora releases from the index to free up space
+(after the next groom cycle).
+. The --fdcache related parameters. These tell debuginfod how much data
+to cache from RPMs. (Some debuginfo files - kernel, llvm, gtkweb, ... -
+are huge and worth retaining instead of repeatedly extracting.) This is
+a straight disk space vs. time tradeoff.
+. The -t (scan interval) parameter. Scanning lets an index get bigger,
+as new RPMs in koji are examined and their contents indexed. Each pass
+takes many hours to traverse the entire koji NFS directory
+structure to fstat() everything for newness or change. A smaller scan
+interval lets debuginfod react quicker to koji builds coming into
+existence, but increases load on the NFS server. More -n (scan threads)
+may help the indexing process go faster, if the networking fabric and
+NFS server are underloaded.
+. The -g (groom interval) parameter. Grooming lets an index get smaller,
+as files removed from koji will be forgotten about. It can be run very
+intermittently - weekly or less - since it takes many hours and cannot
+run concurrently with scanning.
+
+A quick:
+
+....
+systemctl restart debuginfod
+....
+
+activates the new settings.
+
+In case of some drastic failure like database corruption or signs of
+penetration/abuse, one can shut down the server with systemd, and/or
+stop traffic at the incoming proxy configuration level. The index sqlite
+database under `/var/cache/debuginfod` may be deleted, if necessary, but
+keep in mind that it takes days to reindex the relevant parts of koji.
+Alternately, with the services stopped, the 150GB+ sqlite database files
+may be freely copied between the staging and production servers, if that
+helps during disaster recovery.
+
+== Monitoring
+
+=== Prometheus
+
+The debuginfod daemons serve a variety of operational metrics in
+Prometheus format at the standard /metrics URL endpoint. Important
+metrics include:
+
+[arabic]
+. filesys_free_ratio - free space on the filesystems. (These are also
+monitored via fedora-infra nagios.) If the free space on the database or
+tmp partition falls low, further indexing or even service may be
+impacted. Add more disk space if possible, or start eliding older Fedora
+versions from the database via the -I/-X daemon options.
+. thread_busy - number of busy threads. During indexing, 1-6 threads may
+be busy for minutes or even days, intermittently. User requests show up
+as "buildid" (real request) or "buildid-after-you" (deferred duplicate
+request) labels. If there are more than a handful of "buildid" ones,
+an overload or abuse may be underway, in which case it's time to
+identify the excessive traffic via the logs and get a temporary iptables
+block going. Alternatively, there may be an outage or slowdown of the
+koji NFS storage system, in which case there's not much to do.
+. error_count - error counters. These should be zero or near zero all
+the time.
+
+=== Logs
+
+The debuginfod daemons produce voluminous logs into the local systemd
+journal, whence the traffic moves to the usual fedora-infra log01
+server, `/var/log/hosts/debuginfod*/YYYY/MM/DD/messages.log`. The lines
+related to HTTP GET identify the main webapi traffic, with originating
+IP addresses in the XFF: field, and response size and elapsed service
+time in the last columns. These can be useful in tracking down possible
+abuse:
+
+....
+Jun 28 22:36:43 debuginfod01 debuginfod[381551]: [Mon 28 Jun 2021 10:36:43 PM GMT] (381551/2413727): 10.3.163.75:43776 UA:elfutils/0.185,Linux/x86_64,fedora/35 XFF:*elided* GET /buildid/90910c1963bbcf700c0c0c06ee3bf4c5cc831d3a/debuginfo 200 335440 0+0ms
+....
+
+The lines related to prometheus /metrics are routine and can usually be
+ignored.
+
+The log also includes info about errors and indexing progress.
+Lines like the following may be of interest:
+
+....
+Jun 28 22:36:43 debuginfod01 debuginfod[381551]: [Mon 28 Jun 2021 10:36:43 PM GMT] (381551/2413727): serving fdcache archive /mnt/fedora_koji_prod/koji/packages/valgrind/3.17.0/3.fc35/x86_64/valgrind-3.17.0-3.fc35.x86_64.rpm file /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so
+....
+
+which identify the file names derived from requests (that is, which RPMs
+the buildids map to). These can provide some indirect distro telemetry:
+what packages and binaries are being debugged, and for which
+architectures?
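Note on the telemetry idea that closes debuginfod.adoc: a minimal sketch, not part of the committed file, of how the "serving fdcache archive" lines could be tallied by package and architecture. The regex is derived only from the single sample line in the SOP, and the path split assumes the koji layout `.../packages/<name>/<version>/<release>/<arch>/<rpm>`. The names `parse_serving_line` and `tally` are hypothetical helpers, and real logs may contain variations that this pattern misses.

```python
import re
from collections import Counter

# Assumed pattern, taken from the one sample log line quoted in the SOP.
SERVING_RE = re.compile(
    r"serving fdcache archive (?P<rpm>\S+\.rpm) file (?P<file>\S+)"
)


def parse_serving_line(line):
    """Return (package, arch, file) for a 'serving fdcache archive' line, else None."""
    match = SERVING_RE.search(line)
    if match is None:
        return None
    # Assumed koji layout: .../packages/<name>/<version>/<release>/<arch>/<rpm>
    parts = match.group("rpm").split("/")
    if len(parts) < 5:
        return None
    package, arch = parts[-5], parts[-2]
    return package, arch, match.group("file")


def tally(lines):
    """Count served files per (package, arch) pair, skipping unrelated lines."""
    counts = Counter()
    for line in lines:
        parsed = parse_serving_line(line)
        if parsed is not None:
            counts[parsed[:2]] += 1
    return counts


# The sample line from the SOP, reproduced verbatim.
SAMPLE = (
    "Jun 28 22:36:43 debuginfod01 debuginfod[381551]: "
    "[Mon 28 Jun 2021 10:36:43 PM GMT] (381551/2413727): "
    "serving fdcache archive /mnt/fedora_koji_prod/koji/packages/valgrind/"
    "3.17.0/3.fc35/x86_64/valgrind-3.17.0-3.fc35.x86_64.rpm "
    "file /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so"
)

if __name__ == "__main__":
    for (package, arch), count in tally([SAMPLE]).most_common():
        print(package, arch, count)
```

In practice the lines would come from the `messages.log` files on log01 rather than from a hard-coded sample. Lines whose file names contain whitespace would be truncated by this pattern.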