Put more meat onto the DNF Counting SOP skeleton

Signed-off-by: Nils Philippsen <nils@redhat.com>
2021-09-22 10:40:57 +02:00 · 2021-09-22 10:40:57 +02:00 · 791130f580
commit 791130f580
parent 46569e38c2
1 changed files with 186 additions and 6 deletions
--- a/modules/sysadmin_guide/pages/dnf-counting.adoc
+++ b/modules/sysadmin_guide/pages/dnf-counting.adoc
@@ -9,6 +9,9 @@ Owner::
  Fedora Infrastructure Team
 Contact::
  #fedora-admin, #fedora-noc
+Initiative Representatives::
+  * Nils Philippsen (nphilipp)
+  * Adam Saleh (asaleh)
 Servers::
  log01, proxy0*
 Purpose::
@@ -18,17 +21,137 @@ Repositories::
  * https://pagure.io/mirrors-countme
  * https://pagure.io/fedora-infra/ansible/blob/main/f/roles/web-data-analysis

+== What it is
+
+DNF Counting is a way for us to gather statistics about the number of Fedora
+installations, differentiated by version, spin, etc. On the infrastructure
+side this is implemented by a bunch of scripts and a Python package
+(`mirrors-countme`).
+
+== Scope
+
+This SOP concerns itself with the infrastructure side of the equation. For any
+issues with the various frontends logging in to be counted (DNF, PackageKit,
+…), contact their respective maintainers or upstreams.
+
 == How it works

-Scripts sync http log files from proxies, combine the log data,
-summarize per Fedora version and spin, and produce graphs.
+Clients (DNF, PackageKit, …) have been modified so they add a `countme`
+variable in their requests to `mirrors.fedoraproject.org` once a week. This
+ends up in our webserver log data which lets us generate usage statistics.
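
A minimal sketch of how the `countme` variable could be picked out of a web server log line. The sample line, its "combined" log format, the request path and the `extract_countme` helper are assumptions made for illustration; the real proxy log format and the parsing done in `mirrors-countme` may differ.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Hypothetical example line in Apache "combined" log format; the real proxy
# log format and the exact request path may differ.
SAMPLE_LINE = (
    '192.0.2.1 - - [22/Sep/2021:10:40:57 +0200] '
    '"GET /metalink?repo=fedora-34&arch=x86_64&countme=1 HTTP/1.1" '
    '200 1234 "-" "libdnf (Fedora 34; workstation; Linux.x86_64)"'
)

# Pull the request target out of the quoted request line.
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (?P<target>\S+) HTTP/[\d.]+"')


def extract_countme(line):
    """Return the value of the countme query parameter, or None if absent."""
    match = REQUEST_RE.search(line)
    if match is None:
        return None
    query = parse_qs(urlsplit(match.group("target")).query)
    values = query.get("countme")
    return values[0] if values else None
```

Requests without the variable simply yield `None`, so they can be skipped when counting.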

-== `mirrors-countme`
+Cron jobs are set up on `log01` which collect http log files from the various
+web proxies, combine them (accesses to different backend services including
+`mirrors.fedoraproject.org` are scattered across the proxy logs), and produce
+statistics from them. The various pieces live in a) the `mirrors-countme`
+project (Python package and related scripts to generate statistics from the
+log data) and b) shell scripts in the `web-data-analysis` role in Ansible:

-The `mirrors-countme` project creates statistics from the already
-combined log data.
+* `sync-http-logs.py` (Ansible) syncs individual log files from various hosts
+  including proxies to `log01`.
+* `combineHttpLogs.sh` (Ansible) combines the logs for the different web sites
+  which are scattered across the proxy hosts.
+* `condense-mirrorlogs.sh` & `mirrorlist.py` (Ansible) extract hosts from the
+  combined log data.
+* `countme-update.sh` (Ansible) drives `countme-update-rawdb.sh` &
+  `countme-update-totals.sh` (`mirrors-countme`) which generate statistics.

-=== Deploying Upstream Changes
+== Changes implemented in the Q3/2021 DNF Counting Initiative
+
+During the Q3/2021 DNF Counting Initiative, a number of changes were
+implemented which improved the DNF Counting backend in the areas of monitoring
+& debugging, performance & robustness.
+
+* The involved scripts send messages about state changes and errors to the
+fedora-messaging bus. State changes are e.g. start and finish of a complete
+script or of its individual steps.
+* The shell script which syncs log files from various hosts to `log01`
+(`syncHttpLogs.sh`) was reimplemented in Python (as `sync-http-logs.py`), with
+several improvements which reduced the time it takes for syncing from 6-7
+hours to little more than 30 minutes per day:
+** All log files for one date of one host are synced in one call to `rsync`.
+This greatly reduces overhead.
+
+Previously, these files had to be synced one by one because `rsync` only
+allows differing source and destination file names when syncing single files,
+and ours differ: the log files on the hosts contain their date in the name,
+whereas on `log01` they don't but are stored in directories for each date.
+
+To overcome this limitation, `sync-http-logs.py` maintains a shadow structure
+of hard links with dates in their names, and `rsync` operates on this
+structure instead. Afterwards, the synced files are linked back to "date-less"
+file names for further processing.
+** Because syncing log files from some hosts is pretty slow, several hosts are
+synced in parallel.
+* Previously, `syncHttpLogs.sh` and `combineHttpLogs.sh` were run from
+  individual cron jobs which were set to run a couple of hours apart.
+  Sometimes, this caused problems because the former wasn't finished when the
+  latter started to run (i.e. a race condition). Now, `sync-http-logs.py` and
+  `combineHttpLogs.sh` are run from one cron job to avoid this.
+* Previously, the scripts were scattered across the `web-data-analysis`,
+  `awstats` and `base` roles. All of the deployment has been consolidated into
+  the `web-data-analysis` role, and `awstats` has been removed.
+* The `mirrors-countme` Python package and scripts are packaged as RPM
+  packages in Fedora; previously they were deployed from a local clone of the
+  upstream git repository.
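
The shadow structure of hard links described in the list above can be sketched roughly as follows. This is not the code of `sync-http-logs.py`: the function names, the `<name>-<date>` naming scheme and the directory layout are assumptions for illustration, and the `rsync` call that would run between the two steps is left out.

```python
import os
import tempfile
from pathlib import Path


def link_into_shadow(dated_dir, shadow_dir, date):
    """Hard-link date-less files under dated names for rsync to operate on."""
    shadow_dir.mkdir(parents=True, exist_ok=True)
    for path in dated_dir.iterdir():
        dated_name = shadow_dir / f"{path.name}-{date}"
        if not dated_name.exists():
            os.link(path, dated_name)


def link_back(shadow_dir, dated_dir, date):
    """Hard-link dated files (e.g. freshly synced) back to date-less names."""
    dated_dir.mkdir(parents=True, exist_ok=True)
    suffix = f"-{date}"
    for path in shadow_dir.iterdir():
        if not path.name.endswith(suffix):
            continue
        plain_name = dated_dir / path.name[: -len(suffix)]
        if plain_name.exists():
            if plain_name.samefile(path):
                continue  # already points at the current inode
            plain_name.unlink()  # stale link, e.g. the synced file was replaced
        os.link(path, plain_name)
```

Because both names refer to the same inode, the shadow structure costs no extra disk space, but both directories have to live on the same file system.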
+
+== Reboot me
+
+Yes, just reboot. Or don't. There are no continuously running services,
+everything is regularly run as cronjobs.
+
+== Logs
+
+The `sync-http-logs.py` script sends relatively verbose output to syslog.
+Other than that, the closest anything comes to logs are mails sent if cronjobs
+produce (error) output and messages sent to the bus.
+
+== First steps to debug
+
+The scripts send messages with a topic prefix of `logging.stats` to the bus,
+in various stages of their operation. If anything doesn't work as it should,
+check that every step that was started has also finished, and compare run
+times between days.
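
As a rough illustration of that check, the sketch below pairs up start and finish messages and reports run times. The message layout (dictionaries with `topic` and `timestamp` keys) and the `.start`/`.finish` topic suffixes are hypothetical; check the messages actually seen on the bus for the real topic names.

```python
from datetime import datetime

PREFIX = "logging.stats."


def summarize_steps(messages):
    """Map each step to its run time, or None if it started but never finished."""
    started = {}
    durations = {}
    for msg in sorted(messages, key=lambda m: m["timestamp"]):
        topic = msg["topic"]
        if not topic.startswith(PREFIX):
            continue
        # Hypothetical topic layout: logging.stats.<step>.<state>
        step, _, state = topic[len(PREFIX):].rpartition(".")
        if state == "start":
            started[step] = msg["timestamp"]
            durations[step] = None
        elif state == "finish" and step in started:
            durations[step] = msg["timestamp"] - started.pop(step)
    return durations
```

Steps that map to `None` are the ones to look at first; the remaining durations can be compared with those of previous days.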
+
+If anything crashes, cron should have sent mails to the recipients configured
+(at least `root@fedoraproject.org`), which could also contain valuable
+information.
+
+== Ephemeral data
+
+Generated CSV reports and images are in `/var/www/html/csv-reports`, which is
+exposed on https://data-analysis.fedoraproject.org/, but they get regenerated
+with every run of the scripts.
+
+== Persistent data
+
+All combined http log data is kept on the `/fedora_stats` NFS share.  Log
+files from the proxy hosts are synced to `/var/log/hosts/<hostname>` locally,
+but these are just copies of what exists elsewhere already.
+
+== Other operational considerations
+
+The scripts only process data from the previous three days (roughly). If they
+don't run for a longer time, there might be gaps in the generated statistics
+which can be plugged by temporarily adjusting the respective settings in the
+scripts and re-running them.
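
To decide how far back the window needs to be extended temporarily, it can help to list the dates for which no data was processed. The helper below is a hypothetical sketch and not part of the scripts; how the set of already processed dates is obtained is left open.

```python
from datetime import date, timedelta


def missing_dates(processed, first, last):
    """Return all dates from first to last (inclusive) that are not in processed."""
    total_days = (last - first).days + 1
    candidates = (first + timedelta(days=n) for n in range(total_days))
    return [day for day in candidates if day not in processed]
```

For example, with data present only for 20 and 23 September 2021, the call `missing_dates({date(2021, 9, 20), date(2021, 9, 23)}, date(2021, 9, 20), date(2021, 9, 23))` returns 21 and 22 September.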
+
+== Where are the docs?
+
+Here :) and at https://pagure.io/mirrors-countme/blob/main/f/README.md
+
+== Is there data that needs to be backed up?
+
+Yes, but it's on the `/fedora_stats` file share, so it's assumed to get backed
+up regularly already.
+
+== Upgrading
+
+=== `mirrors-countme`
+
+The `mirrors-countme` shell and Python scripts create statistics from the
+already combined log data.
+
+==== Making upstream changes available

 Prerequisites: A change (bug fix or feature) is available in the `main`
 branch of `mirrors-countme`.
@@ -133,3 +256,60 @@ koji tag-build epel8-infra

 When the respective infra tag repository is updated, the new version
 should be ready to be installed/updated in our infrastructure.
+
+=== Other scripts
+
+Scripts other than what is contained in `mirrors-countme` live in the
+`web-data-analysis` role in Ansible. Simply "upgrade" them in place.
+
+=== Deployment of updates
+
+To deploy updated scripts, etc. from the Ansible repository, simply run the
+`groups/logserver.yml` playbook.
+
+To update `mirrors-countme`, run the `manual/update-packages.yml` playbook
+with `--extra-vars="package='*mirrors-countme*'"` set.
+
+== Related applications
+
+The scripts send out status messages over `fedora-messaging` with a topic
+prefix of `logging.stats`.
+
+== How is it deployed?
+
+All of this runs on `log01.iad2.fedoraproject.org` and is deployed through the
+`web-data-analysis` role and the `groups/logserver.yml` playbook.
+
+The `mirrors-countme` upstream project publishes source tarballs here:
+
+https://releases.pagure.org/mirrors-countme/
+
+These are packaged in Fedora as the `python-mirrors-countme` (SRPM) and
+`python3-mirrors-countme` (RPM) packages.
+
+Other scripts are located directly in the Fedora Infrastructure Ansible
+repository, in the `web-data-analysis` role.
+
+== Does it have any special requirements?
+
+No.
+
+== Are there any security requirements?
+
+The same as anything else that deals with log data.
+
+== Bug reports
+
+Report bugs with `mirrors-countme` at its upstream project:
+
+https://pagure.io/mirrors-countme/new_issue
+
+Anything concerning the cron jobs or other scripts should probably go into our
+Infrastructure tracker:
+
+https://pagure.io/fedora-infrastructure/new_issue
+
+== Are there any GDPR related concerns? Mechanisms to deal with PII?
+
+The same as anything else that deals with log data.