diff --git a/modules/sysadmin_guide/pages/dnf-counting.adoc b/modules/sysadmin_guide/pages/dnf-counting.adoc
index 81650c3..b5c48b4 100644
--- a/modules/sysadmin_guide/pages/dnf-counting.adoc
+++ b/modules/sysadmin_guide/pages/dnf-counting.adoc
@@ -9,6 +9,9 @@ Owner:: Fedora Infrastructure Team
 Contact:: #fedora-admin, #fedora-noc
+Initiative Representatives::
+ * Nils Philippsen (nphilipp)
+ * Adam Saleh (asaleh)
 Servers:: log01, proxy0*
 Purpose::
@@ -18,17 +21,137 @@ Repositories::
 * https://pagure.io/mirrors-countme
 * https://pagure.io/fedora-infra/ansible/blob/main/f/roles/web-data-analysis
+== What it is
+
+DNF Counting is a way for us to gather statistics about the number of Fedora
+installations, differentiated by version, spin, etc. On the infrastructure
+side, this is implemented by a set of scripts and a Python package
+(`mirrors-countme`).
+
+== Scope
+
+This SOP concerns itself with the infrastructure side of the equation. For
+any issues with the various frontends checking in to be counted (DNF,
+PackageKit, …), contact their respective maintainers or upstreams.
+
 == How it works
 
-Scripts sync http log files from proxies, combine the log data,
-summarize per Fedora version and spin, and produce graphs.
+Clients (DNF, PackageKit, …) have been modified to add a `countme` variable
+to their requests to `mirrors.fedoraproject.org` once a week. This ends up in
+our webserver log data, which lets us generate usage statistics.
 
-== `mirrors-countme`
+Cron jobs on `log01` collect http log files from the various web proxies,
+combine them (accesses to different backend services, including
+`mirrors.fedoraproject.org`, are scattered across the proxy logs), and
+produce statistics from them. The various pieces live in a) the
+`mirrors-countme` project (a Python package and related scripts to generate
+statistics from the log data) and b) shell scripts in the `web-data-analysis`
+role in Ansible:
 
-The `mirrors-countme` project creates statistics from the already
-combined log data.
+* `sync-http-logs.py` (Ansible) syncs individual log files from various
+  hosts, including proxies, to `log01`.
+* `combineHttpLogs.sh` (Ansible) combines the logs for the different web
+  sites which are scattered across the proxy hosts.
+* `condense-mirrorlogs.sh` & `mirrorlist.py` (Ansible) extract hosts from the
+  combined log data.
+* `countme-update.sh` (Ansible) drives `countme-update-rawdb.sh` &
+  `countme-update-totals.sh` (`mirrors-countme`), which generate statistics.
 
-=== Deploying Upstream Changes
+== Changes implemented in the Q3/2021 DNF Counting Initiative
+
+During the Q3/2021 DNF Counting Initiative, a number of changes were
+implemented which improved the DNF Counting backend in the areas of
+monitoring & debugging, performance, and robustness:
+
+* The involved scripts send messages about state changes and errors to the
+fedora-messaging bus. State changes are, for example, the start and finish of
+a complete script or of its individual steps.
+* The shell script which syncs log files from various hosts to `log01`
+(`syncHttpLogs.sh`) was reimplemented in Python (as `sync-http-logs.py`),
+with several improvements which reduced the time needed for syncing from 6-7
+hours to little more than 30 minutes per day:
+** All log files for one date of one host are synced in one call to `rsync`.
+This greatly reduces overhead.
++
+Previously, these files were synced one by one, because `rsync` only allows
+source and destination file names to differ when syncing a single file, and
+the names do differ here: the log files on the hosts contain their date in
+the name, while on `log01` they don't and are stored in per-date directories
+instead.
++
+To overcome this limitation, `sync-http-logs.py` maintains a shadow structure
+of hard links with dates in their names; `rsync` operates on this structure
+instead, and the files are hard-linked back to "date-less" names afterwards
+for further processing (see the sketch after this list).
+** Because syncing log files from some hosts is pretty slow, several hosts
+are synced in parallel.
+* Previously, `syncHttpLogs.sh` and `combineHttpLogs.sh` were run from
+  individual cron jobs which were set to run a couple of hours apart.
+  Sometimes this caused problems because the former wasn't finished when the
+  latter started to run (i.e. a race condition). Now, `sync-http-logs.py` and
+  `combineHttpLogs.sh` are run from one cron job to avoid this.
+* Previously, the scripts were scattered across the `web-data-analysis`,
+  `awstats` and `base` roles. All of the deployment has been consolidated
+  into the `web-data-analysis` role, and `awstats` has been removed.
+* The `mirrors-countme` Python package and scripts are packaged as RPM
+  packages in Fedora; previously, they were deployed from a local clone of
+  the upstream git repository.
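+
+The following is a heavily condensed sketch of the hard-link shadow structure
+and the parallel syncing described above. It is not the actual
+`sync-http-logs.py` code: the host names, paths and log file name patterns
+are made-up assumptions.
+
+[source,python]
+----
+import os
+import subprocess
+from concurrent.futures import ThreadPoolExecutor
+
+HOSTS = ["proxy01", "proxy02"]      # hypothetical host names
+LOGS = ["access_log", "error_log"]  # hypothetical log file names
+
+
+def sync_host_date(host, date):
+    """Sync all log files of one host for one date in one rsync call."""
+    target = f"/var/log/hosts/{host}/{date}"   # "date-less" names live here
+    shadow = f"/var/log/hosts/.shadow/{host}"  # dated hard links live here
+    os.makedirs(target, exist_ok=True)
+    os.makedirs(shadow, exist_ok=True)
+
+    for name in LOGS:
+        dated = os.path.join(shadow, f"{name}-{date}")
+        plain = os.path.join(target, name)
+        # Seed the dated name from an earlier sync so rsync can send deltas.
+        if os.path.exists(plain) and not os.path.exists(dated):
+            os.link(plain, dated)
+
+    # One rsync call per host and date instead of one per log file; all
+    # sources are on the same host, so this is a single connection.
+    sources = [f"{host}:/var/log/httpd/{name}-{date}" for name in LOGS]
+    subprocess.run(["rsync", "-a"] + sources + [shadow], check=True)
+
+    for name in LOGS:
+        dated = os.path.join(shadow, f"{name}-{date}")
+        plain = os.path.join(target, name)
+        # rsync may have replaced the inode, so refresh the date-less name.
+        if os.path.exists(dated):
+            if os.path.exists(plain):
+                os.unlink(plain)
+            os.link(dated, plain)
+
+
+def sync_date(date):
+    # Syncing some hosts is slow, so handle several of them in parallel.
+    with ThreadPoolExecutor(max_workers=4) as pool:
+        futures = [pool.submit(sync_host_date, host, date) for host in HOSTS]
+    for future in futures:
+        future.result()  # re-raise any sync errors
+----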
+
+== Reboot me
+
+Yes, just reboot. Or don't. There are no continuously running services;
+everything is run regularly as cron jobs.
+
+== Logs
+
+The `sync-http-logs.py` script sends relatively verbose output to syslog.
+Other than that, the closest anything comes to logs are the mails sent when
+cron jobs produce (error) output, and the messages sent to the bus.
+
+== First steps to debug
+
+The scripts send messages with a topic prefix of `logging.stats` to the bus
+at various stages of their operation. If anything doesn't work as it should,
+review whether every step that started also finished, and compare run times
+between days.
+
+If anything crashes, cron should have sent mails to the configured recipients
+(at least `root@fedoraproject.org`), which could also contain valuable
+information.
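+
+To follow these messages live, a small consumer along the following lines
+should work. This is a minimal sketch: only the `logging.stats` topic prefix
+is documented, so the full routing key and the queue name below are
+assumptions.
+
+[source,python]
+----
+from fedora_messaging import api, config
+
+
+def on_message(message):
+    # Print topic and body so starts/finishes and run times can be compared.
+    print(message.topic, message.body)
+
+
+config.conf.setup_logging()
+api.consume(
+    on_message,
+    bindings=[{
+        "exchange": "amq.topic",
+        "queue": "dnf-counting-debug",  # hypothetical queue name
+        "routing_keys": ["org.fedoraproject.prod.logging.stats.#"],
+    }],
+)
+----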
+
+== Ephemeral data
+
+Generated CSV reports and images are in `/var/www/html/csv-reports`, which is
+exposed at https://data-analysis.fedoraproject.org/, but they are regenerated
+with every run of the scripts.
+
+== Persistent data
+
+All combined http log data is kept on the `/fedora_stats` NFS share. Log
+files from the proxy hosts are synced to `/var/log/hosts/` locally, but these
+are just copies of what exists elsewhere already.
+
+== Other operational considerations
+
+The scripts only process data from the previous three days (roughly). If they
+don't run for a longer time, there might be gaps in the generated statistics,
+which can be plugged by temporarily adjusting the respective settings in the
+scripts and re-running them.
+
+== Where are the docs?
+
+Here :) and at https://pagure.io/mirrors-countme/blob/main/f/README.md
+
+== Is there data that needs to be backed up?
+
+Yes, but it's on the `/fedora_stats` file share, so it's assumed to get backed
+up regularly already.
+
+== Upgrading
+
+=== `mirrors-countme`
+
+The `mirrors-countme` shell and Python scripts create statistics from the
+already combined log data.
+
+==== Making upstream changes available
 
 Prerequisites: A change (bug fix or feature) is available in the `main`
 branch of `mirrors-countme`.
@@ -133,3 +256,60 @@ koji tag-build epel8-infra
 
 When the respective infra tag repository is updated, the new version should
 be ready to be installed/updated in our infrastructure.
+
+=== Other scripts
+
+Scripts other than those contained in `mirrors-countme` live in the
+`web-data-analysis` role in Ansible. Simply "upgrade" them in place.
+
+=== Deployment of updates
+
+To deploy updated scripts etc. from the Ansible repository, simply run the
+`groups/logserver.yml` playbook.
+
+To update `mirrors-countme`, run the `manual/update-packages.yml` playbook
+with `--extra-vars="package='*mirrors-countme*'"` set.
+
+== Related applications
+
+The scripts send out status messages over `fedora-messaging` with a topic
+prefix of `logging.stats`.
+
+== How is it deployed?
+
+All of this runs on `log01.iad2.fedoraproject.org` and is deployed through
+the `web-data-analysis` role and the `groups/logserver.yml` playbook.
+
+The `mirrors-countme` upstream project publishes source tarballs here:
+
+https://releases.pagure.org/mirrors-countme/
+
+These are packaged in Fedora as the `python-mirrors-countme` (SRPM) and
+`python3-mirrors-countme` (RPM) packages.
+
+Other scripts are located directly in the Fedora Infrastructure Ansible
+repository, in the `web-data-analysis` role.
+
+== Does it have any special requirements?
+
+No.
+
+== Are there any security requirements?
+
+The same as anything else that deals with log data.
+
+== Bug reports
+
+Report bugs with `mirrors-countme` at its upstream project:
+
+https://pagure.io/mirrors-countme/new_issue
+
+Anything concerning the cron jobs or other scripts should probably go into
+our Infrastructure tracker:
+
+https://pagure.io/fedora-infrastructure/new_issue
+
+== Are there any GDPR-related concerns? Mechanisms to deal with PII?
+
+The same as anything else that deals with log data.