Put more meat onto the DNF Counting SOP skeleton

Signed-off-by: Nils Philippsen <nils@redhat.com>
Nils Philippsen 2021-09-22 10:40:57 +02:00
parent 46569e38c2
commit 791130f580


@@ -9,6 +9,9 @@ Owner::
Fedora Infrastructure Team
Contact::
#fedora-admin, #fedora-noc
Initiative Representatives::
* Nils Philippsen (nphilipp)
* Adam Saleh (asaleh)
Servers::
log01, proxy0*
Purpose::
@@ -18,17 +21,137 @@ Repositories::
* https://pagure.io/mirrors-countme
* https://pagure.io/fedora-infra/ansible/blob/main/f/roles/web-data-analysis
== What it is
DNF Counting is a way for us to gather statistics about the number of Fedora
installations, differentiated by version, spin, etc. On the infrastructure
side this is implemented by a bunch of scripts and a Python package
(`mirrors-countme`).
== Scope
This SOP concerns itself with the infrastructure side of the equation. For any
issues with the various frontends logging in to be counted (DNF, PackageKit,
…), contact their respective maintainers or upstreams.
== How it works
Scripts sync http log files from proxies, combine the log data,
summarize per Fedora version and spin, and produce graphs.
Clients (DNF, PackageKit, …) have been modified so they add a `countme`
variable in their requests to `mirrors.fedoraproject.org` once a week. This
ends up in our webserver log data which lets us generate usage statistics.
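As a rough illustration of how the `countme` variable can be picked out of a logged request, here is a minimal Python sketch. It is not the code used by `mirrors-countme`; the `/metalink` path and the `repo` and `arch` parameters in the sample requests are assumptions made up for the example.

```python
from urllib.parse import parse_qs, urlsplit


def countme_value(request_target):
    """Return the value of the countme query variable, or None if it is absent."""
    query = parse_qs(urlsplit(request_target).query)
    values = query.get("countme")
    return values[0] if values else None


# Hypothetical request targets as they might appear in a proxy access log.
with_countme = "/metalink?repo=fedora-34&arch=x86_64&countme=1"
without_countme = "/metalink?repo=fedora-34&arch=x86_64"

for target in (with_countme, without_countme):
    print(target, "->", countme_value(target))
```

Only requests for which this returns a value would be relevant for counting; all other requests can be ignored.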
== `mirrors-countme`
Cron jobs are set up on `log01` which collect http log files from the various
web proxies, combine them (accesses to different backend services including
`mirrors.fedoraproject.org` are scattered across the proxy logs), and produce
statistics from them. The various pieces live in a) the `mirrors-countme`
project (Python package and related scripts to generate statistics from the
log data) and b) shell scripts in the `web-data-analysis` role in Ansible:
The `mirrors-countme` project creates statistics from the already
combined log data.
* `sync-http-logs.py` (Ansible) syncs individual log files from various hosts
including proxies to `log01`.
* `combineHttpLogs.sh` (Ansible) combines the logs for the different web sites
which are scattered across the proxy hosts.
* `condense-mirrorlogs.sh` & `mirrorlist.py` (Ansible) extract hosts from the
combined log data.
* `countme-update.sh` (Ansible) drives `countme-update-rawdb.sh` &
`countme-update-totals.sh` (`mirrors-countme`) which generate statistics.
=== Deploying Upstream Changes
== Changes implemented in the Q3/2021 DNF Counting Initiative
During the Q3/2021 DNF Counting Initiative, a number of changes were
implemented which improved the DNF Counting backend in the areas of monitoring
& debugging, performance & robustness.
* The involved scripts send messages about state changes and errors to the
fedora-messaging bus. State changes are e.g. start and finish of a complete
script or of its individual steps.
* The shell script which syncs log files from various hosts to `log01`
(`syncHttpLogs.sh`) was reimplemented in Python (as `sync-http-logs.py`), with
several improvements which reduced the time it takes for syncing from 6-7
hours to little more than 30 minutes per day:
** All log files for one date of one host are synced in one call to `rsync`.
This greatly reduces overhead.
+
Previously, these files were synced one by one because `rsync` only lets
source and destination file names differ when syncing a single file, and ours
do differ: the log files on the hosts contain their date in the name, whereas
on `log01` they don't but are stored in one directory per date.
+
To overcome this limitation, `sync-http-logs.py` maintains a shadow structure
of hard links with dates in their names, and `rsync` operates on this
structure instead. Afterwards, the synced files are linked back to "date-less"
file names for further processing.
** Because syncing log files from some hosts is pretty slow, several hosts are
synced in parallel.
* Previously, `syncHttpLogs.sh` and `combineHttpLogs.sh` were run from
individual cron jobs which were set to run a couple of hours apart.
Sometimes, this caused problems because the former wasn't finished when the
latter started to run (i.e. a race condition). Now, `sync-http-logs.py` and
`combineHttpLogs.sh` are run from one cron job to avoid this.
* Previously, the scripts were scattered across the `web-data-analysis`,
`awstats` and `base` roles. All of the deployment has been consolidated into
the `web-data-analysis` role, and `awstats` has been removed.
* The `mirrors-countme` Python package and scripts are packaged as RPM
packages in Fedora. Previously, they were deployed from a local clone of the
upstream git repository.
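The hard link shadow structure used by `sync-http-logs.py` can be sketched as follows. This is a simplified illustration, not the actual implementation: the function names, the directory layout and the `<name>-<date>` naming scheme are assumptions, and the `rsync` call itself is left out.

```python
import os
import tempfile
from pathlib import Path


def build_shadow(dated_dir, shadow_dir, date):
    """Hard-link each date-less file into shadow_dir under a dated name."""
    shadow_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for src in sorted(dated_dir.iterdir()):
        dst = shadow_dir / f"{src.name}-{date}"  # hypothetical naming scheme
        if not dst.exists():
            os.link(src, dst)
        links.append(dst)
    return links


def link_back(shadow_dir, dated_dir, date):
    """Link dated shadow files back to date-less names after a sync."""
    dated_dir.mkdir(parents=True, exist_ok=True)
    suffix = f"-{date}"
    for src in sorted(shadow_dir.iterdir()):
        if not src.name.endswith(suffix):
            continue
        dst = dated_dir / src.name[: -len(suffix)]
        if dst.exists():
            dst.unlink()  # drop the stale link before re-linking
        os.link(src, dst)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        dated = Path(tmp) / "2021-09-21"
        dated.mkdir()
        (dated / "access.log").write_text("example\n")
        shadow = Path(tmp) / "shadow"
        print([p.name for p in build_shadow(dated, shadow, "2021-09-21")])
        link_back(shadow, dated, "2021-09-21")
```

Because both names refer to the same file through hard links, no log data is duplicated on disk; a sync tool only ever sees the dated names in the shadow directory.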
== Reboot me
Yes, just reboot. Or don't. There are no continuously running services;
everything is run regularly from cron jobs.
== Logs
The `sync-http-logs.py` script sends relatively verbose output to syslog.
Other than that, the closest things to logs are the mails sent when cron jobs
produce (error) output, and the messages sent to the bus.
== First steps to debug
The scripts send messages with a topic prefix of `logging.stats` to the bus
at various stages of their operation. If anything doesn't work as it should,
check whether every step that was started also finished, and compare run times
between days.
If anything crashes, cron should have sent mails to the recipients configured
(at least `root@fedoraproject.org`), which could also contain valuable
information.
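Checking whether every started step also finished can be automated once the message topics have been collected. The sketch below assumes, purely for illustration, that topics have the form `logging.stats.<step>.start` and `logging.stats.<step>.finish`; only the `logging.stats` prefix is given by this SOP, so check the real topic names on the bus before relying on this.

```python
PREFIX = "logging.stats."  # topic prefix used by the scripts


def unfinished_steps(topics):
    """Return steps that have a start message but no finish message.

    Assumes hypothetical topics of the form
    logging.stats.<step>.start and logging.stats.<step>.finish.
    """
    started = []
    finished = set()
    for topic in topics:
        if not topic.startswith(PREFIX):
            continue
        step, _, state = topic[len(PREFIX):].rpartition(".")
        if state == "start" and step not in started:
            started.append(step)
        elif state == "finish":
            finished.add(step)
    return [step for step in started if step not in finished]


# Made-up example topics; the step names are not taken from the real scripts.
example_topics = [
    "logging.stats.sync.start",
    "logging.stats.sync.finish",
    "logging.stats.combine.start",
]
print(unfinished_steps(example_topics))
```

Any step reported here either crashed or is still running, which narrows down where to look in syslog and in the cron mails.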
== Ephemeral data
Generated CSV reports and images live in `/var/www/html/csv-reports` and are
exposed on https://data-analysis.fedoraproject.org/. They are regenerated
with every run of the scripts.
== Persistent data
All combined http log data is kept on the `/fedora_stats` NFS share. Log
files from the proxy hosts are synced to `/var/log/hosts/<hostname>` locally,
but these are just copies of what exists elsewhere already.
== Other operational considerations
The scripts only process data from the previous three days (roughly). If they
don't run for a longer time, there might be gaps in the generated statistics
which can be plugged by temporarily adjusting the respective settings in the
scripts and re-running them.
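To judge whether an outage has left a gap, it helps to work out which dates have fallen out of the processing window. The following sketch treats "roughly three days" as exactly three days, which is an assumption; the function and constant names are hypothetical and do not correspond to settings in the scripts.

```python
from datetime import date, timedelta

WINDOW_DAYS = 3  # assumption: "roughly three days" taken as exactly three


def missed_dates(last_processed, today, window_days=WINDOW_DAYS):
    """Return dates after last_processed that are older than the window."""
    window_start = today - timedelta(days=window_days)
    missed = []
    day = last_processed + timedelta(days=1)
    while day < window_start:
        missed.append(day)
        day += timedelta(days=1)
    return missed


# Example: the scripts last ran successfully for 2021-09-10 and run again
# on 2021-09-20.
gap = missed_dates(date(2021, 9, 10), date(2021, 9, 20))
print([d.isoformat() for d in gap])
```

If the result is empty, the next regular run picks up everything on its own; otherwise the listed dates are the ones that need a temporarily widened window.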
== Where are the docs?
Here :) and at https://pagure.io/mirrors-countme/blob/main/f/README.md
== Is there data that needs to be backed up?
Yes, but it's on the `/fedora_stats` file share, so it's assumed to get backed
up regularly already.
== Upgrading
=== `mirrors-countme`
The `mirrors-countme` shell and Python scripts create statistics from the
already combined log data.
==== Making upstream changes available
Prerequisites: A change (bug fix or feature) is available in the `main`
branch of `mirrors-countme`.
@@ -133,3 +256,60 @@ koji tag-build epel8-infra
When the respective infra tag repository is updated, the new version
should be ready to be installed/updated in our infrastructure.
=== Other scripts
Scripts other than what is contained in `mirrors-countme` live in the
`web-data-analysis` role in Ansible. Simply "upgrade" them in place.
=== Deployment of updates
To deploy updated scripts, etc. from the Ansible repository, simply run the
`groups/logserver.yml` playbook.
To update `mirrors-countme`, run the `manual/update-packages.yml` playbook
with `--extra-vars="package='*mirrors-countme*'"` set.
== Related applications
The scripts send out status messages over `fedora-messaging` with a topic
prefix of `logging.stats`.
== How is it deployed?
All of this runs on `log01.iad2.fedoraproject.org` and is deployed through the
`web-data-analysis` role and the `groups/logserver.yml` playbook.
The `mirrors-countme` upstream project publishes source tarballs here:
https://releases.pagure.org/mirrors-countme/
These are packaged in Fedora as the `python-mirrors-countme` (SRPM) and
`python3-mirrors-countme` (RPM) packages.
Other scripts are located directly in the Fedora Infrastructure Ansible
repository, in the `web-data-analysis` role.
== Does it have any special requirements?
No.
== Are there any security requirements?
The same as anything else that deals with log data.
== Bug reports
Report bugs with `mirrors-countme` at its upstream project:
https://pagure.io/mirrors-countme/new_issue
Anything concerning the cron jobs or other scripts should probably go into our
Infrastructure tracker:
https://pagure.io/fedora-infrastructure/new_issue
== Are there any GDPR related concerns? Mechanisms to deal with PII?
The same as anything else that deals with log data.