Put more meat onto the DNF Counting SOP skeleton
Signed-off-by: Nils Philippsen <nils@redhat.com>
|
Owner::
Fedora Infrastructure Team
Contact::
#fedora-admin, #fedora-noc
Initiative Representatives::
* Nils Philippsen (nphilipp)
* Adam Saleh (asaleh)
Servers::
log01, proxy0*
Purpose::
Repositories::
* https://pagure.io/mirrors-countme
* https://pagure.io/fedora-infra/ansible/blob/main/f/roles/web-data-analysis

== What it is

DNF Counting is a way for us to gather statistics about the number of Fedora
installations, differentiated by version, spin, etc. On the infrastructure
side, this is implemented by a number of scripts and a Python package
(`mirrors-countme`).

== Scope

This SOP concerns itself with the infrastructure side of the equation. For any
issues with the various frontends checking in to be counted (DNF, PackageKit,
…), contact their respective maintainers or upstreams.

== How it works

Clients (DNF, PackageKit, …) have been modified to add a `countme` variable to
their requests to `mirrors.fedoraproject.org` once a week. This ends up in our
webserver log data, which lets us generate usage statistics: scripts sync the
http log files from the proxies, combine the log data, summarize it per
Fedora version and spin, and produce graphs.

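For illustration, extracting the `countme` variable from an access log line
boils down to matching a query parameter. The sample line below is made up
(the real log format and user agent strings may differ in detail):

```python
import re

# Hypothetical access log line in combined format; the exact field layout
# and user agent string are illustrative, not taken from real logs.
LINE = ('203.0.113.7 - - [14/Jul/2021:00:00:01 +0000] '
        '"GET /metalink?repo=fedora-34&arch=x86_64&countme=1 HTTP/1.1" '
        '200 5042 "-" "libdnf (Fedora 34; workstation; Linux.x86_64)"')

def extract_countme(line):
    """Return the countme value from a request line, or None if absent."""
    m = re.search(r'[?&]countme=(\d+)', line)
    return int(m.group(1)) if m else None

print(extract_countme(LINE))
```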
== `mirrors-countme`

Cron jobs are set up on `log01` which collect http log files from the various
web proxies, combine them (accesses to different backend services, including
`mirrors.fedoraproject.org`, are scattered across the proxy logs), and produce
statistics from them. The various pieces live in a) the `mirrors-countme`
project (a Python package and related scripts which generate statistics from
the already combined log data) and b) shell scripts in the `web-data-analysis`
role in Ansible:

* `sync-http-logs.py` (Ansible) syncs individual log files from various hosts,
including the proxies, to `log01`.
* `combineHttpLogs.sh` (Ansible) combines the logs for the different web sites
which are scattered across the proxy hosts.
* `condense-mirrorlogs.sh` & `mirrorlist.py` (Ansible) extract hosts from the
combined log data.
* `countme-update.sh` (Ansible) drives `countme-update-rawdb.sh` &
`countme-update-totals.sh` (`mirrors-countme`), which generate the statistics.

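For illustration only (this is not the actual `combineHttpLogs.sh` logic),
combining per-proxy logs is essentially a merge of already time-sorted line
streams into one chronological stream:

```python
import heapq
from datetime import datetime

def log_timestamp(line):
    """Parse the '[14/Jul/2021:00:00:01 +0000]' timestamp out of a line."""
    raw = line.split('[', 1)[1].split(']', 1)[0]
    return datetime.strptime(raw, '%d/%b/%Y:%H:%M:%S %z')

def combine_logs(*per_proxy_lines):
    """Merge per-proxy, already time-sorted log lines by timestamp."""
    return list(heapq.merge(*per_proxy_lines, key=log_timestamp))

# Made-up miniature logs from two proxies.
proxy01 = ['a [14/Jul/2021:00:00:01 +0000] x',
           'c [14/Jul/2021:00:00:05 +0000] x']
proxy02 = ['b [14/Jul/2021:00:00:03 +0000] x']
print([l[0] for l in combine_logs(proxy01, proxy02)])  # ['a', 'b', 'c']
```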
== Changes implemented in the Q3/2021 DNF Counting Initiative

During the Q3/2021 DNF Counting Initiative, a number of changes were
implemented which improved the DNF Counting backend in the areas of
monitoring & debugging, and performance & robustness.

* The involved scripts send messages about state changes and errors to the
fedora-messaging bus. State changes are e.g. the start and finish of a
complete script or of its individual steps.
* The shell script which syncs log files from various hosts to `log01`
(`syncHttpLogs.sh`) was reimplemented in Python (as `sync-http-logs.py`), with
several improvements which reduced the time it takes for syncing from 6-7
hours to little more than 30 minutes per day:
** All log files for one date of one host are synced in one call to `rsync`.
This greatly reduces overhead.
+
Previously, these files were synced one by one because `rsync` only allows
differing file names when syncing single files, which we have: the log files
on the hosts contain their date in the name; on `log01` they don't, but are
stored in directories for each date.
+
To overcome this limitation, `sync-http-logs.py` maintains a shadow structure
of hard links with dates in their names, and `rsync` operates on this
structure instead. The links are pointed back to "date-less" file names
afterwards for further processing.
** Because syncing log files from some hosts is pretty slow, several hosts
are synced in parallel.
* Previously, `syncHttpLogs.sh` and `combineHttpLogs.sh` were run from
individual cron jobs which were set to run a couple of hours apart.
Sometimes this caused problems because the former wasn't finished when the
latter started to run (i.e. a race condition). Now, `sync-http-logs.py` and
`combineHttpLogs.sh` are run from one cron job to avoid this.
* Previously, the scripts were scattered across the `web-data-analysis`,
`awstats` and `base` roles. All of the deployment has been consolidated into
the `web-data-analysis` role, and `awstats` has been removed.
* The `mirrors-countme` Python package and scripts are packaged as RPM
packages in Fedora; previously they were deployed from a local clone of the
upstream git repository.

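The "link back" half of the hard-link shadow structure described above can be
sketched roughly like this; the paths, file naming scheme, and function name
are illustrative, not what `sync-http-logs.py` actually uses:

```python
import os
import re
import tempfile

# Illustrative naming scheme: shadow files carry the date, e.g.
# 'access_log-20210714'; the real scripts may name things differently.
DATED = re.compile(r'^(?P<name>.+)-(?P<date>\d{8})$')

def link_back(shadow_dir, dest_root):
    """After rsync has filled shadow_dir with date-carrying names, hard-link
    each file to its date-less name in a per-date directory, e.g.
    dest_root/20210714/access_log. Returns the created destination paths."""
    created = []
    for entry in sorted(os.listdir(shadow_dir)):
        m = DATED.match(entry)
        if not m:
            continue
        dest_dir = os.path.join(dest_root, m['date'])
        os.makedirs(dest_dir, exist_ok=True)
        dest = os.path.join(dest_dir, m['name'])
        if not os.path.exists(dest):
            os.link(os.path.join(shadow_dir, entry), dest)
        created.append(dest)
    return created

# Tiny demonstration with temporary directories.
with tempfile.TemporaryDirectory() as tmp:
    shadow = os.path.join(tmp, 'shadow')
    os.makedirs(shadow)
    with open(os.path.join(shadow, 'access_log-20210714'), 'w') as f:
        f.write('log data\n')
    dests = link_back(shadow, os.path.join(tmp, 'hosts'))
    print([os.path.relpath(d, tmp) for d in dests])
```

Because both names are hard links to the same inode, `rsync` updating the
dated name also updates the date-less copy without duplicating data.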
== Reboot me

Yes, just reboot. Or don't. There are no continuously running services;
everything is run regularly as cron jobs.

== Logs

The `sync-http-logs.py` script sends relatively verbose output to syslog.
Other than that, the closest anything comes to logs are mails sent when cron
jobs produce (error) output, and the messages sent to the bus.

== First steps to debug

The scripts send messages with a topic prefix of `logging.stats` to the bus
at various stages of their operation. If anything doesn't work as it should,
review whether every step that was started also finished, and compare run
times between days.

If anything crashes, cron should have sent mails to the configured recipients
(at least `root@fedoraproject.org`), which could also contain valuable
information.

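When reviewing the bus messages, the basic check is pairing up start and
finish events. A minimal sketch (the message shapes and topic names below are
made up for illustration; the real `logging.stats` topics may differ):

```python
def unfinished_steps(messages):
    """Given (topic, step) records in publication order, return the steps
    that started but never finished."""
    open_steps = set()
    for topic, step in messages:
        if topic.endswith('.start'):
            open_steps.add(step)
        elif topic.endswith('.finish'):
            open_steps.discard(step)
    return open_steps

# Hypothetical message stream: one sync completed, one combine still open.
msgs = [('logging.stats.sync.start', 'proxy01'),
        ('logging.stats.sync.finish', 'proxy01'),
        ('logging.stats.combine.start', 'httpd')]
print(unfinished_steps(msgs))  # {'httpd'}
```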
== Ephemeral data

Generated CSV reports and images are in `/var/www/html/csv-reports`, which is
exposed at https://data-analysis.fedoraproject.org/ – but they get regenerated
with every cycle of the scripts.

== Persistent data

All combined http log data is kept on the `/fedora_stats` NFS share. Log
files from the proxy hosts are synced to `/var/log/hosts/<hostname>` locally,
but these are just copies of what exists elsewhere already.

== Other operational considerations

The scripts only process data from the previous three days (roughly). If they
don't run for a longer time, there might be gaps in the generated statistics,
which can be plugged by temporarily adjusting the respective settings in the
scripts and re-running them.

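The processing window can be pictured as below; the `days_back` knob and
function name are illustrative, not the actual setting names in the scripts:

```python
from datetime import date, timedelta

def dates_to_process(days_back=3, today=None):
    """Return the dates a run would look at; temporarily widening days_back
    is how a gap in the statistics would be backfilled."""
    today = today or date.today()
    return [today - timedelta(days=n) for n in range(1, days_back + 1)]

print(dates_to_process(today=date(2021, 7, 14)))
```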
== Where are the docs?

Here :) and at https://pagure.io/mirrors-countme/blob/main/f/README.md

== Is there data that needs to be backed up?

Yes, but it's on the `/fedora_stats` file share, so it's assumed to get backed
up regularly already.

== Upgrading

=== `mirrors-countme`

The `mirrors-countme` shell and Python scripts create statistics from the
already combined log data.

==== Making upstream changes available

Prerequisite: a change (bug fix or feature) is available in the `main`
branch of `mirrors-countme`.

koji tag-build epel8-infra

When the respective infra tag repository is updated, the new version
should be ready to be installed/updated in our infrastructure.

=== Other scripts

Scripts other than what is contained in `mirrors-countme` live in the
`web-data-analysis` role in Ansible. Simply "upgrade" them in place.

=== Deployment of updates

To deploy updated scripts etc. from the Ansible repository, simply run the
`groups/logging.yaml` playbook.

To update `mirrors-countme`, run the `manual/update-packages.yml` playbook
with `--extra-vars="package='*mirrors-countme*'"` set.

== Related applications

The scripts send out status messages over `fedora-messaging` with a topic
prefix of `logging.stats`.

== How is it deployed?

All of this runs on `log01.iad2.fedoraproject.org` and is deployed through the
`web-data-analysis` role and the `groups/logserver.yml` playbook.

The `mirrors-countme` upstream project publishes source tarballs here:

https://releases.pagure.org/mirrors-countme/

These are packaged in Fedora as the `python-mirrors-countme` (SRPM) and
`python3-mirrors-countme` (RPM) packages.

Other scripts are located directly in the Fedora Infrastructure Ansible
repository, in the `web-data-analysis` role.

== Does it have any special requirements?

No.

== Are there any security requirements?

The same as anything else that deals with log data.

== Bug reports

Report bugs with `mirrors-countme` at its upstream project:

https://pagure.io/mirrors-countme/new_issue

Anything concerning the cron jobs or other scripts should probably go into our
Infrastructure tracker:

https://pagure.io/fedora-infrastructure/new_issue

== Are there any GDPR related concerns? Mechanisms to deal with PII?

The same as anything else that deals with log data.