= DNF Counting

We use DNF Counting to get statistics about the number of Fedora
installations.

== Contact Information

Owner::
Fedora Infrastructure Team
Contact::
#fedora-admin, #fedora-noc,
admin@fedoraproject.org
Servers::
log01, proxy0*
Purpose::
Give interested parties information about the number of Fedora
installations.
Repositories::
* https://github.com/fedora-infra/mirrors-countme
* https://pagure.io/fedora-infra/ansible/blob/main/f/roles/web-data-analysis

== What it is

DNF Counting is a way for us to gather statistics about the number of Fedora
installations, differentiated by version, spin, etc. On the infrastructure
side this is implemented by a bunch of scripts and a Python package
(`mirrors-countme`).

== Scope

This SOP concerns itself with the infrastructure side of the equation. For any
issues with the various frontends that check in to be counted (DNF, PackageKit,
…), contact their respective maintainers or upstreams.

== How it works

Clients (DNF, PackageKit, …) have been modified so they add a `countme`
variable to their requests to `mirrors.fedoraproject.org` once a week. This
ends up in our webserver log data, which lets us generate usage statistics.
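
As a rough illustration of how the `countme` variable can be recovered from
webserver log data, the following sketch pulls it out of a single log line. It
assumes the Apache "combined" log format, and the sample lines (addresses,
paths, user agent) are made up; the format of the real proxy logs and the
parsing done by `mirrors-countme` may differ.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumed Apache "combined" log format; the real proxy log format may differ.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) [^"]*" (?P<status>\d{3}) '
)


def countme_value(line):
    """Return the `countme` query value of a log line, or None if absent."""
    match = LOG_RE.match(line)
    if match is None:
        return None
    query = parse_qs(urlsplit(match.group("target")).query)
    values = query.get("countme")
    return values[0] if values else None


# Made-up sample lines, for illustration only.
with_flag = (
    '192.0.2.10 - - [16/Sep/2021:16:24:37 +0200] '
    '"GET /metalink?repo=fedora-34&arch=x86_64&countme=1 HTTP/1.1" 200 1234 '
    '"-" "example-client"'
)
without_flag = (
    '192.0.2.11 - - [16/Sep/2021:16:25:56 +0200] '
    '"GET /metalink?repo=fedora-34&arch=x86_64 HTTP/1.1" 200 1234 '
    '"-" "example-client"'
)

print(countme_value(with_flag))     # 1
print(countme_value(without_flag))  # None
```

Requests without the variable are still logged; they simply carry no
`countme` value.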

Cron jobs are set up on `log01` which collect HTTP log files from the various
web proxies, combine them (accesses to different backend services including
`mirrors.fedoraproject.org` are scattered across the proxy logs), and produce
statistics from them. The various pieces live in a) the `mirrors-countme`
project (Python package and related scripts to generate statistics from the
log data) and b) scripts in the `web-data-analysis` role in Ansible:

* `sync-http-logs.py` (Ansible) syncs individual log files from various hosts
including proxies to `log01`.
* `combineHttpLogs.sh` (Ansible) combines the logs for the different web sites
which are scattered across the proxy hosts.
* `condense-mirrorlogs.sh` & `mirrorlist.py` (Ansible) extract hosts from the
combined log data.
* `countme-update.sh` (Ansible) drives `countme-update-rawdb.sh` &
`countme-update-totals.sh` (`mirrors-countme`) which generate statistics.
* `countme-trim-raw` (`mirrors-countme`) trims the intermediary database file
(``raw.db``).
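
To give an idea of what trimming an intermediary SQLite database involves, here
is a minimal sketch. The table name `countme_raw`, its columns and the cutoff
handling are assumptions made for the example; this is not the implementation
of `countme-trim-raw`.

```python
import sqlite3


def trim_before(conn, cutoff_timestamp):
    """Delete raw rows older than `cutoff_timestamp` (Unix time); return count."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "DELETE FROM countme_raw WHERE timestamp < ?", (cutoff_timestamp,)
        )
    return cur.rowcount


# Demo on an in-memory database with a simplified, assumed schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE countme_raw (timestamp INTEGER, host TEXT)")
conn.executemany(
    "INSERT INTO countme_raw VALUES (?, ?)",
    [(100, "192.0.2.1"), (200, "192.0.2.2"), (300, "192.0.2.3")],
)
removed = trim_before(conn, 250)
remaining = conn.execute("SELECT COUNT(*) FROM countme_raw").fetchone()[0]
print(removed, remaining)  # 2 1
```

In SQLite, deleting rows only frees pages inside the file; actually shrinking
the file on disk additionally requires `VACUUM`. Whether the real tool does
this is not covered here.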

== Changes implemented in the Q2/2023 DNF mirrors-countme initiative

* The “traditional” statistics which were done before DNF
learned about the `countme` variable were reimplemented: Count any
individual IP sighted, no matter if with or without `countme`. This is
necessary to count systems which don’t have that feature in their DNF or
YUM and, while giving different numbers, gives us an idea how things
develop when compared to the same numbers for more modern OSes.
* The ``countme-trim-raw`` tool was implemented to trim the intermediary
database ``raw.db``, which contains necessary information gleaned from
parsing the merged log files. This database grows steadily and, now that
every individual IP sighted is counted again, quickly. Once these
data have been safely turned into the final statistics, we wanted a way to
remove them so that the local volume where the database is stored doesn’t
fill up completely.
* The project repository was cleaned up: large data files used in
integration tests were removed because they made cloning the repository
unnecessarily slow. For a couple hundred KB of code, the repository was more
than 300 MB in size. In this context, the repository was also moved from
Pagure to GitHub.
* Unused code was removed, the remaining code was refactored and condensed
to remove redundancies, and comprehensive unit tests were added so that the
barrier to contributing is lower and changes are less risky.
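
The “traditional” counting from the first item above boils down to counting
distinct client addresses per day, regardless of `countme`. A minimal sketch,
assuming the log data has already been reduced to hypothetical
`(day, ip, has_countme)` tuples (the real data model in `mirrors-countme` is
different):

```python
from collections import defaultdict


def unique_ips_per_day(records):
    """Map each day to the number of distinct client IPs seen on it."""
    seen = defaultdict(set)
    for day, ip, _has_countme in records:
        seen[day].add(ip)  # counted whether or not `countme` was sent
    return {day: len(ips) for day, ips in seen.items()}


# Made-up records: (day, client IP, request carried `countme`).
records = [
    ("2023-07-10", "192.0.2.1", True),
    ("2023-07-10", "192.0.2.1", False),
    ("2023-07-10", "192.0.2.2", False),
    ("2023-07-11", "192.0.2.1", False),
]
print(unique_ips_per_day(records))  # {'2023-07-10': 2, '2023-07-11': 1}
```

Keeping one entry per address and day is also why this mode makes the
intermediary data grow faster than the `countme`-only statistics.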

== Changes implemented in the Q3/2021 DNF Counting Initiative

During the Q3/2021 DNF Counting Initiative, a number of changes were
implemented which improved the DNF Counting backend in the areas of monitoring
& debugging, performance & robustness.

* The involved scripts send messages about state changes and errors to the
fedora-messaging bus. State changes are e.g. start and finish of a complete
script or of its individual steps.
* The shell script which syncs log files from various hosts to `log01`
(`syncHttpLogs.sh`) was reimplemented in Python (as `sync-http-logs.py`), with
several improvements which reduced the time it takes for syncing from 6-7
hours to little more than 30 minutes per day:
** All log files for one date of one host are synced in one call to `rsync`.
This greatly reduces overhead.
+
Previously, these files were synced one by one because `rsync` only allows
source and destination file names to differ when syncing single files, and
ours do differ: the log files on the hosts contain their date in the name,
while on `log01` they don't but are stored in directories for each date.
+
To overcome this limitation, `sync-http-logs.py` maintains a shadow structure
of hard links with dates in their names. `rsync` operates on this structure
instead, and the files are linked back to "date-less" file names afterwards
for further processing.
** Because syncing log files from some hosts is pretty slow, several hosts are
synced in parallel.
* Previously, `syncHttpLogs.sh` and `combineHttpLogs.sh` were run from
individual cron jobs which were set to run a couple of hours apart.
Sometimes, this caused problems because the former wasn't finished when the
latter started to run (i.e. a race condition). Now, `sync-http-logs.py` and
`combineHttpLogs.sh` are run from one cron job to avoid this.
* Previously, the scripts were scattered across the `web-data-analysis`,
`awstats` and `base` roles. All of the deployment has been consolidated into
the `web-data-analysis` role, and `awstats` has been removed.
* The `mirrors-countme` Python package and scripts are packaged as RPM
packages in Fedora. Previously, they were deployed from a local clone of the
upstream git repository.
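
The shadow structure of hard links described above can be pictured with the
following sketch. The directory layout, file names and the helper
`link_with_date` are hypothetical and only show the idea that one file is
reachable under a "date-less" name and a dated name at the same time; they are
not taken from `sync-http-logs.py`.

```python
import os
import tempfile
from pathlib import Path


def link_with_date(dateless, shadow_dir, date):
    """Hard-link `dateless` into `shadow_dir` under a name containing `date`."""
    shadow_dir.mkdir(parents=True, exist_ok=True)
    dated = shadow_dir / f"{dateless.name}-{date}"
    if not dated.exists():
        os.link(dateless, dated)  # same inode, second name
    return dated


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # "Date-less" file stored in a per-date directory (assumed layout).
    day_dir = root / "2021" / "09" / "16"
    day_dir.mkdir(parents=True)
    log = day_dir / "access_log"
    log.write_text("example line\n")

    dated = link_with_date(log, root / "shadow", "20210916")
    same_inode = os.stat(log).st_ino == os.stat(dated).st_ino
    print(dated.name, same_inode)  # access_log-20210916 True
```

Because both names point to the same inode, no log data is duplicated on disk.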

== Reboot me

Yes, just reboot. Or don't. There are no continuously running services,
everything runs regularly as cron jobs.

== Logs

The `sync-http-logs.py` script sends relatively verbose output to syslog.
Other than that, the closest anything comes to logs are mails sent if cron jobs
produce (error) output and messages sent to the bus.

== First steps to debug

The scripts send messages with a topic prefix of `logging.stats` to the bus,
in various stages of their operation. If anything doesn't work as it should,
check whether every step that was started has also finished, and compare run
times between days.
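
Checking that every started step also finished can be sketched as follows. Only
the `logging.stats` prefix comes from this document; the topic suffixes
`.start` and `.finish`, the step names and the message structure are
assumptions for illustration.

```python
def unfinished_steps(messages):
    """Return steps that were started but never reported as finished."""
    started, finished = [], set()
    for msg in messages:
        # Split "prefix.step.state" into the step part and its state.
        step, _, state = msg["topic"].rpartition(".")
        if state == "start":
            started.append(step)
        elif state == "finish":
            finished.add(step)
    return [step for step in started if step not in finished]


# Made-up messages with hypothetical topics.
messages = [
    {"topic": "logging.stats.sync.start"},
    {"topic": "logging.stats.sync.finish"},
    {"topic": "logging.stats.combine.start"},
]
print(unfinished_steps(messages))  # ['logging.stats.combine']
```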

If anything crashes, cron should have sent mails to the recipients configured
(at least `root@fedoraproject.org`), which could also contain valuable
information.

== Ephemeral data

Generated CSV reports and images are in `/var/www/html/csv-reports`, which is
exposed on https://data-analysis.fedoraproject.org/, but they get regenerated
with every run of the scripts.

== Persistent data

All combined HTTP log data is kept on the `/fedora_stats` NFS share. Log
files from the proxy hosts are synced to `/var/log/hosts/<hostname>` locally,
but these are just copies of what exists elsewhere already.

== Other operational considerations

The scripts only process data from the previous three days (roughly). If they
don't run for longer than that, there might be gaps in the generated
statistics. These can be plugged by temporarily adjusting the respective
settings in the scripts and re-running them.
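
To find out which days would need reprocessing after an outage, something like
the following sketch can help. How the scripts track processed days and which
settings control their window is not described here, so the set of processed
dates and the `lookback` parameter are purely hypothetical inputs.

```python
from datetime import date, timedelta


def missing_days(processed, today, lookback):
    """Days in the `lookback` window before `today` that were not processed."""
    window = [today - timedelta(days=n) for n in range(lookback, 0, -1)]
    return [day for day in window if day not in processed]


# Made-up example: statistics exist for these days only.
processed = {date(2021, 9, 10), date(2021, 9, 11), date(2021, 9, 15)}
gaps = missing_days(processed, today=date(2021, 9, 16), lookback=7)
print([d.isoformat() for d in gaps])
# ['2021-09-09', '2021-09-12', '2021-09-13', '2021-09-14']
```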

== Where are the docs?

Here :) and at https://github.com/fedora-infra/mirrors-countme

== Is there data that needs to be backed up?

Yes, but it's on the `/fedora_stats` file share, so it's assumed to get backed
up regularly already.

== Upgrading

=== `mirrors-countme`

The `mirrors-countme` shell and Python scripts create statistics from the
already combined log data.

==== Making upstream changes available

Prerequisites: A change (bug fix or feature) is available in the `main`
branch of `mirrors-countme`.

. Publish an upstream release
+
From a clone of the upstream repository:
+
.. In `pyproject.toml`, bump `tool.poetry.version` (e.g. to `0.1.2`) and
commit the change, e.g.:
+
....
git commit -s -m "Version 0.1.2" -- pyproject.toml
....
.. Tag the resulting commit with a GPG-signed tag:
+
....
git tag -s 0.1.2
....
.. Push both the change and the tag:
+
....
git push origin main 0.1.2
....
.. Build the source tarball and wheel (the tarball will be created as e.g.
`dist/mirrors_countme-0.1.2.tar.gz`):
+
....
poetry build
....
.. From the https://github.com/fedora-infra/mirrors-countme/tags[list of tags],
select “Create release” in the menu for the respective tag, and attach the
created tarball and wheel files to the created release.
. Update and Build the `python-mirrors-countme` Fedora Package
+
From a clone of the Fedora package repository, in the `rawhide` branch:
+
.. Bump the version in `python-mirrors-countme.spec`. No other changes
are necessary because the package uses automatic release fields and changelog.
+
.. Download the source tarball, either manually or with one of:
+
....
spectool -g python-mirrors-countme.spec
....
+
....
rpmspectool get python-mirrors-countme.spec
....
.. Upload the source tarball to the lookaside cache:
+
....
fedpkg new-sources mirrors_countme-0.1.2.tar.gz
....
.. Commit the changes to the repository, including the `sources` and
`.gitignore` files updated by `fedpkg new-sources`, e.g.:
+
....
git commit -s -m "Version 0.1.2" -- python-mirrors-countme.spec sources .gitignore
....
.. Push the changes and build:
+
....
git push && fedpkg build
....
.. For every other active Fedora and EPEL branch, fast-forward it to the
state of the `rawhide` branch, push and build, e.g.:
+
....
git checkout epel8 \
  && git merge --ff-only rawhide \
  && git push \
  && fedpkg build
....
. Submit Fedora/EPEL Package Updates
+
Either submit the update via the
https://bodhi.fedoraproject.org/updates/new[Bodhi web interface], or
from the command line in the respective checked out Fedora or EPEL
branch, e.g.:
+
....
fedpkg update --type bugfix --notes 'Put in some notes!'
....
. Tag with Infra-Tags in Koji
.. Tag the build into the respective infra candidate tag in Koji, e.g.:
+
....
koji tag-build epel8-infra-candidate python-mirrors-countme-0.1.2-1.el8
....
.. Check that the build was picked up and signed (this should take no
more than a few minutes), e.g.:
+
....
koji buildinfo python-mirrors-countme-0.1.2-1.el8
....
+
The build must be tagged with the corresponding `*-infra-stg` tag.
.. Tag the build into the respective infra production tag in Koji, e.g.:
+
....
koji tag-build epel8-infra python-mirrors-countme-0.1.2-1.el8
....

When the respective infra tag repository is updated, the new version
should be ready to be installed/updated in our infrastructure.
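
The file and build names used in the steps above all follow from the version
number. The helper below is a hypothetical convenience that only mirrors the
names from the examples in this document; the release `1` and the dist tag
`el8` are taken from those examples and will differ between builds and
branches.

```python
def release_names(version, release="1", dist="el8"):
    """Return (source tarball name, Koji build NVR) for a given version."""
    sdist = f"mirrors_countme-{version}.tar.gz"
    nvr = f"python-mirrors-countme-{version}-{release}.{dist}"
    return sdist, nvr


print(release_names("0.1.2"))
# ('mirrors_countme-0.1.2.tar.gz', 'python-mirrors-countme-0.1.2-1.el8')
```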

=== Other scripts

Scripts other than what is contained in `mirrors-countme` live in the
`web-data-analysis` role in Ansible. Simply "upgrade" them in place.

=== Deployment of updates

To deploy updated scripts, etc. from the Ansible repository, simply run the
`groups/logserver.yml` playbook.

To update `mirrors-countme`, run the `manual/update-packages.yml` playbook
with `--extra-vars="package='*mirrors-countme*'"` set.

== Related applications

The scripts send out status messages over `fedora-messaging` with a topic
prefix of `logging.stats`.

== How is it deployed?

All of this runs on `log01.rdu3.fedoraproject.org` and is deployed through the
`web-data-analysis` role and the `groups/logserver.yml` playbook.

The `mirrors-countme` upstream project publishes source tarballs with the
corresponding releases in its repository on GitHub:

https://github.com/fedora-infra/mirrors-countme/releases

These are packaged in Fedora as the `python-mirrors-countme` (SRPM) and
`python3-mirrors-countme` (RPM) packages.

Other scripts are located directly in the Fedora Infrastructure Ansible
repository, in the `web-data-analysis` role.

== Does it have any special requirements?

No.

== Are there any security requirements?

The same as anything else that deals with log data.

== Bug reports

Report bugs with `mirrors-countme` at its upstream project:

https://github.com/fedora-infra/mirrors-countme/issues/new

Anything concerning the cron jobs or other scripts should probably go into our
Infrastructure tracker:

https://pagure.io/fedora-infrastructure/new_issue

== Are there any GDPR related concerns? Mechanisms to deal with PII?

The same as anything else that deals with log data.