Improving reliability of mirrors-countme scripts
================================================

Notes on current deployment
---------------------------

For investigating and deploying, you need to be a member of the sysadmin-analysis
group. The repository with the code is at https://pagure.io/mirrors-countme/.

The deployment configuration is stored in the ansible repository and is applied by the
playbook playbooks/groups/logserver.yml, mostly through the role roles/web-data-analysis.

The scripts run on log01.iad2.fedoraproject.org. If you are a member of
sysadmin-analysis, you should be able to ssh in and have root there.
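
For example, getting onto the host looks roughly like this (a minimal sketch, assuming
your Fedora infrastructure SSH configuration is already in place):

.. code-block:: bash

   # Minimal sketch, assuming SSH access to Fedora infrastructure is already set up
   # and you are in the sysadmin-analysis group.
   ssh log01.iad2.fedoraproject.org
   sudo -i    # members of sysadmin-analysis can become root here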

There are several cron jobs responsible for running the scripts:

- syncHttpLogs - in /etc/cron.daily/, rsyncs the logs to
  /var/log/hosts/$HOST/$YEAR/$MONTH/$DAY/http
- combineHttp - in /etc/cron.d/, every day at 6, runs /usr/local/bin/combineHttpLogs.sh,
  which combines the logs from /var/log/hosts into /mnt/fedora_stats/combined-http based
  on the project. We are using /usr/share/awstats/tools/logresolvemerge.pl and it is not
  clear we are using it correctly.
- condense-mirrorlogs - in /etc/cron.d/, every day at 6, does some sort of analysis,
  possibly one of the older scripts. It seems to attempt to sort the logs again.
- countme-update - in /etc/cron.d/, every day at 9, runs two scripts:
  countme-update-rawdb.sh, which parses the logs and fills in the raw database, and
  countme-update-totals.sh, which uses the raw database to calculate the statistics.
  The results of countme-update-totals.sh are then copied to a web folder to make them
  available at https://data-analysis.fedoraproject.org/csv-reports/countme/
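
Put together, the daily flow looks roughly like the sketch below. This is only an
illustrative outline: the host name, dates, file names, option flags and intermediate
paths are made up, and the real commands live in the cron jobs and scripts deployed by
the ansible role.

.. code-block:: bash

   # 1. syncHttpLogs (cron.daily): pull the raw access logs from each host
   #    (host name and date below are illustrative)
   rsync -a proxy01.fedoraproject.org:/var/log/httpd/ \
       /var/log/hosts/proxy01.fedoraproject.org/2023/11/20/http/

   # 2. combineHttpLogs.sh (cron.d, daily at 6): merge the per-host logs per project
   /usr/share/awstats/tools/logresolvemerge.pl \
       /var/log/hosts/*/2023/11/20/http/*access.log \
       > /mnt/fedora_stats/combined-http/fedora-project/access.log

   # 3. countme-update (cron.d, daily at 9): parse the combined logs into the raw
   #    database, then aggregate it into the totals
   countme-update-rawdb.sh
   countme-update-totals.sh

   # 4. copy the resulting totals to the web folder serving
   #    https://data-analysis.fedoraproject.org/csv-reports/countme/
   cp totals.csv /path/to/csv-reports/countme/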

Notes on avenues of improvement
-------------------------------

We have several areas we need to improve:

- downloading and syncing the logs can sometimes fail or hang
- problems when combining them
- installation of the scripts, as there have been problems with updates; currently we
  just do a pull of the git repository and run pip install (roughly the flow sketched
  below)
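
For reference, the current install flow is roughly the following. The checkout path is
made up for illustration; the actual commands live in the ansible role.

.. code-block:: bash

   # Roughly how the scripts are deployed today (illustrative only; the checkout
   # path is hypothetical and nothing pins a specific release).
   git clone https://pagure.io/mirrors-countme.git /opt/mirrors-countme   # first time
   git -C /opt/mirrors-countme pull                                       # on updates
   pip install /opt/mirrors-countme

Because nothing is pinned, an update pulls in whatever is currently on the default
branch, which is likely part of why updates have been fragile.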

Notes on replacing with off-the-shelf solutions
-----------------------------------------------

As the raw data we base our statistics on are just the access logs from our
proxy servers, we might be able to find an off-the-shelf solution that could replace
our brittle scripts.

Two solutions present themselves: the ELK stack, and Loki with Promtail by Grafana.

We are already running the ELK stack on our OpenShift, but our experience so far is
that the Elasticsearch deployment is even more brittle.

We did some experiments with Loki. The technology seems promising, as it is much
simpler than the ELK stack, with the storage size looking comparable to the raw logs.
Moreover, Promtail, which does the parsing and uploading of the logs, has facilities
both to add labels to log lines that are then indexed and queryable in the database,
and to collect statistics directly from the log lines that can be scraped by
Prometheus. You can query the logs with LogQL, a language similar to PromQL.
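
As an illustration, querying for countme hits could look something like the sketch
below, using Grafana's logcli client. The label names and the match string are
hypothetical, not taken from any real configuration of ours.

.. code-block:: bash

   # Hypothetical LogQL queries via logcli; the labels ("job", "host") and the
   # "countme=" match string are illustrative only.

   # all access-log lines that contain a countme hit
   logcli query '{job="httpd-access"} |= "countme="'

   # per-host count of countme hits over the last 24 hours
   logcli query 'sum by (host) (count_over_time({job="httpd-access"} |= "countme=" [24h]))'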

We are not going to use it because:

- it doesn't deal well with historical data, so any attempt at an initial import of
  old logs is painful
- using the Promtail-generated metrics wouldn't help us with the double-counting of
  people hitting different proxy servers
- the configuration is fiddly and tricky to test
- changing a batch process into soft real-time sounds like a headache