
DNF Countme
===========

Purpose
-------

The purpose of this work was to investigate the current solution and its
bottlenecks in order to identify what needs to be done to solve the following problems:

- Storage bottleneck when creating the intermediate database file
- Operations efficiency for the infrastructure team

Goals
-----

Short Term
~~~~~~~~~~

The short term goal is to close operational gaps and remove possible technical
bottlenecks in the current solution.

- Improve intermediate db file to consume less disk space
- Weekly data generation instead of daily
- Improve operations

Long Term
~~~~~~~~~

The long term goal is to replace the current solution with an actual data-driven
application that provides end users with analytics that are as close to real time as
possible. The current solution has the following limitations:

- Graphical reports are static images served through httpd
- Manual intervention is needed to generate reports outside of the cron job schedule
- Data is not real time
- There is no way to connect third party apps such as Jupyter since there is no “data
  service”

The long term goal aims to create a data service and/or use an existing open source
solution such as Prometheus/Kafka/etc. to serve that data from an API and a web app
interface.

The API would let other apps pull and filter “real time” data instead of downloading a
SQLite db file and then parsing it into human friendly formats.

Resources
---------

- https://data-analysis.fedoraproject.org/csv-reports/countme/totals.db
- https://data-analysis.fedoraproject.org/csv-reports/mirrors/mirrorsdata-all.csv
- https://pagure.io/velociraptorizer
- https://pagure.io/brontosaurusifier/

Investigation
-------------

The investigation was about identifying possible bottlenecks in the current solution,
both technical and operational.

The Current System
~~~~~~~~~~~~~~~~~~

The current “system” is an Ansible role[1] which relies on mirrors-countme and other
scripts to do its job. Most of those scripts are executed from a cron job which
generates static images that are served through a web server.

Someone from the Fedora infrastructure team needs to run that playbook whenever any of
those tools has to run outside of the cron job schedule, which is quite limiting.

[1] - https://pagure.io/fedora-infra/ansible/blob/main/f/roles/web-data-analysis

The Intermediate Database
~~~~~~~~~~~~~~~~~~~~~~~~~

The current process is that the script generates an intermediate database file, usually
referred to as “raw.db”, from which another file (“totals.db”) is created for end
users.

The problem is that data is appended to “raw.db” for each Apache httpd log line, so the
file keeps growing and is turning into a storage problem.

One possible solution that is on the table is to purge old data from “raw.db” every time
the end user database file gets updated - for example: keep data from the last 30 days
and delete everything else.

Another option is to create weekly database files for the intermediate “raw.db”
database, using the full year and week number as the filename (for example: YYYY/01.db)
instead of appending everything to one “raw.db” file. That would allow us to archive
those files individually if needed.

Conclusions
-----------

We concluded that we have to work with the current solution as a short term goal but
should keep track of a system refactoring as a long term goal.

The `short term goal` is about removing storage bottlenecks and enhancing its operational
efficiency.

The `long term goal` is about creating a data system that will replace the current
solution entirely which may require another "arc initiative" as well.

Proposed Roadmap
----------------

SOP
~~~

The team should write an SOP for the Fedora Infrastructure team about how and where data
is generated.

The SOP document should also describe the steps required to generate “on demand” data
when a user requests it.

Data Generation
~~~~~~~~~~~~~~~

The intermediate database file, also known as “raw.db”, is generated daily through a
cron job.

The cron job should run weekly instead, because httpd logs are not “real time” and the
system can suffer occasional data loss when it runs daily.

This can be done by updating cron job file definitions in Fedora’s ansible repository:
https://pagure.io/fedora-infra/ansible/blob/main/f/roles/web-data-analysis/files

Data Cleanup
~~~~~~~~~~~~

The intermediate “raw.db” file accumulates all parsed data from the httpd logs, which is
turning into a storage problem on our log servers.

There are two possible solutions for this problem: split database files based on “week
of the year” or delete data from the intermediate database file that is older than 1
month.

Splitting Database Files
++++++++++++++++++++++++

This scheme would create a file per “week of the year” instead of a single intermediate
database file.

That would allow us to archive older files somewhere else while keeping the most recent
ones on the server (the last 4 weeks, for example).

This solution requires changes to how database files are written and the way we read
those files to generate the final database file used by end users.
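
As a rough illustration of the naming scheme, the sketch below maps a log timestamp to a
per-week file path. It assumes ISO 8601 week numbering and UTC Unix timestamps; the
``weekly_db_path`` helper and the ``base_dir`` layout are hypothetical and are not taken
from the existing scripts.

.. code-block:: python

    import datetime as dt
    from pathlib import Path


    def weekly_db_path(base_dir, timestamp):
        """Return the per-week database path for a Unix timestamp (UTC).

        Hypothetical helper: the layout <base_dir>/<ISO year>/<ISO week>.db
        is an assumption based on the YYYY/01.db example in this document.
        """
        day = dt.datetime.fromtimestamp(timestamp, tz=dt.timezone.utc).date()
        # isocalendar() yields (ISO year, ISO week number, ISO weekday).
        iso_year, iso_week, _ = day.isocalendar()
        return Path(base_dir) / f"{iso_year:04d}" / f"{iso_week:02d}.db"

Using the ISO year rather than the calendar year keeps the days around New Year in the
same file as the rest of their week, so 1 January 2021 lands in ``2020/53.db``.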

Database Cleanup
++++++++++++++++

This approach would keep using a single “raw.db” database file but add a new step
whenever the end user database file is updated.

The team would need to implement a step that would remove old data from the intermediate
database file once the final counter database file is updated.

For example: read “raw.db” -> update “counter.db” -> delete all data from “raw.db” that
is older than one month.

This approach is a bit simpler since it just needs an extra step in the existing code
instead of changing how “raw.db” files are stored and used.
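
The extra step could look roughly like the sketch below. It assumes that “raw.db” is a
SQLite file whose rows carry a Unix timestamp; the table name ``countme_raw``, the
column name ``timestamp`` and the ``purge_old_rows`` helper are placeholders for
illustration and would need to be checked against the real schema. “One month” is
approximated as 30 days.

.. code-block:: python

    import sqlite3
    import time

    # "One month" approximated as 30 days (assumption).
    RETENTION_SECONDS = 30 * 24 * 60 * 60


    def purge_old_rows(conn, now=None, table="countme_raw", column="timestamp"):
        """Delete rows older than the retention window; return rows removed.

        The table and column names are hypothetical defaults and must be
        trusted constants, since they are interpolated into the SQL text.
        """
        current = int(time.time()) if now is None else now
        cutoff = current - RETENTION_SECONDS
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                f"DELETE FROM {table} WHERE {column} < ?", (cutoff,)
            )
        # DELETE only marks pages as free; VACUUM rebuilds the file so the
        # space is actually returned to the filesystem.
        conn.execute("VACUUM")
        return cur.rowcount

The ``VACUUM`` call matters for the storage problem described above, because deleting
rows alone does not shrink a SQLite file on disk. It rewrites the whole database, so it
may be worth running it less often than the purge itself.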