Update the DNF Countme investigation

The DNF Countme investigation was open for a long time as PR
https://pagure.io/fedora-infra/arc/pull-request/25 and needed to be updated to
match the current structure of the document.

I also reviewed the document and fixed typos and spelling errors.

Signed-off-by: Michal Konečný <mkonecny@redhat.com>
Author: Michal Konečný <mkonecny@redhat.com>
Date:   2023-02-07 13:48:19 +01:00
parent ac260fa0e5
commit 402488bfdb
2 changed files with 10 additions and 9 deletions


@@ -4,10 +4,10 @@ DNF Countme
 Purpose
 -------
-The purpuse of this work was about investigating the current solution and its bottlenecks to identify what needs to be done to solve the following problems:
+The purpose of this work was to investigate the current solution and its bottlenecks, to identify what needs to be done to solve the following problems:
 
-* Storage bottleneck when creating the intermediate database file;
-* Operations efficiency for the infrastructure team;
+* Storage bottleneck when creating the intermediate database file
+* Operations efficiency for the infrastructure team
 
 Goals
 -----
@@ -26,10 +26,10 @@ Long Term
 The long term goal aims to replace the current solution with an actual data driven application to provide end users with real time analytics (as close as possible), because of the following limitations:
 
-* Graphical reports are static images served through HTTPd;
-* Manual intervention is needed to generate reports outside of the cron job schedule;
-* Data is not real time;
-* There is no way to connect third party apps suchs as Jupyter since there is no “data service”.
+* Graphical reports are static images served through httpd
+* Manual intervention is needed to generate reports outside of the cron job schedule
+* Data is not real time
+* There is no way to connect third party apps such as Jupyter since there is no “data service”
 
 The long term goal aims to create a data service and/or use an existing open source solution such as Prometheus/Kafka/etc. to serve that data from an API and a web app interface.
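To make the “data service” idea more concrete, below is a minimal sketch of serving totals over HTTP with the Python standard library, so a tool like Jupyter could pull the numbers with a single request. The database path, the “countme_totals” table name and the “weeknum”/“hits” columns are illustrative assumptions, not the verified schema:

    # Minimal read-only "data service" sketch over totals.db.
    # Table "countme_totals" and columns "weeknum"/"hits" are assumptions.
    import json
    import sqlite3
    from http.server import BaseHTTPRequestHandler, HTTPServer

    DB_PATH = "totals.db"  # assumed location of the end-user database

    class TotalsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Aggregate hits per week and return them as JSON.
            conn = sqlite3.connect(DB_PATH)
            rows = conn.execute(
                "SELECT weeknum, SUM(hits) FROM countme_totals GROUP BY weeknum"
            ).fetchall()
            conn.close()
            body = json.dumps([{"week": w, "hits": h} for w, h in rows]).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8000), TotalsHandler).serve_forever()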
@@ -62,7 +62,7 @@ The Intermediate Database
 The current process is that the script generates an intermediate database file, usually referred to as “raw.db”, from which another file (“totals.db”) is then created for end users.
-The problem is that “raw.db” data is appended for each HTTPd Apache log line which is turning into a storage problem due to the increasing growth of that file size.
+The problem is that “raw.db” data is appended for each Apache httpd log line, which is turning into a storage problem as the file size keeps growing.
 
 One possible solution that is on the table is to purge old data from “raw.db” every time the end user database file gets updated - for example: keep data from the last 30 days and delete everything else.
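A rough sketch of that purge, assuming the raw data sits in a “countme_raw” table with a UNIX-epoch “timestamp” column (both names are assumptions based on the description above, not the verified schema):

    # Sketch: drop raw.db rows older than 30 days after totals.db is rebuilt.
    # Table "countme_raw" and column "timestamp" are assumptions.
    import sqlite3
    import time

    RETENTION_DAYS = 30

    def purge_old_rows(db_path="raw.db"):
        cutoff = time.time() - RETENTION_DAYS * 24 * 60 * 60
        conn = sqlite3.connect(db_path)
        with conn:  # commit the DELETE as a single transaction
            deleted = conn.execute(
                "DELETE FROM countme_raw WHERE timestamp < ?", (cutoff,)
            ).rowcount
        conn.execute("VACUUM")  # reclaim the freed pages so the file shrinks
        conn.close()
        return deleted

Note that the VACUUM step matters for the storage problem: deleting rows only marks pages as free inside the file, it does not shrink the file on disk.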
@@ -92,7 +92,7 @@ Data Generation
 The intermediate database file, also known as “raw.db”, is generated daily through a cron job.
-The cron job should be run weekly instead, because HTTPd logs are not “real time” and the system can suffer from eventual data losses by doing it daily.
+The cron job should be run weekly instead, because httpd logs are not “real time” and the system can suffer occasional data losses when the job runs daily.
 
 This can be done by updating the cron job definitions in Fedora's ansible repository: https://pagure.io/fedora-infra/ansible/blob/main/f/roles/web-data-analysis/files
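For illustration, the schedule change itself would be a one-line edit in a cron definition; the script path and log file below are hypothetical examples, not the actual names used in the role:

    # Hypothetical /etc/cron.d entry; the daily schedule
    # 0 5 * * * root /usr/local/bin/countme-update.sh >> /var/log/countme-update.log 2>&1
    # becomes a weekly one (early Monday morning):
    0 5 * * 1 root /usr/local/bin/countme-update.sh >> /var/log/countme-update.log 2>&1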


@@ -15,6 +15,7 @@ Completed review
 .. toctree::
    :maxdepth: 1
 
+   dnf-countme/index
    pagure2gitlab/index
    mailman3/index
    pdc/index