fix parsing errors and sphinx warnings

Signed-off-by: Ryan Lerch <rlerch@redhat.com>
Ryan Lerch 2023-11-16 08:02:56 +10:00 committed by zlopez
parent 8fb9b2fdf0
commit ba720c3d77
98 changed files with 4799 additions and 4788 deletions


@@ -4,123 +4,159 @@ DNF Countme
Purpose
-------
The purpose of this work was to investigate the current solution and its
bottlenecks to identify what needs to be done to solve the following problems:
- Storage bottleneck when creating the intermediate database file
- Operations efficiency for the infrastructure team
Goals
-----
Short Term
~~~~~~~~~~
The short term goal is to close operational gaps and address possible technical
bottlenecks in the current solution.
- Improve intermediate db file to consume less disk space
- Weekly data generation instead of daily
- Improve operations
Long Term
~~~~~~~~~
The long term goal is to replace the current solution with an actual data-driven
application that provides end users with real time analytics (as close as possible),
because of the following limitations:
- Graphical reports are static images served through httpd
- Manual intervention is needed to generate reports outside of the cron job schedule
- Data is not real time
- There is no way to connect third party apps such as Jupyter since there is no “data
  service”
To address these limitations, the long term goal is to create a data service and/or use
an existing open source solution such as Prometheus/Kafka/etc. to serve that data from
an API and a web app interface.
The API would be useful for other apps to pull and filter “real time” data instead of
downloading a sqlite db file and then parsing it into human friendly formats.
Resources
---------
- https://data-analysis.fedoraproject.org/csv-reports/countme/totals.db
- https://data-analysis.fedoraproject.org/csv-reports/mirrors/mirrorsdata-all.csv
- https://pagure.io/velociraptorizer
- https://pagure.io/brontosaurusifier/
Investigation
-------------
The investigation was about identifying possible bottlenecks in the current solution,
both technical and operational.
The Current System
~~~~~~~~~~~~~~~~~~
The current “system” is an ansible role [1] which relies on mirrors-countme and other
scripts to do its job; most of those scripts are executed from a cron job which
generates static images that are served through a web server.
Someone from the Fedora infrastructure team needs to run that playbook if there is a
need to run any of those tools outside of the cron job schedule, which is quite
limiting.
[1] - https://pagure.io/fedora-infra/ansible/blob/main/f/roles/web-data-analysis
The Intermediate Database
~~~~~~~~~~~~~~~~~~~~~~~~~
The current process is that the script generates an intermediate database file, usually
referred to as “raw.db”, from which another file (“totals.db”) is created for use by
end users.
The problem is that “raw.db” data is appended for each Apache httpd log line, which is
turning into a storage problem due to the steady growth of that file.
One possible solution that is on the table is to purge old data from “raw.db” every time
the end user database file gets updated, for example: keep data from the last 30 days
and delete everything else.
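
A minimal sketch of that purge step, assuming the intermediate file is SQLite and using
illustrative table and column names (“countme_raw”, “timestamp”) rather than the actual
mirrors-countme schema:

.. code-block:: python

    import sqlite3
    import time

    RETENTION_DAYS = 30  # keep the last 30 days, per the example above

    def purge_old_rows(raw_db_path):
        """Delete intermediate rows older than the retention window."""
        cutoff = int(time.time()) - RETENTION_DAYS * 24 * 60 * 60
        conn = sqlite3.connect(raw_db_path)
        try:
            with conn:  # one transaction for the delete
                conn.execute("DELETE FROM countme_raw WHERE timestamp < ?", (cutoff,))
            # VACUUM must run outside a transaction; it rewrites the file so
            # the space freed by the delete is actually returned to disk.
            conn.execute("VACUUM")
        finally:
            conn.close()
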
Another option is to create weekly database files for the intermediate/“raw.db”
database, using the full year and week number as the filename (for example: YYYY/01.db)
instead of appending everything to one “raw.db” file; that would allow us to archive
those files individually if needed.
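
For the weekly-file option, the target filename can be derived from the ISO calendar; a
sketch of that naming scheme (paths are illustrative):

.. code-block:: python

    from datetime import date
    from pathlib import Path

    def weekly_db_path(base_dir, day):
        """Return the YYYY/WW.db path a log line dated `day` belongs to."""
        year, week, _ = day.isocalendar()  # ISO year and week number
        path = Path(base_dir) / f"{year:04d}" / f"{week:02d}.db"
        path.parent.mkdir(parents=True, exist_ok=True)
        return path

    # Example: a log line from 2023-01-02 lands in /var/lib/countme/2023/01.db
    print(weekly_db_path("/var/lib/countme", date(2023, 1, 2)))
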
Conclusions
-----------
We concluded that we have to work with the current solution as a short term goal but
should keep track of a system refactoring as a long term goal.
The `short term goal` is about removing storage bottlenecks and enhancing its
operational efficiency.
The `long term goal` is about creating a data system that will replace the current
solution entirely which may require another "arc initiative" as well.
Proposed Roadmap
----------------
SOP
~~~
The team should write an SOP for the Fedora Infrastructure team about how and where data
is generated.
The SOP document should also describe the required steps to generate “on demand” data
based on user request.
Data Generation
~~~~~~~~~~~~~~~
The intermediate database file, also known as “raw.db”, is generated daily through a
cron job.
The cron job should be run weekly instead, because httpd logs are not “real time” and
the system can suffer occasional data loss by doing it daily.
This can be done by updating cron job file definitions in Fedora's ansible repository:
https://pagure.io/fedora-infra/ansible/blob/main/f/roles/web-data-analysis/files
Data Cleanup
~~~~~~~~~~~~
The intermediate “raw.db” file aggregates all parsed data from httpd logs, which is
turning into a storage problem on our log servers.
There are two possible solutions for this problem: split database files based on “week
of the year” or delete data from the intermediate database file that is older than 1
month.
Splitting Database Files
++++++++++++++++++++++++
This scheme would create a file per “week of the year” instead of a single intermediate
database file.
That would allow us to archive older files somewhere else while keeping the most recent
ones on the server (the last 4 weeks, for example).
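
A sketch of that archiving step, assuming the weekly YYYY/WW.db layout from the previous
section and a hypothetical archive location:

.. code-block:: python

    import shutil
    from datetime import date, timedelta
    from pathlib import Path

    KEEP_WEEKS = 4  # "the last 4 weeks", per the example above

    def archive_old_weeks(base_dir, archive_dir):
        """Move weekly db files older than KEEP_WEEKS off the log server."""
        cutoff = (date.today() - timedelta(weeks=KEEP_WEEKS)).isocalendar()[:2]
        for weekly in sorted(Path(base_dir).glob("*/[0-9][0-9].db")):
            year_week = (int(weekly.parent.name), int(weekly.stem))
            if year_week < cutoff:
                dest = Path(archive_dir) / weekly.parent.name
                dest.mkdir(parents=True, exist_ok=True)
                shutil.move(str(weekly), str(dest / weekly.name))
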
This solution requires changes to how database files are written and the way we read
those files to generate the final database file used by end users.
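
On the read side, the final database could then be generated by folding each recent
weekly file in turn, for example via SQLite's ``ATTACH`` (table names are again
illustrative, and the plain copy stands in for the real aggregation):

.. code-block:: python

    import sqlite3

    def merge_weekly_files(weekly_paths, totals_db):
        """Fold each weekly intermediate file into the end user database."""
        conn = sqlite3.connect(totals_db)
        try:
            for weekly in weekly_paths:
                conn.execute("ATTACH DATABASE ? AS week", (str(weekly),))
                with conn:  # commit per weekly file
                    conn.execute("INSERT INTO countme_raw SELECT * FROM week.countme_raw")
                conn.execute("DETACH DATABASE week")
        finally:
            conn.close()
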
Database Cleanup
++++++++++++++++
This approach would keep using a single “raw.db” database file, but a new step would be
added when adding data to the end user database file.
The team would need to implement a step that would remove old data from the intermediate
database file once the final counter database file is updated.
For example: read “raw.db” -> update “counter.db” -> delete all data from “raw.db” that
is older than one month.
This approach is a bit simpler since it just needs an extra step in the existing code
instead of changing how “raw.db” files are stored and used.
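
A minimal sketch of that read -> update -> delete sequence, with the same illustrative
table and column names as the earlier sketches (a “countme_totals” table in
“counter.db” is an assumption):

.. code-block:: python

    import sqlite3
    import time

    def update_and_trim(raw_db, counter_db, keep_days=30):
        """read raw.db -> update counter.db -> delete old rows from raw.db"""
        cutoff = int(time.time()) - keep_days * 24 * 60 * 60
        conn = sqlite3.connect(counter_db)
        try:
            conn.execute("ATTACH DATABASE ? AS raw", (raw_db,))
            with conn:  # update and trim in a single transaction
                # A plain copy stands in for the real aggregation step.
                conn.execute("INSERT INTO countme_totals SELECT * FROM raw.countme_raw")
                conn.execute("DELETE FROM raw.countme_raw WHERE timestamp < ?", (cutoff,))
            conn.execute("DETACH DATABASE raw")
        finally:
            conn.close()
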