DNF Countme
===========

Purpose
-------

The purpose of this work was to investigate the current solution and its bottlenecks
in order to identify what needs to be done to solve the following problems:

- Storage bottleneck when creating the intermediate database file
- Operations efficiency for the infrastructure team

Goals
-----

Short Term
~~~~~~~~~~

The short-term goal is to close operational gaps and remove possible technical
bottlenecks in the current solution:

- Improve the intermediate db file so that it consumes less disk space
- Generate data weekly instead of daily
- Improve operations

Long Term
~~~~~~~~~

The long-term goal is to replace the current solution with an actual data-driven
application that provides end users with analytics as close to real time as possible,
because the current solution has the following limitations:

- Graphical reports are static images served through httpd
- Manual intervention is needed to generate reports outside of the cron job schedule
- Data is not real time
- There is no way to connect third-party apps such as Jupyter since there is no “data
  service”

This would mean creating a data service and/or using an existing open source solution
such as Prometheus/Kafka/etc. to serve that data from an API and a web app interface.

The API would let other apps pull and filter “real time” data instead of downloading a
sqlite db file and then parsing it into human-friendly formats.

Resources
---------

- https://data-analysis.fedoraproject.org/csv-reports/countme/totals.db
- https://data-analysis.fedoraproject.org/csv-reports/mirrors/mirrorsdata-all.csv
- https://pagure.io/velociraptorizer
- https://pagure.io/brontosaurusifier/

Investigation
-------------

The investigation focused on identifying possible bottlenecks in the current solution,
both technical and operational.

The Current System
~~~~~~~~~~~~~~~~~~

The current “system” is an ansible role [1] which relies on mirrors-countme and other
scripts to do its job. Most of those scripts are executed from a cron job which
generates static images that are served through a web server.

Someone from the Fedora infrastructure team needs to run that playbook whenever any of
those tools has to run outside of the cron job schedule, which is quite limiting.

[1] - https://pagure.io/fedora-infra/ansible/blob/main/f/roles/web-data-analysis

The Intermediate Database
~~~~~~~~~~~~~~~~~~~~~~~~~

In the current process, the script generates an intermediate database file, usually
referred to as “raw.db”. Another file, “totals.db”, is then created from it and is the
one used by end users.

The problem is that data is appended to “raw.db” for each Apache httpd log line, so the
file keeps growing and is turning into a storage problem.

One possible solution that is on the table is to purge old data from “raw.db” every time
the end user database file gets updated, for example: keep data from the last 30 days
and delete everything else.

Another option is to create weekly database files for the intermediate “raw.db”
database, using the full year and week number as the filename (for example YYYY/01.db)
instead of appending everything to one “raw.db” file. That would allow us to archive
those files individually if needed.

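
The weekly naming scheme described above can be sketched in Python. This is only a
minimal illustration, not the existing mirrors-countme code: the function name
``weekly_db_path`` and the ``raw`` base directory are hypothetical, and the sketch
assumes that “week number” means the ISO 8601 week, which the text does not specify.

.. code-block:: python

    from datetime import date
    from pathlib import Path


    def weekly_db_path(day: date, base_dir: Path = Path("raw")) -> Path:
        """Return the per-week database path (YYYY/WW.db) for a given day.

        Assumption: weeks are ISO 8601 weeks, so the year used in the path is
        the ISO year, which can differ from the calendar year near 1 January.
        """
        iso_year, iso_week, _weekday = day.isocalendar()
        # Zero-pad both parts so that file names sort chronologically.
        return base_dir / f"{iso_year:04d}" / f"{iso_week:02d}.db"

Under this assumption, 1 January 2021 belongs to ISO week 53 of 2020, so its data would
be written to ``2020/53.db`` rather than ``2021/01.db``. A different week definition
would need a different helper.
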
Conclusions
-----------

We concluded that we have to work with the current solution as a short-term goal but
should keep track of a system refactoring as a long-term goal.

The `short term goal` is about removing storage bottlenecks and enhancing operational
efficiency.

The `long term goal` is about creating a data system that will replace the current
solution entirely, which may require another "arc initiative" as well.

Proposed Roadmap
----------------

SOP
~~~

The team should write an SOP for the Fedora Infrastructure team that explains how and
where data is generated.

The SOP document should also describe the steps required to generate “on demand” data
based on user request.

Data Generation
~~~~~~~~~~~~~~~

The intermediate database file, also known as “raw.db”, is generated daily through a
cron job.

The cron job should run weekly instead: httpd logs are not “real time”, so a daily run
can occasionally miss data.

This can be done by updating the cron job file definitions in Fedora’s ansible
repository:
https://pagure.io/fedora-infra/ansible/blob/main/f/roles/web-data-analysis/files

Data Cleanup
~~~~~~~~~~~~

The intermediate “raw.db” file aggregates all parsed data from the httpd logs, which is
turning into a storage problem on our log servers.

There are two possible solutions for this problem: split the database files based on
“week of the year”, or delete data older than one month from the intermediate database
file.

Splitting Database Files
++++++++++++++++++++++++

This scheme would create a file per “week of the year” instead of a single intermediate
database file.

That would allow us to archive older files somewhere else while keeping the most recent
ones on the server (the last 4 weeks, for example).

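
Choosing which weekly files to archive could look roughly like the sketch below. It is
an illustration under stated assumptions, not part of the current tooling: the function
name ``split_for_archive`` is hypothetical, and it assumes the files follow a
zero-padded ``YYYY/WW.db`` layout so that sorting by directory and file name gives
chronological order.

.. code-block:: python

    from pathlib import Path


    def split_for_archive(db_files, keep=4):
        """Split weekly database paths into (to_archive, to_keep).

        Assumption: each path ends in YYYY/WW.db with zero-padded numbers, so
        sorting by (year directory, week stem) is chronological.
        """
        ordered = sorted(db_files, key=lambda p: (p.parent.name, p.stem))
        if keep <= 0:
            return ordered, []
        # The newest `keep` files stay on the server; the rest can be archived.
        return ordered[:-keep], ordered[-keep:]

Where the archived files go and how they are moved is left open here, as it is in the
proposal itself.
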
This solution requires changes to how database files are written and to the way we read
those files to generate the final database file used by end users.

Database Cleanup
++++++++++++++++

This approach would keep using a single “raw.db” database file, but a new step would be
added when adding data to the end user database file.

The team would need to implement a step that removes old data from the intermediate
database file once the final counter database file is updated.

For example: read “raw.db” -> update “counter.db” -> delete all data from “raw.db” that
is older than one month.

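
The extra cleanup step could be sketched as follows. This is a hedged example rather
than the actual implementation: the function name ``purge_old_rows`` is hypothetical,
and the table name ``countme_raw`` and its ``timestamp`` column (Unix epoch seconds) are
assumptions about the “raw.db” schema that should be checked against the real file.

.. code-block:: python

    import sqlite3
    import time


    def purge_old_rows(conn, max_age_days=30, now=None):
        """Delete rows older than max_age_days and return how many were removed.

        Assumptions: rows live in a table named countme_raw, and its timestamp
        column holds Unix epoch seconds.
        """
        now = time.time() if now is None else now
        cutoff = int(now) - max_age_days * 86400
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "DELETE FROM countme_raw WHERE timestamp < ?", (cutoff,)
            )
        # DELETE alone normally leaves the file size unchanged; VACUUM rebuilds
        # the database file so the freed pages are returned to the filesystem.
        conn.execute("VACUUM")
        return cur.rowcount

This step would run only after the end user database has been updated. Note that
``VACUUM`` needs free disk space for a temporary copy of the database while it runs,
which matters on a server that is already short on storage.
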
This approach is a bit simpler since it just needs an extra step in the existing code
instead of changing how “raw.db” files are stored and used.