diff --git a/docs/datanommer_datagrepper/index.rst b/docs/datanommer_datagrepper/index.rst
index 658cbfe..acf2ba2 100644
--- a/docs/datanommer_datagrepper/index.rst
+++ b/docs/datanommer_datagrepper/index.rst
@@ -34,3 +34,92 @@ Here is the list of ideas/things we looked at:
    pg_timescaledb
    pg_array_column_postgrest
    stats
+
+
+Conclusions
+-----------
+
+We have investigated different ways to improve the database storing our 180
+million messages. While we also considered replacing the datagrepper
+application itself, we concluded that doing so would have consequences that
+are too far-reaching. A number of applications in our realm rely on
+datagrepper's API, and an unknown number of applications outside our realm
+make use of it as well. Breaking all of these applications is a non-goal for
+us. For this reason we focused on PostgreSQL first.
+
+We looked at different solutions, starting with manually partitioning by year,
+then by ``id`` (not ``msg_id``, but the primary key field ``id``, which is an
+integer). We then looked at using the PostgreSQL plugin `timescaledb`, and
+finally at using this plugin together with a database model change where the
+relation tables are merged into the main ``messages`` table and their content
+is stored using arrays.
+
+Based on our investigations, our recommendation is to migrate the PostgreSQL
+database to use the `timescaledb` plugin and configure datagrepper to have a
+default delta value via ``DEFAULT_QUERY_DELTA``.
+
+As a picture is worth a thousand words:
+
+.. image:: ../_static/datanommer_percent_sucess.jpg
+   :target: ../_images/datanommer_percent_sucess.jpg
+
+
+We checked: setting a ``DEFAULT_QUERY_DELTA`` alone already provides some
+performance gain; using `timescaledb` with ``DEFAULT_QUERY_DELTA`` provides
+the most gain; but using `timescaledb` without ``DEFAULT_QUERY_DELTA`` brings
+back the timeout issues we are seeing today when datagrepper is queried
+without a specified ``delta`` value.
+
+We also believe that the performance gain observed with `timescaledb` could be
+reproduced if we were to do the partitioning ourselves on the ``timestamp``
+field of the ``messages`` table. However, it would mean that we would have to
+maintain that partitioning manually, creating the new partitions as needed,
+and so on, while `timescaledb` provides all of this for us automatically, thus
+simplifying the long-term maintenance of that database.
+
+
+Proposed roadmap
+~~~~~~~~~~~~~~~~
+
+We propose the following roadmap to improve datanommer and datagrepper:
+
+0/ Announce the upcoming API breakage and outage of datagrepper
+
+Be loud about the upcoming changes and explain how the API breakage can be
+mitigated.
+
+
+1/ Port datanommer to fedora-messaging and openshift
+
+This will ensure that no duplicate messages are saved in the database
+(cf. our :ref:`timescaledb_findings`).
+It will also provide a way to store the messages while datagrepper is being
+upgraded (which will require an outage). Using lazy queues in rabbitmq may be
+a way to store the high number of messages that will pile up during the outage
+window (which will last over 24h).
+
+Rabbitmq lazy queues: https://www.rabbitmq.com/lazy-queues.html
+
+
+2/ Port datagrepper to timescaledb
+
+This will improve the performance of the UI. Thanks to rabbitmq, no messages
+will be lost; they will only show up in datagrepper at the end of the outage,
+with a delayed timestamp.
+
+3/ Configure datagrepper to have a ``DEFAULT_QUERY_DELTA``
+
+This will put an upper bound on the queries that would otherwise run slowly
+and lead to timeouts at the application level.
+
+
+4/ Port datagrepper to openshift
+
+This will make it easier to maintain and/or scale as needed.
+
+
+5/ Port datagrepper to fedora-messaging
+
+This will allow us to make use of the fedora-messaging schemas provided by the
+applications instead of relying on `fedmsg_meta_fedora_infrastructure`.
diff --git a/docs/datanommer_datagrepper/pg_timescaledb.rst b/docs/datanommer_datagrepper/pg_timescaledb.rst
index 8ff6aa2..e89e4a9 100644
--- a/docs/datanommer_datagrepper/pg_timescaledb.rst
+++ b/docs/datanommer_datagrepper/pg_timescaledb.rst
@@ -57,6 +57,8 @@ Finally, you can check that the extension was activated for your database:
 
    \dx
 
+.. _timescaledb_findings:
+
 Findings
 --------