Give our conclusion on the datanommer/datagrepper research

Signed-off-by: Pierre-Yves Chibon <pingou@pingoured.fr>
Pierre-Yves Chibon 2021-02-17 12:18:55 +01:00
parent 1b8ebcc690
commit b12d36c47b
2 changed files with 91 additions and 0 deletions

@@ -34,3 +34,92 @@ Here is the list of ideas/things we looked at:
pg_timescaledb
pg_array_column_postgrest
stats
Conclusions
-----------
We have investigated different ways to improve the database storing our 180
million messages. While we also considered looking at the datagrepper
application itself, we concluded that replacing datagrepper with another
application would have consequences that are too far-reaching. A number of
applications in our realm rely on datagrepper's API, and an unknown number of
applications outside our realm make use of it as well.
Breaking all of these applications is a non-goal for us. For this reason we
focused on postgresql first.
We looked at different solutions, starting with manual partitioning by year,
then by ``id`` (not ``msg_id``, but the integer primary key field ``id``).
We then looked at using the postgresql extension `timescaledb`, and finally at
using this extension together with a database model change in which the
relation tables are merged into the main ``messages`` table and their content
is stored using arrays.
Based on our investigations, our recommendation is to migrate the postgresql
database to use the `timescaledb` extension and to configure datagrepper with
a default delta value via ``DEFAULT_QUERY_DELTA``.
As a picture is worth a thousand words:
.. image:: ../_static/datanommer_percent_sucess.jpg
:target: ../_images/datanommer_percent_sucess.jpg
We checked: setting a ``DEFAULT_QUERY_DELTA`` alone already provides some
performance gain, and using `timescaledb` together with ``DEFAULT_QUERY_DELTA``
provides the most gain. However, using `timescaledb` without
``DEFAULT_QUERY_DELTA`` brings back the timeout issues we are seeing today when
datagrepper is queried without a specified ``delta`` value.
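
As an illustration, here is a minimal sketch of what this could look like in
datagrepper's Python configuration; the value is an assumption (expressed in
seconds, like the ``delta`` query parameter) and would need to be tuned:

.. code-block:: python

   # Hypothetical excerpt of a datagrepper configuration file. Queries that
   # do not specify ``delta`` get bounded to the last three days; the exact
   # value is an assumption and should be tuned.
   DEFAULT_QUERY_DELTA = 3 * 24 * 60 * 60  # seconds
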
We also believe that the performance gain observed with `timescaledb` could be
reproduced if we were to do the partitioning ourselves on the ``timestamp``
field of the ``messages`` table. However, that would mean maintaining the
partitioning manually, taking care of creating new partitions as needed and so
on, while `timescaledb` provides all of this for us automatically, thus
simplifying the long-term maintenance of that database.
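
For reference, the conversion itself is essentially a single `timescaledb`
call. Below is a minimal, untested sketch; the connection string and chunk
interval are assumptions, only the ``messages`` table and its ``timestamp``
column come from our investigation:

.. code-block:: python

   # Hypothetical sketch of the conversion, assuming the timescaledb
   # extension is already activated on the database (see the findings page);
   # the connection string and chunk interval are assumptions.
   import psycopg2

   conn = psycopg2.connect("dbname=datanommer")
   with conn, conn.cursor() as cur:
       # Let timescaledb partition the existing ``messages`` table on its
       # ``timestamp`` column, migrating the existing rows into chunks.
       cur.execute(
           "SELECT create_hypertable("
           "'messages', 'timestamp', "
           "chunk_time_interval => INTERVAL '1 month', "
           "migrate_data => true);"
       )
   conn.close()
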
Proposed roadmap
~~~~~~~~~~~~~~~~
We propose the following roadmap to improve datanommer and datagrepper:
0/ Announce the upcoming API breakage and outage of datagrepper
Be loud about the upcoming changes and explain how the API breakage can be
mitigated.
1/ Port datanommer to fedora-messaging and openshift
This will ensure that no duplicate messages are saved in the database
(cf. our :ref:`timescaledb_findings`).
It will also provide a way to store the messages while datagrepper is being
upgraded (which will require an outage). Using lazy queues in rabbitmq may be
a way to hold the large number of messages that will pile up during the outage
window (which will last over 24h). A minimal consumer sketch is included after
this roadmap.
Rabbitmq lazy queues: https://www.rabbitmq.com/lazy-queues.html
2/ Port datagrepper to timescaledb.
This will improve the performance of the UI. Thanks to rabbitmq, no messages
will be lost; they will simply show up in datagrepper at the end of the outage,
with a delayed timestamp.
3/ Configure datagrepper to have a ``DEFAULT_QUERY_DELTA``.
This will put a bound on queries which would otherwise run slowly and lead to
timeouts at the application level.
4/ Port datagrepper to openshift
This will make it easier to maintain and/or scale as needed.
5/ Port datagrepper to fedora-messaging
This will make it possible to use the fedora-messaging schemas provided by the
applications instead of relying on `fedmsg_meta_fedora_infrastructure`.
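
To illustrate step 1, here is a minimal, hypothetical sketch of a
fedora-messaging consumer; the ``store()`` helper is a placeholder, not the
actual datanommer API:

.. code-block:: python

   # Hypothetical sketch of a datanommer consumer on fedora-messaging;
   # ``store()`` is a placeholder, not the real datanommer storage call.
   from fedora_messaging import api


   def store(message):
       """Placeholder: persist the message into the datanommer database."""
       print(f"storing {message.id} on topic {message.topic}")


   def callback(message):
       # fedora-messaging acknowledges a message only after this callback
       # returns, so unprocessed messages stay queued in rabbitmq (for
       # example in a lazy queue) while the database is unavailable.
       store(message)


   if __name__ == "__main__":
       # Consume from the queues/bindings defined in the fedora-messaging
       # configuration file.
       api.consume(callback)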

@@ -57,6 +57,8 @@ Finally, you can check that the extension was activated for your database:
\dx
.. _timescaledb_findings:
Findings
--------