Give our conclusion on the datanommer/datagrepper research

Signed-off-by: Pierre-Yves Chibon <pingou@pingoured.fr>
2021-02-17 12:18:55 +01:00 · 2021-02-17 12:18:55 +01:00 · b12d36c47b
commit b12d36c47b
parent 1b8ebcc690
2 changed files with 91 additions and 0 deletions
--- a/docs/datanommer_datagrepper/index.rst
+++ b/docs/datanommer_datagrepper/index.rst
@@ -34,3 +34,92 @@ Here is the list of ideas/things we looked at:
    pg_timescaledb
    pg_array_column_postgrest
    stats
 Conclusions
 -----------
 We have investigated different ways to improve the database storing our 180
 million messages. While we also considered looking at the datagrepper application
 itself, we concluded that replacing datagrepper with another application
 would have too large consequences. We have a number of applications in our
 realm that rely on datagrepper's API and there is an unknown number of applications
 outside our realm that make use of it as well.
 Breaking all of these applications is a non-goal for us. For this reason we
 focused on postgresql first.
 We looked at different solutions, starting with manually partitioning on year,
 then on ``id`` (not ``msg_id``, the primary key field ``id`` which is an integer).
 We then looked at using the postgresql plugin `timescaledb` and finally we looked
 at using this plugin together with a database model change where the relation
 tables are merged into the main ``messages`` table and their data is stored using
 arrays.
 Based on our investigations, our recommendation is to migrate the postgresql
 database to use the `timescaledb` plugin and configure datagrepper to have a
 default delta value via ``DEFAULT_QUERY_DELTA``.
 As a picture is worth a thousand words:
 .. image:: ../_static/datanommer_percent_sucess.jpg
    :target: ../_images/datanommer_percent_sucess.jpg
 We checked: setting a ``DEFAULT_QUERY_DELTA`` alone already provides some
 performance gain, and using `timescaledb` with ``DEFAULT_QUERY_DELTA`` provides the
 most gain. However, using `timescaledb` without ``DEFAULT_QUERY_DELTA`` brings back
 the timeout issues we are seeing today when datagrepper is queried without a
 specified ``delta`` value.
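
 To illustrate why a default delta helps, here is a minimal sketch of how a
 default could bound the time window of a query. This is not datagrepper's
 actual code: the ``query_window`` helper and the 3 days value are assumptions
 made for the example, only the ``DEFAULT_QUERY_DELTA`` name comes from the
 configuration discussed above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

# Hypothetical default of 3 days, in seconds. The value is an assumption.
DEFAULT_QUERY_DELTA = 3 * 24 * 60 * 60


def query_window(
    delta: Optional[float] = None,
    end: Optional[datetime] = None,
    default_delta: Optional[float] = DEFAULT_QUERY_DELTA,
) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Return the (start, end) bounds to apply on the ``timestamp`` column.

    A start of None means the query has no lower bound and may have to
    scan the whole ``messages`` table.
    """
    if delta is None:
        # No delta in the request: fall back to the configured default.
        delta = default_delta
    if delta is None:
        # No default configured either: the query stays unbounded.
        return None, end
    if end is None:
        end = datetime.now(timezone.utc)
    return end - timedelta(seconds=delta), end
```

 With a default configured, every query gets a lower bound on ``timestamp``,
 which is what lets a time-partitioned table skip the partitions outside of
 that window.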
 We also believe that the performance gain observed with `timescaledb` could be
 reproduced if we were to do the partitioning ourselves on the ``timestamp`` field
 of the ``messages`` table. However, it would mean that we have to manually
 maintain that partitioning, take care of creating the new partitions as needed
 and so on, while `timescaledb` provides all of this for us automatically, thus
 simplifying the long term maintenance of that database.
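
 As a rough idea of what maintaining the partitions ourselves would involve,
 the sketch below generates the statements needed to create monthly partitions.
 It assumes that ``messages`` is declared with ``PARTITION BY RANGE (timestamp)``
 and that monthly partitions are used. The helper names and the partition
 naming scheme are hypothetical, and something would still have to run this
 on a schedule.

```python
from datetime import date
from typing import List


def next_month(day: date) -> date:
    """Return the first day of the month following ``day``."""
    if day.month == 12:
        return date(day.year + 1, 1, 1)
    return date(day.year, day.month + 1, 1)


def monthly_partition_ddl(first: date, months: int, table: str = "messages") -> List[str]:
    """Build one CREATE TABLE statement per monthly partition.

    Assumes ``table`` is range-partitioned on its ``timestamp`` column.
    """
    start = date(first.year, first.month, 1)
    statements = []
    for _ in range(months):
        end = next_month(start)
        # Hypothetical naming scheme: messages_YYYY_MM
        name = f"{table}_{start:%Y_%m}"
        statements.append(
            f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {table} "
            f"FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
        )
        start = end
    return statements


# With timescaledb, the equivalent setup is a single call and the chunks are
# then created automatically. The exact arguments depend on the timescaledb
# version and on whether the table already holds data.
TIMESCALEDB_DDL = "SELECT create_hypertable('messages', 'timestamp');"
```

 For example, ``monthly_partition_ddl(date(2021, 2, 17), 12)`` returns the
 twelve statements covering February 2021 to January 2022.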
 Proposed roadmap
 ~~~~~~~~~~~~~~~~
 We propose the following roadmap to improve datanommer and datagrepper:
 0/ Announce the upcoming API breakage and outage of datagrepper
 Be loud about the upcoming changes and explain how the API breakage can be
 mitigated.
 1/ Port datanommer to fedora-messaging and openshift
 This will ensure that no duplicate messages are saved in the database
 (cf. our :ref:`timescaledb_findings`).
 It will also provide a way to store the messages while datagrepper is being
 upgraded (which will require an outage). Using lazy queues in rabbitmq may be
 a way to store the high number of messages that will pile up during the outage
 window (which will be over 24h).
 Rabbitmq lazy queues: https://www.rabbitmq.com/lazy-queues.html
 2/ Port datagrepper to timescaledb.
 This will improve the performance of the UI. Thanks to rabbitmq, no messages will
 be lost: they will only show up in datagrepper at the end of the outage and
 with a delayed timestamp.
 3/ Configure datagrepper to have a ``DEFAULT_QUERY_DELTA``.
 This will bound a number of queries which otherwise run slowly and lead to
 timeouts at the application level.
 4/ Port datagrepper to openshift
 This will make it easier to maintain and/or scale as needed.
 5/ Port datagrepper to fedora-messaging
 This will allow us to make use of the fedora-messaging schemas provided by the
 applications instead of relying on `fedmsg_meta_fedora_infrastructure`.
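
 Regarding the lazy queues mentioned in step 1, here is a minimal sketch of how
 such a queue could be declared from Python. The ``x-queue-mode`` argument is
 the one documented by RabbitMQ. The ``declare_lazy_queue`` helper and the
 queue name are hypothetical, and the channel is assumed to behave like a
 ``pika`` channel (anything exposing a compatible ``queue_declare`` method).

```python
from typing import Any, Dict

# Queue argument documented by RabbitMQ to put a queue in lazy mode, so that
# messages are moved to disk as early as possible instead of kept in memory.
LAZY_QUEUE_ARGUMENTS: Dict[str, Any] = {"x-queue-mode": "lazy"}


def declare_lazy_queue(channel: Any, name: str) -> None:
    """Declare a durable lazy queue on an AMQP channel.

    ``channel`` is assumed to expose ``queue_declare`` the way a pika
    channel does. ``name`` is whatever queue datanommer consumes from.
    """
    channel.queue_declare(
        queue=name,
        durable=True,  # the queue must survive a broker restart during the outage
        arguments=dict(LAZY_QUEUE_ARGUMENTS),
    )
```

 With ``pika``, the channel would come from
 ``pika.BlockingConnection(...).channel()``. In a fedora-messaging deployment
 the queue is presumably declared through the configuration file instead, in
 which case the same argument would be set there.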
--- a/docs/datanommer_datagrepper/pg_timescaledb.rst
+++ b/docs/datanommer_datagrepper/pg_timescaledb.rst
@@ -57,6 +57,8 @@ Finally, you can check that the extension was activated for your database:
    \dx
 .. _timescaledb_findings:
 Findings
 --------