diff --git a/docs/datanommer_datagrepper/index.rst b/docs/datanommer_datagrepper/index.rst
index 28cae4b..4bd6e09 100644
--- a/docs/datanommer_datagrepper/index.rst
+++ b/docs/datanommer_datagrepper/index.rst
@@ -25,3 +25,5 @@ Here is the list of ideas/things we looked at:
    :maxdepth: 1
 
    pg_stat_statements
+   pg_partitioning
+   pg_timescaledb
diff --git a/docs/datanommer_datagrepper/pg_partitioning.rst b/docs/datanommer_datagrepper/pg_partitioning.rst
new file mode 100644
index 0000000..67add82
--- /dev/null
+++ b/docs/datanommer_datagrepper/pg_partitioning.rst
@@ -0,0 +1,60 @@
+Partitioning the database
+=========================
+
+In the database used by datanommer and datagrepper, one table stands out from
+the others by its size: the ``messages`` table. This can be observed in
+:ref:`datanommer`.
+
+One possibility to speed things up in datagrepper is to partition that table
+into a set of smaller partitions.
+
+Here are some resources regarding partitioning PostgreSQL tables:
+
+* Table partitioning in the PostgreSQL documentation: https://www.postgresql.org/docs/13/ddl-partitioning.html
+* How to use table partitioning to scale PostgreSQL: https://www.enterprisedb.com/postgres-tutorials/how-use-table-partitioning-scale-postgresql
+* Definition of PostgreSQL Partition: https://www.educba.com/postgresql-partition/
+
+
+Attempt #1
+----------
+
+For our first attempt at partitioning the ``messages`` table, we thought we would
+partition it by year, with a different partition for each year.
+We thus started by adding a ``year`` field to the table and filling it by
+extracting the year from the ``timestamp`` field of the table.
+
+However, one thing to realize when using a partitioned table is that each partition
+needs to be considered as an independent table. This means a unique constraint has
+to include the field on which the table is partitioned.
+In other words, if you partition the table by a year field, that year field will
+need to be part of the primary key as well as any ``UNIQUE`` constraint on the
+table.
+
+So to partition the ``messages`` table on ``year``, we had to add the ``year``
+field to the primary key. However, that broke the foreign key constraints on
+the ``user_messages`` and ``package_messages`` tables, which rely on the ``id``
+field to link the tables.
+
+
+Attempt #2
+----------
+
+Since partitioning on ``year`` did not work, we reconsidered and decided to
+partition on the ``id`` field instead, using ``PARTITION BY RANGE``.
+
+We partitioned the ``messages`` table on the ``id`` field with partitions of 10
+million records each. This has the advantage of making all partitions similar
+in size.
+
+
+
+
+More resources
+--------------
+
+These are a few more resources we looked at and thought were worth bookmarking:
+
+* Automatic partitioning by day - PostgreSQL: https://stackoverflow.com/questions/55642326/
+* pg_partman, partition manager: https://github.com/pgpartman/pg_partman
+* How to scale PostgreSQL 10 using table inheritance and declarative partitioning: https://blog.timescale.com/blog/scaling-partitioning-data-postgresql-10-explained-cd48a712a9a1/
+
diff --git a/docs/datanommer_datagrepper/pg_timescaledb.rst b/docs/datanommer_datagrepper/pg_timescaledb.rst
new file mode 100644
index 0000000..b13d670
--- /dev/null
+++ b/docs/datanommer_datagrepper/pg_timescaledb.rst
@@ -0,0 +1,55 @@
+Using the timescaledb extension
+===============================
+
+timescaledb (https://docs.timescale.com/latest/) is a PostgreSQL extension for
+time-series data.
+Considering that many of the queries run on datagrepper involve the timestamp field
+(for example: all the messages with that topic in this time range), we figured
+this extension was worth investigating.
+
+A bonus is that this extension is already packaged and available in
+Fedora and EPEL.
+
+
+Resources
+---------
+
+* Setting up/enabling timescaledb: https://severalnines.com/database-blog/how-enable-timescaledb-existing-postgresql-database
+* Migrating an existing database to timescaledb: https://docs.timescale.com/latest/getting-started/migrating-data#same-db
+
+
+Installing/enabling/activating
+------------------------------
+
+To install the plugin, simply run:
+::
+
+    dnf install timescaledb
+
+Then edit ``/var/lib/pgsql/data/postgresql.conf`` to tell PostgreSQL to load it:
+::
+
+    shared_preload_libraries = 'pg_stat_statements,timescaledb'
+    timescaledb.max_background_workers=4
+
+
+It will then need a restart of the entire database server:
+::
+
+    systemctl restart postgresql
+
+You can then check that the extension is available:
+::
+
+    $ sudo -u postgres psql
+    SELECT * FROM pg_available_extensions ORDER BY name;
+
+Then, you will need to activate it for your database::
+
+    $ sudo -u postgres psql
+    CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;
+
+Finally, you can check that the extension was activated for your database::
+
+    $ sudo -u postgres psql
+    \dx