Update the documentation about the datanommer/datagrepper work

Signed-off-by: Pierre-Yves Chibon <pingou@pingoured.fr>
Pierre-Yves Chibon 2021-02-05 17:37:11 +01:00
parent 1ff0d8fbd2
commit f61f8c482a
3 changed files with 117 additions and 0 deletions


@ -25,3 +25,5 @@ Here is the list of ideas/things we looked at:
:maxdepth: 1
pg_stat_statements
pg_partitioning
pg_timescaledb


@ -0,0 +1,60 @@
Partitioning the database
=========================
In the database used by datanommer and datagrepper one table stands out from the
other ones by its size, the ``messages`` table. This can be observed in
:ref:`datanommer`.
One way to speed things up in datagrepper is to split that table into a set
of smaller partitions.
Here are some resources regarding partitioning postgresql tables:
* Table partitioning at postgresql's documentation: https://www.postgresql.org/docs/13/ddl-partitioning.html
* How to use table partitioning to scale PostgreSQL: https://www.enterprisedb.com/postgres-tutorials/how-use-table-partitioning-scale-postgresql
* Definition of PostgreSQL Partition: https://www.educba.com/postgresql-partition/
Attempt #1
----------
For our first attempt at partitioning the ``messages`` table, we thought we
would partition it by year, with a separate partition for each year.
We thus started by adding a ``year`` field to the table and filling it by
extracting the year from the ``timestamp`` field of the table.
However, one thing to realize when using a partitioned table is that each
partition needs to be considered as an independent table. This means that a
unique constraint has to involve the field on which the table is partitioned.
In other words, if you partition the table by a ``year`` field, that ``year``
field will need to be part of the primary key as well as of any ``UNIQUE``
constraint on the table.
So to partition the ``messages`` table on ``year``, we had to add the ``year``
field to the primary key. However, that broke the foreign key constraints on
the ``user_messages`` and ``package_messages`` tables which rely on the ``id``
field to link the tables.
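The reasoning behind that constraint can be illustrated with a small toy model.
This is not PostgreSQL code, only a sketch under the assumption that each
partition keeps its own local index; the ``Partition`` and ``MessagesByYear``
classes are hypothetical names made up for the illustration. It shows that a
uniqueness check on ``id`` alone cannot notice a duplicate stored in another
partition, which is why the partition key has to be part of the constraint.

```python
# Toy model (not PostgreSQL code): each partition only knows its own rows,
# so a uniqueness check on "id" alone cannot see duplicates that live in
# another partition.


class Partition:
    def __init__(self):
        self.ids = set()  # stands in for a local unique index on "id" only

    def insert(self, msg_id):
        if msg_id in self.ids:
            raise ValueError(f"duplicate id {msg_id} in this partition")
        self.ids.add(msg_id)


class MessagesByYear:
    """Hypothetical stand-in for "messages" partitioned by "year"."""

    def __init__(self):
        self.partitions = {}  # year -> Partition

    def insert(self, msg_id, year):
        # Rows are routed by "year"; only that partition's index is consulted.
        self.partitions.setdefault(year, Partition()).insert(msg_id)


table = MessagesByYear()
table.insert(1, 2020)
table.insert(1, 2021)  # accepted: the 2021 partition never saw id 1

try:
    table.insert(1, 2021)  # rejected: same partition, same id
    same_partition_rejected = False
except ValueError:
    same_partition_rejected = True

# Collect every stored id across all partitions.
ids_across_partitions = [i for p in table.partitions.values() for i in p.ids]
print(ids_across_partitions)
print(same_partition_rejected)
```

In this model the pair ``(year, id)`` is unique but ``id`` on its own is not,
which matches the situation that broke the foreign keys described above.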
Attempt #2
----------
Since partitioning on ``year`` did not work, we reconsidered and decided to
partition on the ``id`` field instead, using range partitioning
(``PARTITION BY RANGE``).
We partitioned the ``messages`` table on the ``id`` field with partitions of
10 million records each. This has the advantage of making all partitions
similar in size.
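As a rough sketch of what this layout implies, the following Python snippet
generates the ``CREATE TABLE ... PARTITION OF`` statements for 10 million ids
per partition. It assumes the parent ``messages`` table was already created
with ``PARTITION BY RANGE (id)``; the child table names (``messages_p0``,
``messages_p1``, ...) and the helper functions are hypothetical and are not
taken from the actual migration.

```python
PARTITION_SIZE = 10_000_000  # ids per partition, as described in the text


def partition_ddl(max_id, parent="messages", size=PARTITION_SIZE):
    """Return CREATE TABLE statements covering ids from 0 up to max_id.

    The child table names are made up for this sketch.
    """
    statements = []
    count = max_id // size + 1
    for n in range(count):
        lower, upper = n * size, (n + 1) * size
        # In PostgreSQL range partitions, FROM is inclusive and TO is exclusive.
        statements.append(
            f"CREATE TABLE {parent}_p{n} PARTITION OF {parent} "
            f"FOR VALUES FROM ({lower}) TO ({upper});"
        )
    return statements


def partition_for(msg_id, size=PARTITION_SIZE):
    """Return the index of the partition that would hold msg_id."""
    return msg_id // size


ddl = partition_ddl(max_id=25_000_000)
for statement in ddl:
    print(statement)
```

Note that partitions end up with similar row counts only if the ``id`` values
are reasonably dense, which is the usual case for a sequence-backed key.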
More resources
--------------
These are a few more resources we looked at and thought were worth bookmarking:
* Automatic partitioning by day - PostgreSQL: https://stackoverflow.com/questions/55642326/
* pg_partman, partition manager: https://github.com/pgpartman/pg_partman
* How to scale PostgreSQL 10 using table inheritance and declarative partitioning: https://blog.timescale.com/blog/scaling-partitioning-data-postgresql-10-explained-cd48a712a9a1/


@ -0,0 +1,55 @@
Using the timescaledb extension
===============================
timescaledb (https://docs.timescale.com/latest/) is a postgresql extension for
time-series data.
Considering that many of the queries made on datagrepper involve the
``timestamp`` field (for example: all the messages with a given topic in a
given time range), we figured this extension was worth investigating.
A bonus is that this extension is already packaged and available in
Fedora and EPEL.
Resources
---------
* Setting up/enabling timescaledb: https://severalnines.com/database-blog/how-enable-timescaledb-existing-postgresql-database
* Migrating an existing database to timescaledb: https://docs.timescale.com/latest/getting-started/migrating-data#same-db
Installing/enabling/activating
------------------------------
To install the extension, run:
::

    dnf install timescaledb

Then edit ``/var/lib/pgsql/data/postgresql.conf`` to tell postgresql to load it:
::

    shared_preload_libraries = 'pg_stat_statements,timescaledb'
    timescaledb.max_background_workers=4

The entire database server then needs to be restarted:
::

    systemctl restart postgresql

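If the server restarts but the extension does not behave as expected, one
thing worth checking is that ``timescaledb`` is really listed in
``shared_preload_libraries``. The following is a minimal Python sketch of such
a check; the ``preloaded_libraries`` helper is hypothetical, and it assumes the
setting is written with a single-quoted value and an ``=`` sign as in the
example above.

```python
import re


def preloaded_libraries(conf_text):
    """Return the libraries from the last active shared_preload_libraries line.

    Simplification: expects the form  name = 'a,b'  with single quotes.
    """
    libs = []
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        match = re.match(r"shared_preload_libraries\s*=\s*'([^']*)'", line)
        if match:
            # When a setting is repeated, the last occurrence wins.
            libs = [n.strip() for n in match.group(1).split(",") if n.strip()]
    return libs


# Sample content mirroring the configuration shown above.
sample = """
#shared_preload_libraries = ''
shared_preload_libraries = 'pg_stat_statements,timescaledb'
timescaledb.max_background_workers=4
"""

libs = preloaded_libraries(sample)
print(libs)
print("timescaledb" in libs)
```

To run it against the real file, pass the content of
``/var/lib/pgsql/data/postgresql.conf`` to the function instead of ``sample``.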
You can then check that the extension is available to postgresql:
::

    $ sudo -u postgres psql
    SELECT * FROM pg_available_extensions ORDER BY name;

Then, you will need to activate it for your database:
::

    $ sudo -u postgres psql <database_name>
    CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;

Finally, you can check that the extension was activated for your database:
::

    $ sudo -u postgres psql <database_name>
    \dx