fix parsing errors and sphinx warnings

Signed-off-by: Ryan Lerch <rlerch@redhat.com>
Ryan Lerch 2023-11-16 08:02:56 +10:00, committed by zlopez
parent 8fb9b2fdf0
commit ba720c3d77
98 changed files with 4799 additions and 4788 deletions

Datanommer
==========
- Reads in messages from the bus
- Stores them into the database

Database tables
---------------
Here is what the database schema currently looks like:

.. code-block::

   datanommer=# \dt
            List of relations
   ...
   public | user          | table | datanommer
   public | user_messages | table | datanommer

Table sizes
-----------
Here is the size of each table:
.. code-block::

   datanommer-#
   SELECT
   ...
   alembic_version | 8192 bytes | 0 bytes
   (6 rows)

The 3 columns are:

.. code-block::

   Table           The name of the table
   Size            The total size that this table takes
   External Size   The size that related objects of this table like indices take
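
The full statement is truncated in the output above; a query along these lines, built
from the standard PostgreSQL size functions (a sketch based on the wiki page cited
below, not necessarily the exact query that was run), produces those three columns:

.. code-block::

   -- Illustrative reconstruction: per-table size, with the "external"
   -- size (indexes, TOAST, ...) computed as total size minus heap size.
   SELECT
       relname AS "Table",
       pg_size_pretty(pg_total_relation_size(relid)) AS "Size",
       pg_size_pretty(pg_total_relation_size(relid)
                      - pg_relation_size(relid)) AS "External Size"
   FROM pg_catalog.pg_statio_user_tables
   ORDER BY pg_total_relation_size(relid) DESC;
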
.. code-block::

   datanommer=#
   SELECT
   ...
   sql_features | r | 716 | 64 kB
   (37 rows)

The 4 columns are:

.. code-block::

   objectname   The name of the object
   objecttype   r for the table, i for an index, t for toast data, ...
   #entries     The number of entries in the object (e.g. rows)
   size         The size of the object
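
Again, the exact statement is truncated above; the four columns map directly onto
``pg_class``, so a reconstruction could look like this (a sketch, not the original
query; ``reltuples`` is PostgreSQL's row-count estimate):

.. code-block::

   -- Illustrative reconstruction: per-object name, kind, row estimate
   -- and on-disk size, largest objects first.
   SELECT
       relname AS objectname,
       relkind AS objecttype,
       reltuples AS "#entries",
       pg_size_pretty(pg_relation_size(oid)) AS size
   FROM pg_class
   ORDER BY pg_relation_size(oid) DESC;
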
(source for these queries:
https://wiki-bsse.ethz.ch/display/ITDOC/Check+size+of+tables+and+objects+in+PostgreSQL+database)

Default delta
=============
Checking the current status of datagrepper, we realized that not specifying a `delta`
value in the URL led to timeouts, while specifying one makes datagrepper return
properly.
Investigating the configuration options of datagrepper, we found out that there is a
`DEFAULT_QUERY_DELTA` configuration key that allows specifying a default delta value
when one is not given.
Just setting that configuration key to ``60*60*24*3`` (ie: 3 days) improves
datagrepper's performance quite a bit (as in, queries actually return instead of
timing out).
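
For illustration, a delta bounds every query to a recent time window; a default of
``60*60*24*3`` seconds is roughly equivalent to adding a predicate like the following
to the underlying query (a sketch; datagrepper builds its SQL through its ORM, not
literally like this):

.. code-block::

   -- What a 3-day default delta effectively amounts to:
   SELECT *
   FROM messages
   WHERE timestamp >= NOW() - INTERVAL '3 days'
   ORDER BY timestamp DESC;
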
That configuration change does break the API a little, as it will limit the messages
returned to the last 3 days by default.

Datanommer / Datagrepper
========================

Datanommer
----------
- Reads in messages from the bus
- Stores them into the database

.. toctree::
   :maxdepth: 1

   datanommer

Datagrepper
-----------
- Exposes the messages in the database via an API with different filtering capabilities

Investigation
-------------
Here is the list of ideas/things we looked at:

.. toctree::
   :maxdepth: 1

   pg_array_column_postgrest
   stats

Conclusions
-----------
We have investigated different ways to improve the database storing our 180 million
messages. While we considered looking at the datagrepper application itself as well,
we concluded that replacing datagrepper with another application would have too
far-reaching consequences. We have a number of applications in our realm that rely on
datagrepper's API, and there is an unknown number of applications outside our realm
that make use of it as well. Breaking all of these applications is a non-goal for us.
For this reason we focused on postgresql first.
We looked at different solutions, starting with manually partitioning on year, then on
``id`` (not ``msg_id``, the primary key field ``id`` which is an integer). We then
looked at using the postgresql plugin `timescaledb`, and finally we looked at using
this plugin together with a database model change where the relation tables are merged
into the main ``messages`` table and their data is stored using arrays.
Based on our investigations, our recommendation is to migrate the postgresql database
to use the `timescaledb` plugin and configure datagrepper to have a default delta
value via ``DEFAULT_QUERY_DELTA``.
As a picture is worth a thousand words:
.. image:: ../_static/datanommer_percent_sucess.jpg
   :target: ../_images/datanommer_percent_sucess.jpg

We checked: setting a ``DEFAULT_QUERY_DELTA`` alone already provides some performance
gain; using `timescaledb` with ``DEFAULT_QUERY_DELTA`` provides the most gain; but
using `timescaledb` without ``DEFAULT_QUERY_DELTA`` brings back the timeout issues we
are seeing today when datagrepper is queried without a specified ``delta`` value.
We also believe that the performance gain observed with `timescaledb` could be
reproduced if we were to do the partitioning ourselves on the ``timestamp`` field of
the ``messages`` table. However, it would mean that we would have to manually maintain
that partitioning, take care of creating the new partitions as needed and so on, while
`timescaledb` provides all of this for us automatically, thus simplifying the
long-term maintenance of that database.
Proposed roadmap
~~~~~~~~~~~~~~~~
We propose the following roadmap to improve datanommer and datagrepper:

0/ Announce the upcoming API breakage and outage of datagrepper
Be loud about the upcoming changes and explain how the API breakage can be mitigated.
1/ Port datanommer to fedora-messaging and openshift
This will ensure that no duplicate messages are saved in the database (cf. our
:ref:`timescaledb_findings`). It will also provide a way to store the messages while
datagrepper is being upgraded (which will require an outage). Using lazy queues in
rabbitmq may be a way to store the high number of messages that will pile up during
the outage window (which will be over 24h).
Rabbitmq lazy queues: https://www.rabbitmq.com/lazy-queues.html
2/ Port datagrepper to timescaledb.
This will improve the performance of the UI. Thanks to rabbitmq, no messages will be
lost; they will only show up in datagrepper at the end of the outage and with a
delayed timestamp.
3/ Configure datagrepper to have a ``DEFAULT_QUERY_DELTA``.
This will simply bound a number of queries which otherwise run slowly and lead to
timeouts at the application level.
4/ Port datagrepper to openshift
This will make it easier to maintain and/or scale as needed.
5/ Port datagrepper to fedora-messaging
This will make it possible to use the fedora-messaging schemas provided by the
applications instead of relying on `fedmsg_meta_fedora_infrastructure`.

Using the array type for user and package queries
==================================================

Currently, we use auxiliary tables to query for messages related to packages or users,
in the standard RDBMS fashion.
We ran into some problems when trying to enforce foreign key constraints while using
the timescaledb extension. We decided to test whether just using a column of array
type with a proper index would have similar performance.
Array columns support indexing with a Generalized Inverted Index (GIN), which allows
for fast searches on membership and intersection. Because we mostly search for
membership, an array column could be performant enough for our purposes.
Resources
---------
- PG 12 Array type: https://www.postgresql.org/docs/12/arrays.html
- GIN index: https://www.postgresql.org/docs/12/gin.html
- GIN operators for BTREE: https://www.postgresql.org/docs/11/btree-gin.html

Installing/enabling/activating
------------------------------
To have comparable results, we enabled timescaledb in the same fashion as in our other
experiment.
To add the new column:

.. code-block::

   alter table messages2 add column packages text[];

To populate it:

.. code-block::

   update messages2 set packages=t_agg.p_agg from
   (select msg, array_agg(package) as p_agg from package_messages group by msg) as t_agg where messages2.id=t_agg.msg;

We need to enable the btree_gin extension to be able to create an index on the array
as well as the timestamp:

.. code-block::

   CREATE EXTENSION btree_gin;

To create the index:

.. code-block::

   CREATE INDEX idx_msg_user on "messages2" USING GIN ("timestamp", "packages");

To help reuse our testing script, we set up postgrest locally:

.. code-block::

   podman run --rm --net=host -p 3000:3000 -e PGRST_DB_URI=$DBURI -e PGRST_DB_ANON_ROLE="datagrepper" -e PGRST_MAX_ROWS=25 postgrest/postgrest:v7.0.

Because we focused only on package queries, as the user column couldn't be populated
due to constraints on size, we chose two queries as representative. There is an
implicit limit to return just 25 rows.
A simple membership:

.. code-block::

   /messages_ts?packages=ov.{{kernel}}

A simple membership ordered by time:

.. code-block::

   /messages_ts?order=timestamp.desc&packages=ov.{{kernel}}
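
For reference, postgrest's ``ov`` operator is the array overlap operator ``&&``, so
the two URLs above correspond roughly to the following SQL (a sketch; the ``LIMIT``
comes from ``PGRST_MAX_ROWS``):

.. code-block::

   -- Membership test; this is what the GIN index accelerates.
   SELECT * FROM messages2 WHERE packages && ARRAY['kernel'] LIMIT 25;

   -- The time-ordered variant; GIN cannot return rows in timestamp
   -- order, so the sort happens after the index scan.
   SELECT * FROM messages2
   WHERE packages && ARRAY['kernel']
   ORDER BY "timestamp" DESC
   LIMIT 25;
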
Findings
--------
Querying just the package membership
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The queries were surprisingly fast, with a maximum under 4 seconds and a mean around
half a second. This encouraged us to do further experiments.
Results:

.. code-block::

   test_filter_by_package
       Requests: 300, pass: 300, fail: 0, exception: 0
       For pass requests:
           Request per Second - mean: 3.63
           Time per Request - mean: 0.522946, min: 0.000000, max: 3.907548

Querying just the package membership ordered by timestamp desc
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Usually we want to see the most recent messages, so we amended the query to include
"order by timestamp desc". The result was less encouraging, with the longest
successful query taking more than 90 seconds and several timing out.
This seems to be a result of the GIN index not supporting ordering.
Results:

.. code-block::

   test_filter_by_package
       Requests: 300, pass: 280, fail: 0, exception: 20
       For pass requests:
           Request per Second - mean: 0.53
           Time per Request - mean: 7.474040, min: 0.000000, max: 99.880939

Conclusion
----------
While array support seems interesting, and for simple queries very fast, indexes that
require ordering don't seem to be supported. This makes a strong case against using
them.

Partitioning the database
=========================
In the database used by datanommer and datagrepper, one table stands out from the
others by its size: the ``messages`` table. This can be observed in :ref:`datanommer`.
One possibility to speed things up in datagrepper is to partition that table into a
set of smaller partitions.
Here are some resources regarding partitioning postgresql tables:
- Table partitioning at postgresql's documentation:
  https://www.postgresql.org/docs/13/ddl-partitioning.html
- How to use table partitioning to scale PostgreSQL:
  https://www.enterprisedb.com/postgres-tutorials/how-use-table-partitioning-scale-postgresql
- Definition of PostgreSQL Partition: https://www.educba.com/postgresql-partition/

Attempt #1
----------
For our first attempt at partitioning the `messages` table, we thought we would
partition it by year, having a different partition for each year. We thus started by
adding a ``year`` field to the table and filling it by extracting the year from the
``timestamp`` field of the table.
However, one thing to realize when using a partitioned table is that each partition
needs to be considered an independent table, meaning a unique constraint has to
involve the field on which the table is partitioned. In other words, if you partition
the table by a year field, that year field will need to be part of the primary key as
well as of any ``UNIQUE`` constraint on the table.
So to partition the `messages` table on ``year``, we had to add the ``year`` field to
the primary key. However, that broke the foreign key constraints on the
``user_messages`` and ``package_messages`` tables, which rely on the ``id`` field to
link the tables.
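
In SQL terms, the attempt looked roughly like this (a sketch with assumed table and
partition names, not the exact statements we ran):

.. code-block::

   -- Add and populate the year column.
   ALTER TABLE messages ADD COLUMN year integer;
   UPDATE messages SET year = EXTRACT(YEAR FROM "timestamp");

   -- A year-partitioned table must carry the partition key in its
   -- primary key, which is exactly what broke the foreign keys.
   CREATE TABLE messages_by_year (
       LIKE messages INCLUDING DEFAULTS,
       PRIMARY KEY (id, year)
   ) PARTITION BY LIST (year);

   CREATE TABLE messages_2020 PARTITION OF messages_by_year
       FOR VALUES IN (2020);
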
Attempt #2
----------
Since partitioning on ``year`` did not work, we reconsidered and decided to partition
on the ``id`` field instead, using `RANGE PARTITION`.

We partitioned the ``messages`` table on the ``id`` field with partitions of 10
million records each. This has the advantage of making the partitions of similar
sizes.
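
A sketch of that layout (again with assumed names):

.. code-block::

   -- Range-partition on the integer primary key, 10 million rows per
   -- partition; each partition ends up roughly the same size.
   CREATE TABLE messages_by_id (
       LIKE messages INCLUDING DEFAULTS,
       PRIMARY KEY (id)
   ) PARTITION BY RANGE (id);

   CREATE TABLE messages_p00 PARTITION OF messages_by_id
       FOR VALUES FROM (0) TO (10000000);
   CREATE TABLE messages_p01 PARTITION OF messages_by_id
       FOR VALUES FROM (10000000) TO (20000000);
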
More resources
--------------
These are a few more resources we looked at and thought were worth bookmarking:
- Automatic partitioning by day - PostgreSQL:
  https://stackoverflow.com/questions/55642326/
- pg_partman, partition manager: https://github.com/pgpartman/pg_partman
- How to scale PostgreSQL 10 using table inheritance and declarative partitioning:
  https://blog.timescale.com/blog/scaling-partitioning-data-postgresql-10-explained-cd48a712a9a1/

Postgresql's pg_stat_statements
===============================
This is a postgresql module that allows tracking the planning and execution statistics
of all SQL statements executed by a server.
Using this, we can monitor and figure out what the slowest queries executed on the
server are.
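
For example, the slowest statements can be listed with a query like this (column names
as in the PostgreSQL 13 version of the module; older releases use
``total_time``/``mean_time`` instead):

.. code-block::

   -- The five statements with the highest mean execution time.
   SELECT query,
          calls,
          mean_exec_time,
          total_exec_time
   FROM pg_stat_statements
   ORDER BY mean_exec_time DESC
   LIMIT 5;
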
Resources:
- Postgresql doc: https://www.postgresql.org/docs/13/pgstatstatements.html
- How to enable it:
  https://www.virtual-dba.com/postgresql-performance-enabling-pg-stat-statements/
- How to use it:
  https://www.virtual-dba.com/postgresql-performance-identifying-hot-and-slow-queries/

Using the timescaledb extension
===============================
timescaledb (https://docs.timescale.com/latest/) is a postgresql extension for
time-series databases. Considering that a lot of the actions done on datagrepper
involve the timestamp field (for example: all the messages with a given topic in a
time range), we figured this extension was worth investigating.

A bonus point is that this extension is already packaged and available in Fedora and
EPEL.
Resources
---------
- Setting up/enabling timescaledb:
  https://severalnines.com/database-blog/how-enable-timescaledb-existing-postgresql-database
- Migrating an existing database to timescaledb:
  https://docs.timescale.com/latest/getting-started/migrating-data#same-db

Installing/enabling/activating
------------------------------
To install the plugin, simply run:

.. code-block::

   dnf install timescaledb

Then edit ``/var/lib/pgsql/data/postgresql.conf`` to tell postgresql to load it:

.. code-block::

   shared_preload_libraries = 'pg_stat_statements,timescaledb'
   timescaledb.max_background_workers=4

It will then need a restart of the entire database server:

.. code-block::

   systemctl restart postgresql

You can then check if the extension loaded properly:

.. code-block::

   $ sudo -u postgres psql
   SELECT * FROM pg_available_extensions ORDER BY name;

Then, you will need to activate it for your database:

.. code-block::

   $ sudo -u postgres psql <database_name>
   CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;

Finally, you can check that the extension was activated for your database:

.. code-block::

   $ sudo -u postgres psql <database_name>
   \dx
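
With the extension active, the (still empty) ``messages`` table can then be turned
into a hypertable partitioned on its timestamp column, before the data is imported
back in (a sketch; the call below relies on timescaledb's default chunk interval):

.. code-block::

   SELECT create_hypertable('messages', 'timestamp');
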
.. _timescaledb_findings:

Findings
--------

Partitioned table
~~~~~~~~~~~~~~~~~

After converting the `messages` table to use timescaledb, we realized that timescaledb
uses table partitioning as well. This leads to the same issue with the foreign key
constraints that we saw with the plain partitioning approach we took.
Foreign key considerations
~~~~~~~~~~~~~~~~~~~~~~~~~~
For a better understanding of the challenges we've encountered with foreign key
constraints, here is a graphical representation of the datanommer database:

.. image:: ../_static/datanommer_db.jpeg
   :target: ../_images/datanommer_db.jpeg

So here are the issues we've faced:
- To make the `messages` table a hypertable (ie: activate the timescaledb plugin on
  it), the tables need to be empty and the data imported in a second step.
- Once the `messages` table is a hypertable, we cannot add foreign key constraints
  from the `user_messages` or `package_messages` tables to it. It is just not
  supported in timescaledb (cf.
  https://docs.timescale.com/latest/using-timescaledb/schema-management#constraints).
- We tried creating the foreign key constraints before making the `messages` table a
  hypertable and then importing the data in (tweaking the primary key and foreign keys
  to include the timestamp, following https://stackoverflow.com/questions/64570143/),
  but that resulted in an error when importing the data.

So we ended up with: keep the same data structure, but do not enforce the foreign key
constraints from `user_messages` and `package_messages` to `messages`. As that
database is mostly about inserts and has no updates or deletes, we don't foresee many
problems with this.
Duplicated messages
~~~~~~~~~~~~~~~~~~~
When testing datagrepper and datanommer in our test instance with the timescaledb
plugin, we saw a number of duplicated messages showing up in the `/raw` endpoint.
Checking if we could fix this server side, we found out that the previous database
schema had a `UNIQUE` constraint on the `msg_id` field. However, with the timescaledb
plugin, that constraint is now on both the `msg_id` and `timestamp` fields, meaning a
message can be inserted twice in the database if there is a little delay between the
two inserts.
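
A sketch of the constraint change (table and constraint names assumed):

.. code-block::

   -- Before: msg_id alone is unique.
   ALTER TABLE messages ADD CONSTRAINT uq_msg_id UNIQUE (msg_id);

   -- With timescaledb, a unique constraint must include the
   -- partitioning column, so the effective constraint becomes:
   ALTER TABLE messages ADD CONSTRAINT uq_msg_id_timestamp
       UNIQUE (msg_id, "timestamp");
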
However, migrating datanommer from fedmsg to fedora-messaging should resolve that
issue client side, as rabbitmq will ensure there is only one consumer at a time
handling a message.
Open questions
--------------
- What will upgrading the postgresql version with the timescaledb plugin look like?

  It looks like the timescaledb folks are involved enough in postgresql itself that we
  think things will work, but we have not had hands-on experience with it.

Patch
-----
Here is the patch that needs to be applied to ``datanommer/models/__init__.py`` to get
it working with timescaledb's adjusted postgresql model.
.. code-block::

   diff --git a/datanommer.models/datanommer/models/__init__.py b/datanommer.models/datanommer/models/__init__.py
   index ada58fa..7780433 100644

Lies, Damn lies and Statistics
==============================
In order to compare the performance of datagrepper in the different configurations we
looked at, we wrote a small script that runs 30 requests in 10 parallel threads.
These requests are:
- filter_by_topic
- plain_raw
- filter_by_category
- filter_by_username
- filter_by_package
- get_by_id

We then have 4 different environments:

- prod/openshift: this is an openshift deployment of datagrepper hitting the
  production database, without any configuration change.
- prod/aws: this is an AWS deployment of datagrepper, hitting its own local database,
  with the ``DEFAULT_QUERY_DELTA`` configuration key set to 3 days.
- partition/aws: this is an AWS deployment of datagrepper, hitting its own local
  postgresql database where the ``messages`` table is partitioned by ``id`` with each
  partition having 10 million records, and the ``DEFAULT_QUERY_DELTA`` configuration
  key set to 3 days.
- timescaledb/aws: this is an AWS deployment of datagrepper, hitting its own local
  postgresql database where the ``messages`` table has been partitioned via the
  `timescaledb` plugin, and the ``DEFAULT_QUERY_DELTA`` configuration key set to 3
  days.

Results
-------
Here are the results for each environment and request.

prod/openshift
~~~~~~~~~~~~~~
================== ================ ================= ================ ===============
Request            Requests per sec Mean time per Req Max time per Req Percent success
================== ================ ================= ================ ===============
filter_by_topic    0.32             NA                45.857601        0.00%
plain_raw          0.32             NA                31.955371        0.00%
filter_by_category 0.32             NA                31.632514        0.00%
filter_by_username 0.32             NA                33.549061        0.00%
filter_by_package  0.32             NA                34.531207        0.00%
get_by_id          1.57             1.575608          31.259095        86.67%
================== ================ ================= ================ ===============

prod/aws
~~~~~~~~
================== ================ ================= ================ ===============
Request            Requests per sec Mean time per Req Max time per Req Percent success
================== ================ ================= ================ ===============
filter_by_topic    7.6              1.0068            11.2743          100.00%
plain_raw          9.06             0.712975          3.323922         100.00%
filter_by_category 12.43            0.489915          1.676223         100.00%
filter_by_username 1.49             5.83623           10.661274        100.00%
filter_by_package  0                52.69256          120.229874       1.00%
get_by_id          0.73             1.534168          60.455334        83.33%
================== ================ ================= ================ ===============

partition/aws
~~~~~~~~~~~~~
================== ================ ================= ================ ===============
Request            Requests per sec Mean time per Req Max time per Req Percent success
================== ================ ================= ================ ===============
filter_by_topic    9.98             0.711219          3.204178         100.00%
plain_raw          9.70             0.641497          1.24704          100.00%
filter_by_category 13.32            0.455219          0.594465         100.00%
filter_by_username 1.3              7.084018          12.079198        100.00%
filter_by_package  0                55.231556         120.125013       1.00%
get_by_id          0.48             2.198211          60.444765        76.67%
================== ================ ================= ================ ===============

timescaledb/aws
~~~~~~~~~~~~~~~
================== ================ ================= ================ ===============
Request            Requests per sec Mean time per Req Max time per Req Percent success
================== ================ ================= ================ ===============
filter_by_topic    14.1             0.4286            0.514617         100.00%
plain_raw          12.89            0.48235           0.661073         100.00%
filter_by_category 13.94            0.423172          0.507337         100.00%
filter_by_username 2.68             3.188782          5.096244         100.00%
filter_by_package  0.26             33.216631         57.901159        100.00%
get_by_id          12.69            0.749068          1.73515          100.00%
================== ================ ================= ================ ===============

Graphs
------
Percentage of success
~~~~~~~~~~~~~~~~~~~~~

.. image:: ../_static/datanommer_percent_sucess.jpg
   :target: ../_images/datanommer_percent_sucess.jpg

Requests per second
~~~~~~~~~~~~~~~~~~~
.. image:: ../_static/datanommer_req_per_sec.jpg
   :target: ../_images/datanommer_req_per_sec.jpg

Mean time per request
~~~~~~~~~~~~~~~~~~~~~
.. image:: ../_static/datanommer_mean_per_req.jpg
   :target: ../_images/datanommer_mean_per_req.jpg

Maximum time per request
~~~~~~~~~~~~~~~~~~~~~~~~
.. image:: ../_static/datanommer_max_per_req.jpg
   :target: ../_images/datanommer_max_per_req.jpg