Merge branch 'main' of ssh://pagure.io/fedora-infra/arc into main
commit dff16dd1a9
6 changed files with 196 additions and 5 deletions
@@ -25,3 +25,5 @@ Here is the list of ideas/things we looked at:
    :maxdepth: 1

    pg_stat_statements
+   pg_partitioning
+   pg_timescaledb
docs/datanommer_datagrepper/pg_partitioning.rst (60 lines, new file)
@@ -0,0 +1,60 @@
Partitioning the database
=========================

In the database used by datanommer and datagrepper, one table stands out from
the others by its size: the ``messages`` table. This can be observed in
:ref:`datanommer`.

One possibility to speed things up in datagrepper is to partition that table
into a set of smaller partitions.

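For reference, the relative size of each table can be checked directly in
``psql``; this is a generic PostgreSQL catalog query, not anything specific to
datanommer:
::

    -- List user tables ordered by their total on-disk size (table + indexes)
    SELECT relname,
           pg_size_pretty(pg_total_relation_size(relid)) AS total_size
    FROM pg_catalog.pg_statio_user_tables
    ORDER BY pg_total_relation_size(relid) DESC;
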
Here are some resources regarding partitioning PostgreSQL tables:

* Table partitioning in PostgreSQL's documentation: https://www.postgresql.org/docs/13/ddl-partitioning.html
* How to use table partitioning to scale PostgreSQL: https://www.enterprisedb.com/postgres-tutorials/how-use-table-partitioning-scale-postgresql
* Definition of PostgreSQL Partition: https://www.educba.com/postgresql-partition/

Attempt #1
----------

For our first attempt at partitioning the ``messages`` table, we thought we
would partition it by year, with a different partition for each year.
We thus started by adding a ``year`` field to the table and filling it by
extracting the year from the ``timestamp`` field of the table.

However, one thing to realize when using partitioned tables is that each
partition needs to be considered as an independent table. This means that any
unique constraint has to include the field on which the table is partitioned.
In other words, if you partition the table by a ``year`` field, that field
will need to be part of the primary key as well as of any ``UNIQUE``
constraint on the table.

So to partition the ``messages`` table on ``year``, we had to add the ``year``
field to the primary key. However, that broke the foreign key constraints on
the ``user_messages`` and ``package_messages`` tables, which rely on the
``id`` field to link the tables.

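A minimal sketch of this first attempt, using a heavily simplified and
hypothetical schema (the real ``messages`` table has many more columns),
illustrates why the partition key ends up in the primary key:
::

    -- Sketch only: partitioning by a computed year column.
    -- PostgreSQL requires the partition key to be part of the primary key,
    -- which is what broke the foreign keys pointing at messages(id).
    CREATE TABLE messages_by_year (
        id          integer NOT NULL,
        "timestamp" timestamp without time zone NOT NULL,
        year        integer NOT NULL,
        topic       text,
        PRIMARY KEY (id, year)      -- year has to be included here
    ) PARTITION BY LIST (year);

    CREATE TABLE messages_y2020 PARTITION OF messages_by_year
        FOR VALUES IN (2020);
    CREATE TABLE messages_y2021 PARTITION OF messages_by_year
        FOR VALUES IN (2021);

    -- user_messages.msg can no longer reference messages(id) alone,
    -- since (id, year) is now the unique key.
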
Attempt #2
----------

Since partitioning on ``year`` did not work, we reconsidered and decided to
partition on the ``id`` field instead, using ``RANGE`` partitioning.

We partitioned the ``messages`` table on the ``id`` field with partitions of
10 million records each. This has the advantage of making the partitions of
similar sizes.

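A corresponding sketch for this second attempt, again with a simplified,
hypothetical schema and made-up partition names:
::

    -- Sketch only: range partitioning on id, 10 million rows per partition.
    -- Since id is the partition key, it can remain the sole primary key column.
    CREATE TABLE messages_part (
        id          integer NOT NULL,
        "timestamp" timestamp without time zone NOT NULL,
        topic       text,
        PRIMARY KEY (id)
    ) PARTITION BY RANGE (id);

    CREATE TABLE messages_p00 PARTITION OF messages_part
        FOR VALUES FROM (1) TO (10000001);
    CREATE TABLE messages_p01 PARTITION OF messages_part
        FOR VALUES FROM (10000001) TO (20000001);
    -- ...and so on, one partition per 10 million ids.
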
More resources
--------------

These are a few more resources we looked at and thought were worth bookmarking:

* Automatic partitioning by day - PostgreSQL: https://stackoverflow.com/questions/55642326/
* pg_partman, partition manager: https://github.com/pgpartman/pg_partman
* How to scale PostgreSQL 10 using table inheritance and declarative partitioning: https://blog.timescale.com/blog/scaling-partitioning-data-postgresql-10-explained-cd48a712a9a1/

docs/datanommer_datagrepper/pg_timescaledb.rst (55 lines, new file)
@@ -0,0 +1,55 @@
Using the timescaledb extension
===============================

timescaledb (https://docs.timescale.com/latest/) is a PostgreSQL extension for
time-series databases.
Considering that a lot of the actions done on datagrepper involve the timestamp
field (for example: all the messages with this topic in that time range), we
figured this extension was worth investigating.

A bonus point is that this extension is already packaged and available in
Fedora and EPEL.

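As an illustration of that access pattern (this is not datagrepper's actual
query), the typical request is a time-bounded scan, which a time-series layout
can answer by touching only the chunks covering that window:
::

    -- Example only: messages for one topic within a one-month window.
    SELECT *
    FROM messages
    WHERE topic = 'org.fedoraproject.prod.bodhi.update.comment'
      AND "timestamp" BETWEEN '2021-01-01' AND '2021-02-01'
    ORDER BY "timestamp" DESC
    LIMIT 100;
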
Resources
---------

* Setting up/enabling timescaledb: https://severalnines.com/database-blog/how-enable-timescaledb-existing-postgresql-database
* Migrating an existing database to timescaledb: https://docs.timescale.com/latest/getting-started/migrating-data#same-db

Installing/enabling/activating
------------------------------

To install the extension, simply run:
::

    dnf install timescaledb

Then edit ``/var/lib/pgsql/data/postgresql.conf`` to tell PostgreSQL to load it:
::

    shared_preload_libraries = 'pg_stat_statements,timescaledb'
    timescaledb.max_background_workers=4

The entire database server then needs to be restarted:
::

    systemctl restart postgresql

You can then check that the extension loaded properly:
::

    $ sudo -u postgres psql
    SELECT * FROM pg_available_extensions ORDER BY name;

Then, you will need to activate it for your database:
::

    $ sudo -u postgres psql <database_name>
    CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;

Finally, you can check that the extension was activated for your database:
::

    $ sudo -u postgres psql <database_name>
    \dx

@@ -107,3 +107,6 @@ CREATE INDEX messages2_datanommer_timestamp_topic_idx ON public.messages2 USING
 
 ALTER TABLE user_messages DROP CONSTRAINT user_messages_msg_fkey;
 ALTER TABLE user_messages ADD CONSTRAINT user_messages_msg_fkey FOREIGN KEY (msg) REFERENCES messages2(id);
+
+ALTER TABLE package_messages DROP CONSTRAINT package_messages_msg_fkey;
+ALTER TABLE package_messages ADD CONSTRAINT package_messages_msg_fkey FOREIGN KEY (msg) REFERENCES messages2(id);
scripts/migration_timescaledb.sql (69 lines, new file)
@@ -0,0 +1,69 @@
-- Sources
-- setting up/enabling: https://severalnines.com/database-blog/how-enable-timescaledb-existing-postgresql-database
-- migration: https://docs.timescale.com/latest/getting-started/migrating-data#same-db

-- dnf install timescaledb
-- vim /var/lib/pgsql/data/postgresql.conf
-- """
-- shared_preload_libraries = 'pg_stat_statements,timescaledb'
-- timescaledb.max_background_workers=4
-- """
-- systemctl restart postgresql
-- Check the extension is available
-- sudo -u postgres psql


\timing on

-- Check the extension is available
SELECT * FROM pg_available_extensions ORDER BY name;

-- Enable the extension on the datanommer database (\c datanommer)
CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;

-- Check that the extension is enabled

\dx


-- Create the messages2 table that will have the timescaledb optimization

CREATE TABLE messages2 (LIKE messages INCLUDING DEFAULTS INCLUDING CONSTRAINTS EXCLUDING INDEXES);

CREATE SEQUENCE public.messages2_id_seq
    START WITH 1
    INCREMENT BY 1
    NO MINVALUE
    NO MAXVALUE
    CACHE 1;

ALTER TABLE public.messages2 OWNER TO datanommer;
ALTER TABLE public.messages2_id_seq OWNER TO datanommer;
-- ALTER SEQUENCE public.messages2_id_seq OWNED BY public.messages2.id;
ALTER TABLE ONLY public.messages2 ALTER COLUMN id SET DEFAULT nextval('public.messages2_id_seq'::regclass);
GRANT SELECT ON TABLE public.messages2 TO datagrepper;
GRANT SELECT ON SEQUENCE public.messages2_id_seq TO datagrepper;

-- Convert messages2 into a hypertable chunked on the "timestamp" column

SELECT create_hypertable('messages2', 'timestamp');

-- Insert the data

INSERT INTO messages2 (i,"timestamp",certificate,signature,topic,_msg,category,source_name,source_version,msg_id,_headers,username,crypto)
SELECT i,"timestamp",certificate,signature,topic,_msg,category,source_name,source_version,msg_id,_headers,username,crypto FROM messages ORDER BY messages.id;

-- Create the indexes once the data is in

CREATE INDEX index_msg2_category ON public.messages2 USING btree (category);
CREATE INDEX index_msg2_timestamp ON public.messages2 USING btree ("timestamp");
CREATE INDEX index_msg2_topic ON public.messages2 USING btree (topic);
CREATE INDEX messages2_datanommer_timestamp_category_idx ON public.messages2 USING btree ("timestamp" DESC, category);
CREATE INDEX messages2_datanommer_timestamp_topic_idx ON public.messages2 USING btree ("timestamp" DESC, topic);


ALTER TABLE user_messages DROP CONSTRAINT user_messages_msg_fkey;
ALTER TABLE user_messages ADD CONSTRAINT user_messages_msg_fkey FOREIGN KEY (msg) REFERENCES messages2(id);

ALTER TABLE package_messages DROP CONSTRAINT package_messages_msg_fkey;
ALTER TABLE package_messages ADD CONSTRAINT package_messages_msg_fkey FOREIGN KEY (msg) REFERENCES messages2(id);
@@ -313,15 +313,16 @@ class TestAPI:
         Return: None for exception
         """
         try:
-            if auth == None:
-                resp = requests.get(url, verify=verify)
-            else:
-                resp = requests.get(url, auth=auth, verify=verify)
+            resp = requests.get(url, auth=auth, verify=verify, timeout=120)
         except Exception as ex:
             _log.error("requests.get() failed with exception:", str(ex))
             return None
 
-        _log.debug("response time in seconds: %s", resp.elapsed.total_seconds())
+        _log.debug(
+            "response time in seconds: %s -- %s",
+            resp.elapsed.total_seconds(),
+            url
+        )
 
         return resp
@@ -341,6 +342,7 @@ def main():
     print("Tests started at %s." % time.asctime())
 
     for env_name, base_url in [
+        ("datagrepper-timescalebd/aws", "http://datagrepper-timescale.arc.fedorainfracloud.org/datagrepper"),
         ("datagrepper-test/aws", "http://datagrepper-test.arc.fedorainfracloud.org/datagrepper"),
         ("datagrepper-prod/aws", "http://datagrepper.arc.fedorainfracloud.org/datagrepper"),
         ("datagrepper-prod/openshift", "https://datagrepper-monitor-dashboard.app.os.fedoraproject.org"),