Merge branch 'main' of ssh://pagure.io/fedora-infra/arc into main

Mark O'Brien 2021-02-08 11:24:20 +00:00
commit dff16dd1a9
6 changed files with 196 additions and 5 deletions


@ -25,3 +25,5 @@ Here is the list of ideas/things we looked at:
:maxdepth: 1
pg_stat_statements
pg_partitioning
pg_timescaledb


@ -0,0 +1,60 @@
Partitioning the database
=========================
In the database used by datanommer and datagrepper, one table stands out from the
others by its size: the ``messages`` table. This can be observed in
:ref:`datanommer`.
One possibility to speed things up in datagrepper is to partition that table
into a set of smaller partitions.
Here are some resources regarding partitioning PostgreSQL tables:
* Table partitioning at postgresql's documentation: https://www.postgresql.org/docs/13/ddl-partitioning.html
* How to use table partitioning to scale PostgreSQL: https://www.enterprisedb.com/postgres-tutorials/how-use-table-partitioning-scale-postgresql
* Definition of PostgreSQL Partition: https://www.educba.com/postgresql-partition/
Attempt #1
----------
For our first attempt at partitioning the ``messages`` table, we thought we would
partition it by year, with a different partition for each year.
We thus started by adding a ``year`` field to the table and filling it by extracting
the year from the ``timestamp`` field of the table.
However, one thing to realize when using a partitioned table is that each partition
needs to be considered as an independent table, meaning a unique constraint has
to involve the field on which the table is partitioned.
In other words, if you partition the table by a ``year`` field, that field will
need to be part of the primary key as well as of any ``UNIQUE`` constraint on the
table.
So to partition the ``messages`` table on ``year``, we had to add the ``year``
field to the primary key. However, that broke the foreign key constraints on
the ``user_messages`` and ``package_messages`` tables, which rely on the ``id``
field to link the tables.
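
What the above implies can be sketched as follows, with a hypothetical,
trimmed-down table (the real ``messages`` table has more columns):

::

    CREATE TABLE messages_by_year (
        id integer NOT NULL,
        year integer NOT NULL,
        "timestamp" timestamp without time zone,
        topic text,
        -- the partition key must be part of the primary key
        PRIMARY KEY (id, year)
    ) PARTITION BY RANGE (year);

    -- one partition per year
    CREATE TABLE messages_2020 PARTITION OF messages_by_year
        FOR VALUES FROM (2020) TO (2021);

Since ``id`` alone is then no longer a unique key of the partitioned table,
foreign keys pointing at it stop working.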
Attempt #2
----------
Since partitioning on ``year`` did not work, we reconsidered and decided to
partition on the ``id`` field instead, using ``RANGE`` partitioning.
We partitioned the ``messages`` table on the ``id`` field with partitions of 10
million records each. This has the advantage of making each partition a similar
size.
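
The second attempt can be sketched similarly, again with a hypothetical,
trimmed-down table:

::

    CREATE TABLE messages_by_id (
        id integer NOT NULL,
        "timestamp" timestamp without time zone,
        topic text,
        PRIMARY KEY (id)
    ) PARTITION BY RANGE (id);

    -- one partition per 10 million records
    CREATE TABLE messages_p1 PARTITION OF messages_by_id
        FOR VALUES FROM (1) TO (10000001);
    CREATE TABLE messages_p2 PARTITION OF messages_by_id
        FOR VALUES FROM (10000001) TO (20000001);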
More resources
--------------
These are a few more resources we looked at and thought were worth bookmarking:
* Automatic partitioning by day - PostgreSQL: https://stackoverflow.com/questions/55642326/
* pg_partman, partition manager: https://github.com/pgpartman/pg_partman
* How to scale PostgreSQL 10 using table inheritance and declarative partitioning: https://blog.timescale.com/blog/scaling-partitioning-data-postgresql-10-explained-cd48a712a9a1/


@ -0,0 +1,55 @@
Using the timescaledb extension
===============================
timescaledb (https://docs.timescale.com/latest/) is a PostgreSQL extension for
time-series databases.
Considering that a lot of the actions done on datagrepper involve the timestamp field
(for example: all the messages with that topic in this time range), we figured
this extension was worth investigating.
A bonus point is that this extension is already packaged and available in
Fedora and EPEL.
Resources
---------
* Setting up/enabling timescaledb: https://severalnines.com/database-blog/how-enable-timescaledb-existing-postgresql-database
* Migrating an existing database to timescaledb: https://docs.timescale.com/latest/getting-started/migrating-data#same-db
Installing/enabling/activating
------------------------------
To install the extension, simply run:
::
dnf install timescaledb
Then edit ``/var/lib/pgsql/data/postgresql.conf`` to tell PostgreSQL to load it:
::
shared_preload_libraries = 'pg_stat_statements,timescaledb'
timescaledb.max_background_workers=4
This change requires a restart of the entire database server:
::
systemctl restart postgresql
You can then check if the extension loaded properly:
::
$ sudo -u postgres psql
SELECT * FROM pg_available_extensions ORDER BY name;
Then, you will need to activate it for your database:
::
$ sudo -u postgres psql <database_name>
CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;
Finally, you can check that the extension was activated for your database:
::
$ sudo -u postgres psql <database_name>
\dx
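
With the extension activated, a table can then be converted to a hypertable
partitioned on its time column. As an example (using the ``messages2`` table
set up by the migration script in this repository), followed by a typical
time-range query:

::

    $ sudo -u postgres psql <database_name>
    SELECT create_hypertable('messages2', 'timestamp');
    SELECT count(*) FROM messages2
        WHERE "timestamp" >= now() - interval '7 days';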


@ -107,3 +107,6 @@ CREATE INDEX messages2_datanommer_timestamp_topic_idx ON public.messages2 USING
ALTER TABLE user_messages DROP CONSTRAINT user_messages_msg_fkey;
ALTER TABLE user_messages ADD CONSTRAINT user_messages_msg_fkey FOREIGN KEY (msg) REFERENCES messages2(id);
ALTER TABLE package_messages DROP CONSTRAINT package_messages_msg_fkey;
ALTER TABLE package_messages ADD CONSTRAINT package_messages_msg_fkey FOREIGN KEY (msg) REFERENCES messages2(id);


@ -0,0 +1,69 @@
-- Sources
-- setting up/enabling: https://severalnines.com/database-blog/how-enable-timescaledb-existing-postgresql-database
-- migration: https://docs.timescale.com/latest/getting-started/migrating-data#same-db
-- dnf install timescaledb
-- vim /var/lib/pgsql/data/postgresql.conf
-- """
-- shared_preload_libraries = 'pg_stat_statements,timescaledb'
-- timescaledb.max_background_workers=4
-- """
-- systemctl restart postgresql
-- Check the extension is available
-- sudo -u postgres psql
\timing on
-- Check the extension is available
SELECT * FROM pg_available_extensions ORDER BY name;
-- Enable the extension on the datanommer database (\c datanommer)
CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;
-- Check that the extension is enabled
\dx
-- Create the messages2 table that will have the timescaledb optimization
CREATE TABLE messages2 (LIKE messages INCLUDING DEFAULTS INCLUDING CONSTRAINTS EXCLUDING INDEXES);
CREATE SEQUENCE public.messages2_id_seq
START WITH 1
INCREMENT BY 1
NO MINVALUE
NO MAXVALUE
CACHE 1;
ALTER TABLE public.messages2 OWNER TO datanommer;
ALTER TABLE public.messages2_id_seq OWNER TO datanommer;
-- ALTER SEQUENCE public.messages2_id_seq OWNED BY public.messages2.id;
ALTER TABLE ONLY public.messages2 ALTER COLUMN id SET DEFAULT nextval('public.messages2_id_seq'::regclass);
GRANT SELECT ON TABLE public.messages2 TO datagrepper;
GRANT SELECT ON SEQUENCE public.messages2_id_seq TO datagrepper;
-- Convert messages2 to a hypertable partitioned on the timestamp column
SELECT create_hypertable('messages2', 'timestamp');
-- Insert the data
INSERT INTO messages2 (i,"timestamp",certificate,signature,topic,_msg,category,source_name,source_version,msg_id,_headers,username,crypto)
SELECT i,"timestamp",certificate,signature,topic,_msg,category,source_name,source_version,msg_id,_headers,username,crypto FROM messages ORDER BY messages.id;
-- Create the indexes once the data is in
CREATE INDEX index_msg2_category ON public.messages2 USING btree (category);
CREATE INDEX index_msg2_timestamp ON public.messages2 USING btree ("timestamp");
CREATE INDEX index_msg2_topic ON public.messages2 USING btree (topic);
CREATE INDEX messages2_datanommer_timestamp_category_idx ON public.messages2 USING btree ("timestamp" DESC, category);
CREATE INDEX messages2_datanommer_timestamp_topic_idx ON public.messages2 USING btree ("timestamp" DESC, topic);
ALTER TABLE user_messages DROP CONSTRAINT user_messages_msg_fkey;
ALTER TABLE user_messages ADD CONSTRAINT user_messages_msg_fkey FOREIGN KEY (msg) REFERENCES messages2(id);
ALTER TABLE package_messages DROP CONSTRAINT package_messages_msg_fkey;
ALTER TABLE package_messages ADD CONSTRAINT package_messages_msg_fkey FOREIGN KEY (msg) REFERENCES messages2(id);


@ -313,15 +313,16 @@ class TestAPI:
Return: None for exception
"""
try:
            if auth is None:
resp = requests.get(url, verify=verify)
else:
resp = requests.get(url, auth=auth, verify=verify)
resp = requests.get(url, auth=auth, verify=verify, timeout=120)
except Exception as ex:
_log.error("requests.get() failed with exception:", str(ex))
return None
_log.debug("response time in seconds: %s", resp.elapsed.total_seconds())
_log.debug(
"response time in seconds: %s -- %s",
resp.elapsed.total_seconds(),
url
)
return resp
@ -341,6 +342,7 @@ def main():
print("Tests started at %s." % time.asctime())
for env_name, base_url in [
("datagrepper-timescalebd/aws", "http://datagrepper-timescale.arc.fedorainfracloud.org/datagrepper"),
("datagrepper-test/aws", "http://datagrepper-test.arc.fedorainfracloud.org/datagrepper"),
("datagrepper-prod/aws", "http://datagrepper.arc.fedorainfracloud.org/datagrepper"),
("datagrepper-prod/openshift", "https://datagrepper-monitor-dashboard.app.os.fedoraproject.org"),