diff --git a/docs/datanommer_datagrepper/pg_array_column_postgrest.rst b/docs/datanommer_datagrepper/pg_array_column_postgrest.rst new file mode 100644 index 0000000..2be10cc --- /dev/null +++ b/docs/datanommer_datagrepper/pg_array_column_postgrest.rst @@ -0,0 +1,101 @@ +Using the array type for user and package queries +=============================== + +Currently, we use auxiliary tables to query for messages related to packages or users, +in the standard RDBS fashion. + +We came to some problems when trying to enforce foreign key constrains while using the timescaledb +extension. We decided to try, if just using a column with array type with proper indes would have simmilar performace. + + +Resources +--------- + +* PG 12 Array type: https://www.postgresql.org/docs/12/arrays.html +* GIN index: https://www.postgresql.org/docs/12/gin.html +* GIN operators for BTREE: https://www.postgresql.org/docs/11/btree-gin.html + + +Installing/enabling/activating +------------------------------ + +To have comparable results, we enabled timescaledb in same fashion as in our other experiment. + +To add new column +:: + + alter table messages2 add column packages text[]; + +To populate it +:: + + update messages2 set packages=t_agg.p_agg from + (select msg, array_agg(package) as p_agg from package_messages group by msg) as t_agg where messages.id=t_agg.msg; + +We need to enable the btree_gin extension to be able to create index with array as well as timestamp +:: + + CREATE EXTENSION btree_gin; + +To create the index +:: + + CREATE INDEX idx_msg_user on "messages2" USING GIN ("timestamp", "packages"); + +To help reuse our testing script, we setup postgrest locally +:: + + podman run --rm --net=host -p 3000:3000 -e PGRST_DB_URI=$DBURI -e PGRST_DB_ANON_ROLE="datagrepper" -e PGRST_MAX_ROWS=25 postgrest/postgrest:v7.0. + +Because we focused only on package queries, as user colun couldn't be populated due to constraints on size, +we chose two as representative. There is implicit limit to return just 25 rows. + +A simple membership: +:: + + /messages_ts?packages=ov.{{kernel}} + +A simple membership ordered by time. +:: + + /messages_ts?order=timestamp.desc&packages=ov.{{kernel}} + +Findings +-------- + +Querying just the package membership +~~~~~~~~~~~~~~~~~ + +The queries were surprisingly fast, with maximum under 4 seconds and +mean around half a second. This encouraged us to do further experiments. + +Results :: + + test_filter_by_package + Requests: 300, pass: 300, fail: 0, exception: 0 + For pass requests: + Request per Second - mean: 3.63 + Time per Request - mean: 0.522946, min: 0.000000, max: 3.907548 + +Querying just the package membership ordered by timestamp desc +~~~~~~~~~~~~~~~~~ + +Usually we want to see most recent messages. So we ammended the query, +to include "order by timestamp desc". The result was less encouraging, +with longest succesful query taking more than 90 seconds and several timing out. + +This seems to be the result of GIN index not supporting order in the index. + +Results :: + + test_filter_by_package + Requests: 300, pass: 280, fail: 0, exception: 20 + For pass requests: + Request per Second - mean: 0.53 + Time per Request - mean: 7.474040, min: 0.000000, max: 99.880939 + +Conclusion +-------------- + +While array support seems interesting, and for simple queries very fast, indexes that require ordering +don't seem to be supported. This makes strong case against using them.