Currently backups are taking 17-18 hours with 4 threads.
Now that we have 16 cpus defined there, let's bump that up to 8 and see
if that shortens things much. If not, we can look at moving to another
compression method, but the database is very large, so heavy compression
is good to save disk space.
Also, filter out another bit of backup-job output that triggers cron
emails.
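A rough sketch of the kind of step in play (the script shape, paths,
and compressor here are assumptions, not the real job):

    # hypothetical backup step; -T sets the compressor thread count
    pg_dump "${db}" | xz -T8 -6 > "/backups/${db}.sql.xz"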
Signed-off-by: Kevin Fenzi <kevin@scrye.com>
This will log an EXPLAIN plan for any query that takes more than 30s.
We will likely need to lower that threshold to catch the slow, heavy
queries that are hitting koji's db.
Enabling this does require a restart, but after that we can change the
min duration with just a reload. If there are too many logs, we can set
it to -1 to never log.
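In postgresql.conf terms, something like this (the 30s threshold is the
value from this change; the rest is the standard auto_explain setup):

    shared_preload_libraries = 'auto_explain'  # loading the module needs a restart
    auto_explain.log_min_duration = '30s'      # reloadable; -1 disables logging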
Signed-off-by: Kevin Fenzi <kevin@scrye.com>
Using the 'fix outage' clause in freeze here. ;)
Basically, adjust db-koji01 to use more memory and avoid
saturating i/o. With these settings, page loads look faster
and i/o is not saturated. We should try adding more cpus and such,
but that would require a reboot, so we're avoiding it for now.
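The commit doesn't spell out the values, but the knobs are the usual
ones; these numbers are purely illustrative, not what was deployed:

    shared_buffers = 16GB                  # let postgres use more memory
    checkpoint_completion_target = 0.9     # spread checkpoint writes to smooth i/o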
Signed-off-by: Kevin Fenzi <kevin@scrye.com>
db-koji01 is our only postgresql 15 install so far, so split out the
config from the 12 one we are using on RHEL8 to avoid making changes
there.
Also, let's try tweaking things (see the sketch below):
- I am bumping cpus up to 88
- Tweak max workers, etc.
- Try a higher i/o level, since this db server is running on a virthost
with ssds.
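Roughly, in postgresql.conf (the actual numbers here are assumptions):

    max_worker_processes = 88            # match the cpu count
    max_parallel_workers = 88
    max_parallel_workers_per_gather = 8
    effective_io_concurrency = 200       # ssd-backed virthost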
Signed-off-by: Kevin Fenzi <kevin@scrye.com>
First, we need to pipe stderr into the grep to filter out the
timescaledb warnings; |& does that.
Second, there's no reason to back up the staging database. Disable that.
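For example (the wrapper name and pattern are stand-ins, not the real
script):

    # |& is shorthand for 2>&1 |, so the grep sees stderr as well
    backup-database datanommer2 |& grep -v 'timescaledb'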
Signed-off-by: Kevin Fenzi <kevin@scrye.com>
db-datanommer02 uses timescaledb. When you do a pg_dump, there are
warnings due to this, but according to upstream they are all completely
harmless. So, to avoid an email to everyone every day, let's just try to
suppress these, while hopefully not suppressing real errors if they ever
occur.
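Something along these lines in the backup job (the pattern is a guess
at the warning text, not the real filter):

    # drop only lines matching the known-harmless timescaledb notices;
    # anything else still reaches cron mail
    backup-database datanommer2 2>&1 | grep -v 'timescaledb'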
Signed-off-by: Kevin Fenzi <kevin@scrye.com>
db-koji01 has been running with this since before the mass rebuild.
It seems to run at a higher load, but it processes queries faster and
without stalling when doing backups or when long/bad koji-gc queries
for old versions of texlive hit it.
Signed-off-by: Kevin Fenzi <kevin@scrye.com>
Turns out we were not setting effective_cache_size, even though it was
set for some servers (pagure). Adjust a few parameters on db-koji to try
to get some more performance out of it.
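For instance (the value is a placeholder, not the deployed one):

    effective_cache_size = 48GB   # was previously left at the default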
Signed-off-by: Kevin Fenzi <kevin@scrye.com>
I took the default postgresql.conf from postgresql 12 and then added in
various changes we had already made manually, plus the variable
substitutions we had set up back in the postgresql 9.2 days.
This will apply to db-koji01, db-qa01, and db-datanommer01 at least.
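The carried-over substitutions look something like this (the variable
names are illustrative, not necessarily the real ones):

    # postgresql.conf.j2 fragment (hypothetical variable names)
    shared_buffers = {{ shared_buffers }}
    effective_cache_size = {{ effective_cache_size }}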
Signed-off-by: Kevin Fenzi <kevin@scrye.com>
Basically, if the variables are defined on the host, use them;
otherwise, fall back to the current values.
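A minimal sketch of the pattern, assuming Jinja2 templating and made-up
variable names:

    # use the host's value when defined, else keep the current default
    shared_buffers = {{ shared_buffers|default('4GB') }}
    work_mem = {{ work_mem|default('4MB') }}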
Signed-off-by: Pierre-Yves Chibon <pingou@pingoured.fr>
By default, apache uses prefork and a limit of 250 workers. It's
possible that this limit was the thing causing us issues over the last
week. This moves to the event mpm and raises the limits a lot. It also
needs to raise the limits on db connections, or the increased workers
will just overload the db server.
With this setup, builders are no longer dropping out, but it's not clear
whether it has solved all the issues we have been seeing.
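The shape of the change, with made-up numbers on both sides:

    # httpd: switch to the event mpm and raise worker limits
    LoadModule mpm_event_module modules/mod_mpm_event.so
    <IfModule mpm_event_module>
        ServerLimit        32
        ThreadsPerChild    25
        MaxRequestWorkers  800
    </IfModule>

    # postgresql.conf: raise connections to match
    max_connections = 1000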
Signed-off-by: Kevin Fenzi <kevin@scrye.com>
The recent changes to postgresql_server increased the required memory
past what one of my VMs had. I've added a conditional in postgresql.conf
to put some memory settings back where they used to be (controlled by
small_postgres_instance, which defaults to false), and created a default
so the small_postgres_instance settings are not used unless specified.
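A sketch of that conditional, with placeholder values:

    {% if small_postgres_instance %}
    shared_buffers = 128MB
    effective_cache_size = 512MB
    {% else %}
    shared_buffers = {{ shared_buffers }}
    effective_cache_size = {{ effective_cache_size }}
    {% endif %}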
Turns out I had set this on the master (db-koji01), where it is ignored.
We need to set it on the standby. With a value of -1, the standby will
wait for conflicting transactions/locks to complete, however long that
takes.
If this doesn't get us a good backup on db-koji02, there's no harm done
and we can try something else.
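The setting in question is presumably the standby conflict delay, i.e.
something like:

    # on the standby only; wal replay waits indefinitely for
    # conflicting queries (such as the backup) to finish
    max_standby_streaming_delay = -1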
Signed-off-by: Kevin Fenzi <kevin@scrye.com>
I'd like to log queries on db-koji01 for a short time to try to see
what's causing us such pain.
After we have collected a bunch of queries, we can revert this until we
sort out what needs to be changed. We may also change this from logging
everything to logging just slow queries (per smooge's suggestion).
Hopefully this will get us the info we need to track this down.
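In postgresql.conf terms, something like:

    log_statement = 'all'   # noisy; a later tweak could switch to
                            # log_min_duration_statement for slow queries only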
Signed-off-by: Kevin Fenzi <kevin@scrye.com>
The current settings cause database dumps to drive the load way up
and make the entire application slow, so we need to adjust.
Using pgtune, these values might well be better (see the illustrative
numbers below):
- shared_buffers + effective_cache_size should add up to roughly total
memory.
- random_page_cost should be lowered a bunch, since we are on ssds
there; 1.1 is only slightly more than the 1.0 cost of a sequential read.
- effective_io_concurrency should also be raised a bunch for ssds.
- A few other values should be higher based on memory.
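Illustrative numbers for, say, a 64GB host (an assumption; the real
pgtune output depends on the actual RAM):

    shared_buffers = 16GB            # roughly 1/4 of ram
    effective_cache_size = 48GB      # the rest, so the two sum to total memory
    random_page_cost = 1.1           # barely above the 1.0 sequential cost
    effective_io_concurrency = 200   # ssds handle many concurrent requests
    maintenance_work_mem = 2GB       # one of the memory-based bumps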
Signed-off-by: Kevin Fenzi <kevin@scrye.com>