Database performance has degraded considerably after upgrading to 7.1

Summary
After upgrading Mattermost to version 7.1, the database performance has degraded to the point that searches are no longer functional.

Steps to reproduce
Use Mattermost 7.1, bundled with GitLab, backed by a t3.xlarge RDS instance in AWS running PostgreSQL 13.4.

Expected behavior
Mattermost remains roughly as performant as it was in previous versions, and search works correctly.

Observed behavior
After the update to 7.1, database performance became a major problem, and a large number of errors began to show up in the logs. The main effects for users are that channels and messages load slowly and search is essentially non-functional: it times out and fails to return results for known-good search terms. Here's an example of the logs we're seeing:

{"timestamp":"2022-09-07 05:26:05.956 Z","level":"error","msg":"Unable to get post counts.","caller":"web/context.go:105","path":"/api/v4/analytics/old","request_id":"yesryqp6jf87jpmj4kte7bnhbw","ip_addr":"127.0.0.1","user_id":"prxcr796d78g7mem1h4f9scx7e","method":"GET","err_where":"GetAnalytics","http_code":500,"err_details":"failed to count Posts: pq: canceling statement due to user request"}

After the problems started we bumped it from a t3.large to an xlarge, but the problems didn't get any better. We decided to give that a shot based on another post here with similar issues that boiled down to database performance (here: Unable to get the file count for the channel). Our instance is bigger than theirs was (we're around 24,000,000 posts), so it's possible it would benefit from another power bump, but the CPU utilization for our RDS instance seems to point to a massive performance drop with the 7.1 upgrade.

You can pretty clearly see the point where the upgrade to 7.1 occurred on August 31. We went from a fairly consistent utilization to spiking to 100% during business hours. You can also see when we upgraded the instance to the xlarge, and the new cpu ‘baseline’ usage dropped to around 10% during evenings and weekends, but the heavy spikes remained. There is another service that uses this RDS instance, but it did not receive any upgrades at the same time as Mattermost.

Any ideas or guidance here would be extremely appreciated.

In the above-mentioned topic (I’m the author) I also increased SSD size. I think IOPS drop was the biggest issue for our RDS. Have you tried that?

@agarciamontoro - could you help @Cronus-Z debug this? There have been several perf issues with 7.1 upgrade of late.

Ahh wait, I think I know what got introduced with 7.1. “Insights” :). That explains the CPU spikes.

@Cronus-Z - Could you please set this env var MM_FEATUREFLAGS_INSIGHTSENABLED=false and restart?

This will disable the insights widget, which I believe is creating your perf woes.
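For anyone else following along, how you set that flag depends on how Mattermost is deployed. A sketch for the GitLab Omnibus bundle used in this thread (and, commented out, the equivalent for a plain systemd install) would be something like:

```shell
# GitLab Omnibus sketch: add the env var to Mattermost's environment
# in /etc/gitlab/gitlab.rb:
#
#   mattermost['env'] = { "MM_FEATUREFLAGS_INSIGHTSENABLED" => "false" }
#
# then reconfigure and restart so the flag reaches the service:
sudo gitlab-ctl reconfigure
sudo gitlab-ctl restart mattermost

# For a standalone systemd install instead, a drop-in override would do it:
#   sudo systemctl edit mattermost
#     [Service]
#     Environment="MM_FEATUREFLAGS_INSIGHTSENABLED=false"
#   sudo systemctl restart mattermost
```

Adjust paths and service names to your own setup; the above is a sketch, not a verified recipe for every deployment.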

@mayqueeen - I’d assume disabling the above knob would fix some of your woes as well. (Although you already bumped your DB resource)

Thanks for the suggestions so far! @agnivade I've added the environment variable, and while it did something, it hasn't solved the issue. The only difference in behavior I can see is that searches no longer fail; they just seem to run forever. The searching indicator just spins and spins. Still seeing the same errors in the logs, though.

Also, since my last post I've realized the spikes in CPU utilization seem to happen any time a single user starts a search. I ran a search late at night when very few people were signed in and saw the instance spike to nearly 80% CPU usage from my one search. In addition, the database logs are full of the same kinds of errors as the Mattermost logs:

gitlab_mattermost@mattermost_production:[29772]:ERROR: canceling statement due to user request

Also, on @mayqueeen's idea about IOPS: there is a spike in read IOPS that looks like it corresponds to the CPU spikes when running a search. What did you bump your SSD capacity from/to?

We didn’t bump the storage size yet based on AWS docs that said baseline IO performance won’t improve above the ‘burst’ performance until you hit 1TiB, and we’re only at 200GiB. It’s possible those docs are out of date though, as our IOPS seem to regularly go far above what they claim is the max for burst performance.
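For reference, the gp2 arithmetic behind that decision can be sketched quickly (numbers are AWS's published gp2 formula: 3 IOPS per GiB baseline with a 100 IOPS floor, burstable to 3,000 IOPS for volumes under 1 TiB; at 1,000 GiB the baseline reaches the burst ceiling):

```shell
# gp2 baseline IOPS = 3 * size_in_GiB, floored at 100.
# Volumes under 1 TiB can burst to 3000 IOPS while burst credits last.
size_gib=200
baseline=$(( size_gib * 3 ))
[ "$baseline" -lt 100 ] && baseline=100
burst=3000
echo "baseline=${baseline} burst=${burst}"
```

So at 200 GiB the sustained floor is only 600 IOPS; anything above that for an extended period is spending burst credits, which is consistent with seeing IOPS well above the baseline for stretches and then degrading.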

Again thanks for the fast help and responses

Edit: Looks like the searches do still fail in the same way on the backend. Still a 'canceling statement due to user request', and the CPU usage drops back to normal levels after a bit. The search just doesn't appear to end in the UI.

@Cronus-Z
Interesting! What is surprising is that nothing major changed in search in 7.1. Could you let us know which version you upgraded from?

Would it be possible for you to enable slow query logging on your database and capture the exact query that is slow? The "canceling statement due to user request" error just means that the query took longer than the SqlSettings.QueryTimeout value to execute and the application cancelled it.

What we'd need to do is bump up the query timeout to stop the application from cancelling the query, so that it can show up in the slow query logs.
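A sketch of those two steps (the ALTER SYSTEM route applies to self-managed Postgres; on RDS the same parameter is set in the instance's DB parameter group instead):

```shell
# 1) Bump Mattermost's query timeout (in seconds) so the app stops
#    cancelling the query before Postgres can log it. In config.json:
#
#      "SqlSettings": { "QueryTimeout": 120 }
#
#    then restart Mattermost.

# 2) Log any statement slower than 2 seconds. On RDS, change
#    log_min_duration_statement in the DB parameter group instead
#    of using ALTER SYSTEM.
psql -d mattermost_production -c \
  "ALTER SYSTEM SET log_min_duration_statement = '2000ms';"
psql -d mattermost_production -c "SELECT pg_reload_conf();"
```

The database name and threshold above are placeholders for this thread's setup; pick a threshold low enough to catch the offending query but high enough not to flood the logs.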

Once we have the exact SQL query, we can run some debugging on that to see what is in the query that is causing the CPU spikes.

Alright, got slow query logging turned on, and I may have something. I ran a search late at night to help narrow down any slow search queries. There's a handful taking anywhere from 1-6 seconds, and then there's this monster:

[749]:LOG: duration: 59947.630 ms execute <unnamed>: SELECT p.*, (SELECT COUNT(*) FROM Posts WHERE Posts.RootId = (CASE WHEN p.RootId = '' THEN p.Id ELSE p.RootId END) AND Posts.DeleteAt = 0) AS ReplyCount FROM Posts p WHERE (CreateAt < (SELECT CreateAt FROM Posts WHERE Id = $1) AND p.ChannelId = $2 AND p.DeleteAt = $3) ORDER BY p.ChannelId, p.DeleteAt, p.CreateAt DESC LIMIT 30 OFFSET 0

There's also another instance of that same query taking 9 seconds. One notable thing: I tried changing the query timeout in Mattermost to 60 seconds and it didn't really help, but this query clocking in right at 60 seconds seems interesting given that.

Also, we've found one way to make some searches work: restricting them to a specific channel. If you make a search using the 'in:' flag, the search is near instant and doesn't cause any performance hit. As soon as the search has to go across multiple channels, it becomes extremely slow and the timeouts start.

In terms of version number: because we run Mattermost through a GitLab instance, I wasn't having much luck tracking down which GitLab version ships which Mattermost version. But based on the UI changes that also happened during the upgrade, I believe it was a 6.x version.

Great! Yes, this is the query that fetches the posts of a channel. I think what we're observing here may be a general DB performance degradation.

Would you be able to run a VACUUM VERBOSE ANALYZE posts; and let us know if that improves things?

If that doesn't improve things, would you be able to run an EXPLAIN ANALYZE on that query and share the output? I believe the slow query logs dump the query parameters as well.
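A sketch of what that would look like, using the query from the slow-query log above; the 'POST_ID' and 'CHANNEL_ID' values are placeholders to be replaced with the actual parameters dumped alongside the log entry:

```shell
# Substitute the logged $1/$2/$3 parameters before running;
# POST_ID and CHANNEL_ID below are placeholders, not real values.
psql -d mattermost_production <<'SQL'
EXPLAIN (ANALYZE, BUFFERS)
SELECT p.*,
       (SELECT COUNT(*) FROM Posts
         WHERE Posts.RootId = (CASE WHEN p.RootId = '' THEN p.Id
                                    ELSE p.RootId END)
           AND Posts.DeleteAt = 0) AS ReplyCount
  FROM Posts p
 WHERE (CreateAt < (SELECT CreateAt FROM Posts WHERE Id = 'POST_ID')
        AND p.ChannelId = 'CHANNEL_ID'
        AND p.DeleteAt = 0)
 ORDER BY p.ChannelId, p.DeleteAt, p.CreateAt DESC
 LIMIT 30 OFFSET 0;
SQL
```

The BUFFERS option is optional but shows how much data each plan node reads, which helps tell a CPU-bound sort from an I/O-bound scan.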

Also, if you haven't seen our Postgres tuning recommendations, they might be worth a look: High Availability cluster — Mattermost documentation

Hi @Cronus-Z ,

Can you try installing pg_activity and watching its output while your database is under high load? It should show you long-running queries and those with high CPU usage, which might help us find out which actions or queries are problematic on your end.