Unable to get the file count for the channel

Summary
We upgraded from 6.3.1 to ‘7.1.0’ yesterday.
We observe significant slowness in our Mattermost and a lot of log messages indicating problems with counting files and posts.

Steps to reproduce
Use 7.1.1 version, with PostgreSQL 13.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-12), 64-bit (Amazon RDS)

Expected behavior
Smooth work

Observed behavior
We observe huge noise in the logs with similar messages:

{"timestamp":"2022-08-12 11:24:06.467 Z","level":"error","msg":"Unable to get the file count for the channel","caller":"web/context.go:105","path":"/api/v4/channels/9c8ez3ag1jfitkefs6hu8gw3sh/stats","request_id":"z7xi6ac1bjbum8kswthj69hohe","ip_addr":"212.160.172.85","user_id":"7zjfgetha7r9m8bsgxsyt86tgc","method":"GET","err_where":"SqlChannelStore.GetFileCount","http_code":500,"err_details":"failed to count files with channelId=9c8ez3ag1jfitkefs6hu8gw3sh: pq: canceling statement due to user request"}

In postgres logs a lot of errors::

2022-08-12 11:48:18 UTC:192.168.122.58(43480):mattermost@mattermost:[26046]:ERROR: canceling statement due to user request
2022-08-12 11:48:18 UTC:192.168.122.58(43480):mattermost@mattermost:[26046]:STATEMENT:
SELECT
COUNT(*)
FROM
FileInfo
LEFT JOIN Posts as P ON FileInfo.PostId=P.Id
LEFT JOIN Channels as C on C.Id=P.ChannelId
WHERE
FileInfo.DeleteAt = 0
AND C.Id = $1

which probably relates to this part of the code: mattermost-server/channel_store.go at master · mattermost/mattermost-server · GitHub

We store our config in the database. SQL settings part:

"SqlSettings": {
    "DriverName": "postgres",
    "DataSource": "postgres://mattermost:pass@dbname.eu-central-1.rds.amazonaws.com:5432/mattermost?sslmode=disable\u0026connect_timeout=10",
    "DataSourceReplicas": [],
    "DataSourceSearchReplicas": [],
    "MaxIdleConns": 20,
    "ConnMaxLifetimeMilliseconds": 3600000,
    "ConnMaxIdleTimeMilliseconds": 300000,
    "MaxOpenConns": 300,
    "Trace": false,
    "AtRestEncryptKey": "redacted",
    "QueryTimeout": 30,
    "DisableDatabaseSearch": false,
    "MigrationsStatementTimeoutSeconds": 100000,
    "ReplicaLagSettings": []
}

Please advise how to fix it.

Hi @mayqueeen and welcome to the Mattermost forums!

Can you please try to run this query manually when being connected to the database with a PostrgreSQL shell and let me know how long it takes to finish?

Also, please let me know how many records you have in each of these tables:

select count(id) from FileInfo;
select count(id) from Posts;
select count(id) from Channels;

And please provide the output of:

select max(version) from db_migrations;

Hi and thank you for your prompt response.
As for the first query, I got an error about $1:

ERROR:  there is no parameter $1
LINE 9: AND C.Id = $1
                   ^
SQL state: 42P02
Character: 159

After omitting

AND C.Id = $1

the query took 3min 11s and the result is 129831.

As for

I got:
FileInfo count: 131916 (8min 56s)
Posts count: 7580473 (1hr 14 min)
Channels count: 18683 (110ms)

And finally, the output of

is 89.

For posts, I also have the result (I found such query in a different topic and I ran it) of:

EXPLAIN ANALYZE SELECT COUNT(Id) FROM Posts

it took 1hr 29min:

"QUERY PLAN"
"Finalize Aggregate  (cost=489555.50..489555.51 rows=1 width=8) (actual time=5355498.668..5355512.345 rows=1 loops=1)"
"  ->  Gather  (cost=489555.29..489555.50 rows=2 width=8) (actual time=5355490.969..5355507.734 rows=3 loops=1)"
"        Workers Planned: 2"
"        Workers Launched: 2"
"        ->  Partial Aggregate  (cost=488555.29..488555.30 rows=1 width=8) (actual time=5355010.211..5355010.814 rows=1 loops=3)"
"              ->  Parallel Seq Scan on posts  (cost=0.00..480658.23 rows=3158823 width=27) (actual time=218.090..5354586.367 rows=2526466 loops=3)"
"Planning Time: 0.965 ms"
"JIT:"
"  Functions: 14"
"  Options: Inlining false, Optimization false, Expressions true, Deforming true"
"  Timing: Generation 2.236 ms, Inlining 0.000 ms, Optimization 0.879 ms, Emission 11.850 ms, Total 14.965 ms"
"Execution Time: 5355532.145 ms"

Our RDS instance is db.t3.medium (2vCPUs, 4GiB RAM), shared with other (smaller) services.

Alright so the database performance is the killer here. I do have twice the amount of posts in my database here and it only taking 20s here:

$ time echo "select count(id) from posts;" | psql mattermost
  count
----------
 14832941
(1 row)


real    0m19.161s
user    0m0.039s
sys     0m0.008s

I don’t know the AWS instance sizes, but whatever you have here seems to be too small or not well optimized.
If you have access to the database server on a shell level, you could try to look at the output of top or htop while you’re running one of these queries and also watch the output of iostat -x 1 closely (you might need to install the packages to use them first, iostat is part of the package “sysstat”).
If iostat shows a high utiliziation, this usually means that your storage I/O is a problem and if top or htop show high usage or high %wa this means that other resources are overused and would need to be upgraded.

Hello @mayqueeen, yes we are aware that this is a problem and have a WIP PR to fix this: Denormalize Post.ChannelId on FileInfo.ChannelId by lieut-data · Pull Request #20057 · mattermost/mattermost-server · GitHub. But in your case, this seems to be pathologically slow than it should be.

I’d need some more info from your side to debug this.

  1. When you load your mattermost application, please open the browser console, and then click on any channel. You will see several fetch requests, one of which should be https://<URL>/api/v4/channels/<channel_id>/stats.
  2. Choose a channel that takes the longest to respond from that endpoint.
  3. Note the channel_id and plug that into the SQL query.
EXPLAIN (ANALYZE, BUFFERS) 
SELECT
COUNT(*)
FROM
FileInfo
LEFT JOIN Posts as P ON FileInfo.PostId=P.Id
LEFT JOIN Channels as C on C.Id=P.ChannelId
WHERE
FileInfo.DeleteAt = 0
AND C.Id = <channel_id>
  1. Can you please paste the output of this.

cc @agarciamontoro as well. Looks like we need to prioritize that PR.

I checked carefully monitoring graphs in RDS and I realized database IOPS performance was the main issue. I started with getting rid of all posts from Alert channels, since we didn’t need any history there. Then I increased SSD size in RDS (which according to AWS docs, had impact on IOPS performance) and I noticed big improvement in Mattermost responsiveness. Later, I changed the instance type to db.t4g.large (2vCPUs, 8GiB RAM).
Currently, during business hours, query

select count(id) from posts;

takes 15 secs 707 msec (and the value is 7229561)
After business hours, it went down to 1 sec 247 msec.
Also, the log now looks fine, no more Unable to get the file count for the channel messages.

@agriesser, thank you!

Awesome, thanks for confirming and marking a solution!