Bleve Index Creation fails

Summary
Bleve index (re-) creation fails in MM 7.3.0 with PostgreSQL backend.

Steps to reproduce

  • Use a MM 7.3.0 installation (not sure if the issue is related to that version) with some amount of posts.
  • Purge any existing Bleve index
  • Create a new index
  • After running for a while the indexing stops and shows an error (see below)

Expected behavior
The index creation should finish without error.

Observed behavior
We just migrated our MM installation from MySQL to PostgreSQL as the database backend.
Everything worked fine.
However, we weren’t sure if our Bleve index was up-to-date and thus first purged it and then tried to recreate it in the System Console Bleve section.
Index creation always fails after a while, as you can see in the screenshot:

The PostgreSQL log shows repeated entries like this:

2022-10-08 17:25:55.862 CEST [12552] mattermost@mattermost_z43 ERROR:  canceling statement due to user request
2022-10-08 17:25:55.862 CEST [12552] mattermost@mattermost_z43 STATEMENT:  SELECT
                                PostsQuery.*, Channels.TeamId
                        FROM (
                                SELECT
                                        *
                                FROM
                                        Posts
                                WHERE
                                        Posts.CreateAt > $1
                                OR
                                        (Posts.CreateAt = $2 AND Posts.Id > $3)
                                ORDER BY
                                        CreateAt ASC, Id ASC
                                LIMIT
                                        $4
                                )
                        AS
                                PostsQuery
                        LEFT JOIN
                                Channels
                        ON
                                PostsQuery.ChannelId = Channels.Id
                        ORDER BY CreateAt ASC, Id ASC
2022-10-08 17:25:55.862 CEST [12552] mattermost@mattermost_z43 LOG:  disconnection: session time: 0:00:34.738 user=mattermost database=mattermost_z43 host=::1 port=44801

This happened on two different MM instances: one migrated from MySQL to PostgreSQL on MM version 7.3.0, and another one migrated from MySQL to PostgreSQL on MM version 5.39.3 and upgraded to 7.3.0 afterwards. Identical failures on both systems.

Hi zaucker and welcome to the Mattermost forums!

Your database is too slow and therefore the queries time out. There’s a setting in your config.json which you could try to increase: SqlSettings.QueryTimeout.

https://docs.mattermost.com/configure/configuration-settings.html#query-timeout-seconds

But please watch your DB server’s disk and CPU usage while re-running the index creation, there might be some resource exhaustion responsible for what you’re seeing here.

After changing the value, you’d have to restart your Mattermost server to initialize the connection pool with the new value.
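
For reference, here is a trimmed-down sketch of the relevant block in config.json (30 seconds is the default, and "DriverName" is only shown for context; the exact value to use is something to experiment with rather than a recommendation):

    "SqlSettings": {
        "DriverName": "postgres",
        "QueryTimeout": 30
    },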

Hi Alexander
thanks very much, I was suspecting a timeout, but couldn’t find the config setting.
I don’t think it is a database performance problem, though, but rather the amount of files and posts to be indexed.
I'll try again with a higher timeout setting and will report what I find here.

The indexing works in batches, and you can also configure the batch size in your config.json. These are the default values for Bleve (at least that’s what I found on one of my 7.2.0 installations where I never touched the Bleve config):

    "BleveSettings": {
        "IndexDir": "",
        "EnableIndexing": false,
        "EnableSearching": false,
        "EnableAutocomplete": false,
        "BatchSize": 10000
    },

If increasing the timeout doesn’t help, you could try to reduce the batch size.

Hello @zaucker - since you mentioned that you migrated from MySQL to Postgres, I’d like to take a look at the schema of your posts table as well just to ensure everything is working as expected. Would you be able to post the output of \dS+ posts ?

Hi Agniva
thanks for your offer to help.
The migration from MySQL (or MariaDB to be precise) worked well. I used pgloader for that.

After lowering the batch size to 1000 and increasing the QueryTimeout to 3000 (both probably way more than needed), the indexing worked just fine on our first MM instance.
On our larger instance the indexer is still running, but as it aborted after a few minutes before, I expect it to also finish sometime today.

Best regards,

Fritz

The queries aren’t meant to take that much time. That’s why I was curious to look at your schema to confirm that the indexes are all properly set up. Would you be able to post the output of \dS+ posts ?

Sorry, here is the schema:

\dS+ posts
                                                      Table "public.posts"
    Column    |           Type           | Collation | Nullable | Default | Storage  | Compression | Stats target | Description 
--------------+--------------------------+-----------+----------+---------+----------+-------------+--------------+-------------
 id           | character varying(26)    |           | not null |         | extended |             |              | 
 createat     | bigint                   |           |          |         | plain    |             |              | 
 updateat     | bigint                   |           |          |         | plain    |             |              | 
 editat       | bigint                   |           |          |         | plain    |             |              | 
 deleteat     | bigint                   |           |          |         | plain    |             |              | 
 ispinned     | boolean                  |           |          |         | plain    |             |              | 
 userid       | character varying(26)    |           |          |         | extended |             |              | 
 channelid    | character varying(26)    |           |          |         | extended |             |              | 
 rootid       | character varying(26)    |           |          |         | extended |             |              | 
 originalid   | character varying(26)    |           |          |         | extended |             |              | 
 message      | character varying(65535) |           |          |         | extended |             |              | 
 type         | character varying(26)    |           |          |         | extended |             |              | 
 props        | jsonb                    |           |          |         | extended |             |              | 
 hashtags     | character varying(1000)  |           |          |         | extended |             |              | 
 filenames    | character varying(4000)  |           |          |         | extended |             |              | 
 fileids      | character varying(300)   |           |          |         | extended |             |              | 
 hasreactions | boolean                  |           |          |         | plain    |             |              | 
 remoteid     | character varying(26)    |           |          |         | extended |             |              | 
Indexes:
    "posts_pkey" PRIMARY KEY, btree (id)
    "idx_posts_channel_id_delete_at_create_at" btree (channelid, deleteat, createat)
    "idx_posts_channel_id_update_at" btree (channelid, updateat)
    "idx_posts_create_at" btree (createat)
    "idx_posts_create_at_id" btree (createat, id)
    "idx_posts_delete_at" btree (deleteat)
    "idx_posts_hashtags_txt" gin (to_tsvector('english'::regconfig, hashtags::text))
    "idx_posts_is_pinned" btree (ispinned)
    "idx_posts_message_txt" gin (to_tsvector('english'::regconfig, message::text))
    "idx_posts_root_id_delete_at" btree (rootid, deleteat)
    "idx_posts_update_at" btree (updateat)
    "idx_posts_user_id" btree (userid)
Access method: heap

Current run time:

Some info about the data directory:

mattermost@freddielx:~/mattermost/data $ du -sh .
13G     .
mattermost@freddielx:~/mattermost/data $ find . | wc
  69651  140694 8121842

And here my current config settings:

    "BleveSettings": {
        "IndexDir": "/home/mattermost/mattermost/index",
        "EnableIndexing": true,
        "EnableSearching": false,
        "EnableAutocomplete": false,
        "BulkIndexingTimeWindowSeconds": 3600,
        "BatchSize": 1000
    },

Thanks for the info! So it looks like the schema is correct, and the query should be able to pick the right index. I would love to see an EXPLAIN ANALYZE of the slow query, but I am thinking it might be difficult for you to get the actual parameter values. (This needs some additional work.)

As a quick check, would you be able to run VACUUM VERBOSE ANALYZE posts; and check if that improves things?

If that doesn’t work, we can look into getting the parameter values and running an EXPLAIN.
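
One way to get at the bound parameter values, assuming superuser access on the Postgres side, would be to temporarily lower log_min_duration_statement so that slow statements are logged together with their parameters (the threshold here is illustrative):

    -- log every statement that runs longer than 5 seconds
    ALTER SYSTEM SET log_min_duration_statement = '5s';
    SELECT pg_reload_conf();
    -- such statements should then appear in the Postgres log together with a
    -- "DETAIL:  parameters: $1 = ..." line, which gives the values for an EXPLAIN ANALYZE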

cc @agarciamontoro for visibility

This is the output of the command:

VACUUM VERBOSE ANALYZE posts; 
INFO:  vacuuming "public.posts"
INFO:  table "posts": index scan bypassed: 196 pages from table (0.32% of total) have 545 dead item identifiers
INFO:  launched 2 parallel vacuum workers for index cleanup (planned: 2)
INFO:  index "idx_posts_hashtags_txt" now contains 1568585 row versions in 239 pages
DETAIL:  0 index row versions were removed.
0 index pages were newly deleted.
20 index pages are currently deleted, of which 20 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.05 s.
INFO:  index "idx_posts_message_txt" now contains 1568585 row versions in 10686 pages
DETAIL:  0 index row versions were removed.
0 index pages were newly deleted.
225 index pages are currently deleted, of which 225 are currently reusable.
CPU: user: 0.12 s, system: 0.57 s, elapsed: 4.63 s.
INFO:  table "posts": found 143 removable, 1568585 nonremovable row versions in 62137 out of 62137 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 84809
Skipped 0 pages due to buffer pins, 0 frozen pages.
CPU: user: 0.58 s, system: 5.25 s, elapsed: 41.25 s.
INFO:  vacuuming "pg_toast.pg_toast_86252"
INFO:  table "pg_toast_86252": found 0 removable, 291 nonremovable row versions in 59 out of 1033 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 84853
Skipped 0 pages due to buffer pins, 0 frozen pages.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.01 s.
INFO:  analyzing "public.posts"
INFO:  "posts": scanned 30000 of 62137 pages, containing 756634 live rows and 319 dead rows; 30000 rows in sample, 1567166 estimated total rows
VACUUM

In the meantime the indexing finished after 3705 minutes … I am a bit reluctant to re-run it, as it might take another 2.5 days. Or can I abort it in case it is slow again, and will the old index then be used?

But if you think it might help to improve things, I think I can do it. I would then enable the slow query log on this Mattermost instance (I only had it enabled on the other, smaller instance), although the result will probably be the same.

I ran the slow query with the table’s minimal createat and the corresponding id (not sure this makes sense):

SELECT
                                PostsQuery.*, Channels.TeamId
                        FROM (
                                SELECT
                                        *
                                FROM
                                        Posts
                                WHERE
                                        Posts.CreateAt > 1487863828995  OR
                                        (Posts.CreateAt = 1487863828995 AND Posts.Id > 'zk8ng314kpgkf8n7joonrufiga') ORDER BY
                                        CreateAt ASC, Id ASC
                                LIMIT 1000  )
                        AS
                                PostsQuery
                        LEFT JOIN
                                Channels
                        ON
                                PostsQuery.ChannelId = Channels.Id
                        ORDER BY CreateAt ASC, Id ASC;
mattermost_z43=# explain analyze SELECT
                                PostsQuery.*, Channels.TeamId
                        FROM (
                                SELECT
                                        *
                                FROM
                                        Posts
                                WHERE
                                        Posts.CreateAt > 1487863828995  OR
                                        (Posts.CreateAt = 1487863828995 AND Posts.Id > 'zk8ng314kpgkf8n7joonrufiga') ORDER BY
                                        CreateAt ASC, Id ASC
                                LIMIT 1000  )
                        AS
                                PostsQuery
                        LEFT JOIN
                                Channels
                        ON
                                PostsQuery.ChannelId = Channels.Id
                        ORDER BY CreateAt ASC, Id ASC;
                                                                             QUERY PLAN                                                                             
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=544.66..547.16 rows=1000 width=334) (actual time=5.341..5.387 rows=1000 loops=1)
   Sort Key: posts.createat, posts.id
   Sort Method: quicksort  Memory: 328kB
   ->  Hash Left Join  (cost=272.39..494.83 rows=1000 width=334) (actual time=3.615..5.052 rows=1000 loops=1)
         Hash Cond: ((posts.channelid)::text = (channels.id)::text)
         ->  Limit  (cost=0.43..210.24 rows=1000 width=333) (actual time=0.019..1.218 rows=1000 loops=1)
               ->  Index Scan using idx_posts_create_at_id on posts  (cost=0.43..328805.53 rows=1567165 width=333) (actual time=0.017..1.165 rows=1000 loops=1)
                     Filter: ((createat > '1487863828995'::bigint) OR ((createat = '1487863828995'::bigint) AND ((id)::text > 'zk8ng314kpgkf8n7joonrufiga'::text)))
                     Rows Removed by Filter: 1
         ->  Hash  (cost=207.54..207.54 rows=5154 width=28) (actual time=3.578..3.578 rows=5156 loops=1)
               Buckets: 8192  Batches: 1  Memory Usage: 372kB
               ->  Seq Scan on channels  (cost=0.00..207.54 rows=5154 width=28) (actual time=0.011..1.801 rows=5156 loops=1)
 Planning Time: 0.830 ms
 Execution Time: 5.466 ms
(14 rows)

But I guess that is not the most characteristic one, and I would have to determine other query parameters.
I am surprised about the sequential scan on Channels, as id is its primary key and thus indexed.
Unfortunately, I only ran this query after the VACUUM.
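
To determine other parameter pairs, I could probably just pick them straight from the Posts table, e.g. (the OFFSET is only an arbitrary position to get values from further into the table):

    -- first (createat, id) pair, i.e. roughly the values used above
    SELECT createat, id FROM posts ORDER BY createat ASC, id ASC LIMIT 1;
    -- a pair from further into the table, to test a batch that starts later
    SELECT createat, id FROM posts ORDER BY createat ASC, id ASC OFFSET 1000000 LIMIT 1;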

BTW, the indexing was progressing more and more slowly (based on the ratio of completion percentage to runtime reported in the System Console).

Could it be that the slow indexing I am seeing doesn’t have to do with indexing Posts, but with indexing of the uploaded files?

Everything seems to be correct. The query is using the right index.

I ran the slow query with the table’s minimal createat and the corresponding id (not sure this makes sense):

It’s not the actual query, but logically it makes sense.

Let’s see how long it takes the next time the indexing runs.

Could it be that the slow indexing I am seeing doesn’t have to do with indexing Posts, but with indexing of the uploaded files?

From your error, and the postgres logs, it has to do with that specific SQL query taking a long time.

Should I manually run the indexer again?

Fritz Zaucker

Ah no, that’s not what I meant. The indexing job will automatically run again to index new entries. We can then see how much time it takes.