Slow loading of channels containing images

Hi,

Looking to move away from Slack for a small team (~15 users) I am trying out Mattermost. I have installed Mattermost via Docker on an Ubuntu server following the steps in this guide (Install Mattermost via Docker — Mattermost documentation) then successfully imported the message history and files from the Slack instance using the mmetl tool and bulk import.

Everything is working fine, except that opening a channel whose recent messages contain images takes a long time (sometimes ~30 seconds). Channels without images in recent messages open instantly.

During the wait, a “Loading…” message is displayed. After loading completes, opening another channel and coming back to the first one is instantaneous. But if the app is closed then opened again, the same channel takes a long time to open once again.

This issue affects all clients I’ve tried (web, Windows, Android).

Watching top on the server while the “Loading…” message is displayed on the client shows 100% CPU load for the mattermost process for the duration of the message. Nothing is written to the server logs.

How can I troubleshoot this further?

Hi jlc and welcome to the Mattermost forums!

What Mattermost server version are you using, and what storage backend are you using: local file storage or S3? Did you configure any settings that differ from the default values, e.g. a local image proxy or the like?
Can you also see high disk I/O on the system while the mattermost process is at 100% CPU usage (you could check that with sysstat -x 1), or any high I/O wait (you can check that using top)? What database backend are you using?

Hi Alexander, mattermost version returns Version: 7.1.5. The file storage system I am using is the default Local File System. Most options are still set to their default values; in particular, Image Proxy is disabled and the database is PostgreSQL.

I installed the sysstat package but it does not include a sysstat command. It comes with an iostat command though; here’s an example output:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          98.00    0.00    1.00    1.00    0.00    0.00

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
sda              0.00      0.00     0.00   0.00    0.00     0.00   58.00    248.00     1.00   1.69    0.43     4.28    0.00      0.00     0.00   0.00    0.00     0.00    1.00    0.00    0.03   0.80

CPU load can stay at ~100% for over 30 seconds, and the iostat values change very little during that time:

  • There are very few reads
  • There is a constant stream of writes, but nothing concerning (248.00 kB/s above was the top speed I recorded)
  • iowait is always close to zero.

It sounds like CPU load really is my main concern. Is there some kind of sampling profiler I could use to get an idea of what is going on?

Hi @jlc ,

You can set MetricsSettings.Enable = true in your config.json (see Environment configuration settings — Mattermost documentation).
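For reference, and assuming the default config.json layout, the relevant block looks roughly like this (other keys omitted; :8067 is the default listen address and is what the curl command below relies on; a manual edit of config.json may need a server restart to take effect):

"MetricsSettings": {
    "Enable": true,
    "ListenAddress": ":8067"
}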

Then, during the period when CPU is at 100%, you can take a CPU profile like so: curl http://localhost:8067/debug/pprof/profile?seconds=10 -o cpu.prof
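Once you have the file, you can also take a first look at it yourself with the Go toolchain, something like:

go tool pprof -top cpu.prof          # text summary of the hottest functions
go tool pprof -http=:8080 cpu.prof   # interactive browser UI, including a flame graph view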

If you can attach the cpu.prof here, we can take a look and see what’s going on.

Sorry, yes, I meant iostat -x 1 and thanks for confirming that it’s really only the CPU and no other relevant system metrics are out of bounds.

You probably wanted to start with an ESR version and that’s why you chose 7.1.5, but it would be interesting to see if you can also reproduce that on a newer version (7.7.1 got released just yesterday).

I took a longer capture (80 seconds) to get the entirety of the high CPU activity:

  • issued the profile command when the client was idle, server CPU at 0%
  • at t=5 seconds I clicked on a channel that I know contains image attachments in the most recent messages. The “Loading” message appeared. Server CPU went up to 100%
  • at t=28 seconds the channel’s messages finally showed up, but with image thumbnails empty. Server CPU still 100%. Thumbnails gradually appear.
  • at t=63 seconds the last thumbnail finally appears, and server CPU drops back to 0%

I have uploaded the capture here: cpu.prof - Google Drive

Here’s the output of go tool pprof -top cpu.prof:

File: mattermost
Type: cpu
Time: Jan 21, 2023 at 10:50pm (CET)
Duration: 80s, Total samples = 54.80s (68.50%)
Showing nodes accounting for 53.35s, 97.35% of 54.80s total
Dropped 181 nodes (cum <= 0.27s)
      flat  flat%   sum%        cum   cum%
    26.79s 48.89% 48.89%     34.77s 63.45%  github.com/disintegration/imaging.resizeHorizontal.func1
     7.53s 13.74% 62.63%      7.63s 13.92%  github.com/disintegration/imaging.(*scanner).scan
     5.90s 10.77% 73.39%     10.08s 18.39%  image/jpeg.(*decoder).reconstructBlock
     4.07s  7.43% 80.82%      4.14s  7.55%  image/jpeg.idct
     2.83s  5.16% 85.99%     18.43s 33.63%  image/jpeg.(*decoder).processSOS
     2.03s  3.70% 89.69%      3.32s  6.06%  image/jpeg.(*decoder).decodeHuffman
     1.75s  3.19% 92.88%      1.81s  3.30%  image/jpeg.(*decoder).receiveExtend
     0.99s  1.81% 94.69%      1.32s  2.41%  image/jpeg.(*decoder).ensureNBits
     0.52s  0.95% 95.64%      0.52s  0.95%  runtime.asyncPreempt
     0.41s  0.75% 96.39%      0.86s  1.57%  runtime.scanobject
     0.29s  0.53% 96.92%      0.29s  0.53%  runtime.memclrNoHeapPointers
     0.24s  0.44% 97.35%      0.33s   0.6%  image/jpeg.(*decoder).readByteStuffedByte
         0     0% 97.35%     34.96s 63.80%  github.com/disintegration/imaging.parallel.func1
         0     0% 97.35%     11.66s 21.28%  github.com/getsentry/sentry-go/http.(*Handler).handle.func1
         0     0% 97.35%     11.66s 21.28%  github.com/gorilla/mux.(*Router).ServeHTTP
         0     0% 97.35%     11.66s 21.28%  github.com/mattermost/gziphandler.GzipHandlerWithOpts.func1.1
         0     0% 97.35%      2.49s  4.54%  github.com/mattermost/mattermost-server/v6/api4.getFilePreview
         0     0% 97.35%      9.14s 16.68%  github.com/mattermost/mattermost-server/v6/api4.getFileThumbnail
         0     0% 97.35%     11.59s 21.15%  github.com/mattermost/mattermost-server/v6/app.(*App).GetFileInfo
         0     0% 97.35%     18.77s 34.25%  github.com/mattermost/mattermost-server/v6/app.(*App).generateMiniPreview
         0     0% 97.35%      7.18s 13.10%  github.com/mattermost/mattermost-server/v6/app.(*App).generateMiniPreviewForInfos.func1
         0     0% 97.35%     18.55s 33.85%  github.com/mattermost/mattermost-server/v6/app.prepareImage
         0     0% 97.35%     18.43s 33.63%  github.com/mattermost/mattermost-server/v6/app/imaging.(*Decoder).DecodeMemBounded
         0     0% 97.35%     11.64s 21.24%  github.com/mattermost/mattermost-server/v6/web.Handler.ServeHTTP
         0     0% 97.35%     18.43s 33.63%  image.Decode
         0     0% 97.35%     18.43s 33.63%  image/jpeg.(*decoder).decode
         0     0% 97.35%     18.43s 33.63%  image/jpeg.Decode
         0     0% 97.35%     11.66s 21.28%  net/http.(*conn).serve
         0     0% 97.35%     11.66s 21.28%  net/http.HandlerFunc.ServeHTTP
         0     0% 97.35%     11.66s 21.28%  net/http.serverHandler.ServeHTTP
         0     0% 97.35%      0.91s  1.66%  runtime.gcBgMarkWorker
         0     0% 97.35%      0.90s  1.64%  runtime.gcBgMarkWorker.func2
         0     0% 97.35%      0.90s  1.64%  runtime.gcDrain
         0     0% 97.35%      0.32s  0.58%  runtime.makeslice
         0     0% 97.35%      0.33s   0.6%  runtime.mallocgc
         0     0% 97.35%      0.93s  1.70%  runtime.systemstack

The flame graph representation seems to imply there are multiple stages to the issue.

Is it expected that the server decodes image attachments from the channel instead of just sending them down the network? Is this trace showing the server re-generating previews and thumbnails?

I have browsed a few folders in /home/mattermost/docker/volumes/app/mattermost/data/<date>/teams/<hash>/channels/<hash>/users/<hash>/<hash> and I can confirm each folder contains three images: img.jpg, img_preview.jpg, img_thumbnail.jpg. Their timestamps are from when I bulk imported the Slack history into Mattermost.

Thanks, that’s interesting. I’m not sure how to properly read this file, so we’ll have to wait for @agnivade to do his magic, but what immediately comes to mind is that the server tries to generate new thumbnails, fails to save them, and therefore does the same work over and over again instead of just once. You said the thumbnails and image files date from when you imported from Slack: could it be that the data directory and the files in there are not writable by the mattermost application user?
When you create a new channel and post some images in there, can you reproduce the behaviour in that channel too, and does the filesystem for that channel look similar to that of an existing channel (with regards to the images available there)?
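A quick way to check the ownership side of that, assuming the data path you mentioned and a container named mattermost (adjust both to your setup):

ls -ln /home/mattermost/docker/volumes/app/mattermost/data   # numeric owner/group of the imported files
docker top mattermost                                        # UID the server process actually runs as
# if they do not match, something like this should fix it
# (2000:2000 is what the Docker install guide uses, if I remember correctly):
sudo chown -R 2000:2000 /home/mattermost/docker/volumes/app/mattermost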

Hi @jlc - thanks for uploading the profile. It looks like the issue has to do with minipreviews. This is a third category of images, apart from thumbnails and previews, and it is stored in the database itself.

The problem seems to be that these images don’t have a minipreview in the DB, so MM tries to generate one on the fly. But it should not keep doing that once the minipreview has been generated.

Can you turn on debug logging, and check if you see any lines like below:
  • error reading image file
  • generateMiniPreview: prepareImage failed
  • Unable to generate mini preview image
  • creating mini preview failed

It seems like MM is not able to save the minipreview in the DB.
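Once debug logging is on, you could also grep the server log for those messages directly, something along these lines (the path assumes the default Docker volume layout from the install guide, adjust it to your setup):

grep -E "error reading image file|generateMiniPreview|mini preview" \
    /home/mattermost/docker/volumes/app/mattermost/logs/mattermost.log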

It appears I was mistaken when I said in my first post that there wasn’t anything in the logs. I thought the content of System Console > Server Logs was controlled by the Console Log Level setting, but it is actually controlled by File Log Level; sorry I missed that. Now that I’ve set the latter to DEBUG, I can see the following output during the high CPU activity phase:

{"timestamp":"2023-01-23 22:17:03.085 +01:00","level":"debug","msg":"GetImageOrientation failed","caller":"app/file.go:1047","error":"failed to decode exif data: exif: failed to find exif intro marker"}
{"timestamp":"2023-01-23 22:17:03.736 +01:00","level":"debug","msg":"Received HTTP request","caller":"web/handlers.go:156","method":"GET","url":"/api/v4/files/7u95r3xrtpr1mgrf9d3du46r4c/thumbnail","request_id":"zgeprekcfjb3ucwnppxptw5qmh","status_code":"304"}
{"timestamp":"2023-01-23 22:17:04.503 +01:00","level":"debug","msg":"GetImageOrientation failed","caller":"app/file.go:1047","error":"failed to decode exif data: EOF"}
{"timestamp":"2023-01-23 22:17:05.294 +01:00","level":"debug","msg":"Received HTTP request","caller":"web/handlers.go:156","method":"GET","url":"/api/v4/files/yzkhgow1dbngt8hkrgzrxed3tc/thumbnail","request_id":"w8yzos49wfgamr34y7bjfp8g7r","status_code":"304"}
{"timestamp":"2023-01-23 22:17:06.072 +01:00","level":"debug","msg":"GetImageOrientation failed","caller":"app/file.go:1047","error":"failed to decode exif data: EOF"}

(it goes on for 30 seconds)
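For anyone finding this later: the equivalent keys in config.json live under LogSettings; from memory the relevant part looks something like this (other keys omitted):

"LogSettings": {
    "EnableConsole": true,
    "ConsoleLevel": "DEBUG",
    "EnableFile": true,
    "FileLevel": "DEBUG"
}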

That GetImageOrientation failed error was already noticed last fall by another user, who submitted a fix that was merged to the master branch on October 12, 2022. That commit wasn’t picked up for the latest updates of ESR 7.1, which is why it still affects default Docker installations of Mattermost.

I pointed my .env file to the 7.7 tag, and I can confirm the issue is gone. There is still high CPU activity, but it is limited to the first time a channel is opened; after that, even after restarting the client, channels open super fast.
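For reference, assuming the standard docker-compose setup from the install guide, the change boiled down to something like this (the variable name comes from the .env in the mattermost/docker repository; add whatever -f compose-file overrides you normally start the stack with):

# in .env
MATTERMOST_IMAGE_TAG=7.7

# then pull the new image and recreate the container
docker compose pull
docker compose up -d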

Thanks a lot for your help!

Awesome, that’s great to hear - thanks for posting the solution and good to know you’re up and running now! 7.8 (the next ESR version) will be released in February, so you can switch to it in a few weeks in order to be on an ESR branch instead of going with 7.1.