Hi everybody,
My stand-alone Mattermost server suddenly freezes and stops responding.
It is installed on a VMware VM with 8 GB of RAM and 60 GB of storage.
On the VMware host I see that RAM usage is below 4 GB.
Less than 20 GB of the storage is used.
I checked syslog, and the last entry at the time of the freeze was this:
{"timestamp":"2023-04-08 13:32:05.149 +03:30","level":"debug","msg":"Received HTTP request","caller":"web/handlers.go:171","method":"POST", …
Nothing relevant can be found in the Mattermost log, kern.log, or auth.log.
The top command shows this:
1575 madmin 20 0 23612 4380 3452 R 0.7 0.1 0:41.13 top
1 root 20 0 169048 13064 8476 S 0.0 0.2 0:10.61 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.01 kthreadd
3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_gp
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_par+
Do you use Docker or one of the native installations? Which Mattermost server version do you have installed, and can you see anything in the output of dmesg when that happens?
The server OS is:
Description: Ubuntu 20.04.2 LTS
Release: 20.04
The Mattermost server version is 7.8.1.
I believe my installation is native; I have a mattermost folder in /opt/.
Actually, at the time of the problem the Mattermost server is completely unreachable, the Ubuntu network is unavailable, and even the VMware remote console doesn’t show anything (only a frozen black screen or a frozen login page), so I can’t use SSH or a terminal. I can only power the VM off and on again to recover connectivity, so dmesg only shows information from after the reboot.
syslog shows this around the time of the problem, and the NULs let me find the exact time it happened. I can’t make out anything more from it:
Apr 8 21:15:04 mattermost mattermost[1079]: {"timestamp":"2023-04-08 21:15:04.547 +03:30","level":"debug","…"}
NUL NUL NUL NUL NUL NUL NUL NUL NUL NUL NUL NUL NUL NUL NUL NUL NUL Apr 8 21:17:42 mattermost systemd-modules-load[383]: Inserted module 'lp'
Apr 8 21:17:42 mattermost systemd-modules-load[383]: Inserted module 'ppdev'
Sometimes it happens in less than an hour, and sometimes only after several days; there is no clear pattern.
OK, that’s interesting - the NUL entries may be caused by a corrupt filesystem as a result of the abrupt reboot. Do you use XFS by any chance? I only know these NUL entries in logs from using XFS in the past.
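In case it helps to pin down the freeze times, here is a minimal Python sketch that looks for runs of NUL bytes in a log and reports the last line before each run and the first text after it. The function name and the sample bytes are made up for illustration; to use it on a real log you would read the file in binary mode (for example /var/log/syslog, which is an assumption about where your syslog lives).

```python
import re

# One or more consecutive NUL bytes, as left behind after an unclean shutdown.
NUL_RUN = re.compile(rb"\x00+")


def lines_around_nul_runs(data: bytes):
    """Return (last_line_before, first_text_after) for each NUL run in data."""
    results = []
    for match in NUL_RUN.finditer(data):
        # Last complete line written before the NUL run started.
        before = data[:match.start()].rstrip(b"\n").rsplit(b"\n", 1)[-1]
        # Text following the NUL run, up to the end of that line.
        after = data[match.end():].split(b"\n", 1)[0]
        results.append(
            (before.decode("utf-8", "replace"), after.decode("utf-8", "replace"))
        )
    return results


# Made-up sample that mimics the shape of the syslog excerpt above.
sample = (
    b"Apr  8 21:15:04 mattermost mattermost[1079]: last entry before the freeze\n"
    + b"\x00" * 17
    + b"Apr  8 21:17:42 mattermost systemd-modules-load[383]: Inserted module 'lp'\n"
)

for before, after in lines_around_nul_runs(sample):
    print("before:", before)
    print("after: ", after)
```

The gap between the two timestamps only bounds the freeze time; entries still sitting in memory when the VM froze never reached the disk.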
This sounds like a kernel panic or an out-of-memory condition on your system - do you have some kind of monitoring available that would show any changes in memory or CPU usage before the crash happens? 8 GB of RAM should usually be more than enough for running Mattermost inside a virtual machine. Is the kernel up to date and compatible with your hypervisor?
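If no monitoring is in place, a small script can record memory and load at short intervals and force each sample to disk, so the last values before a freeze survive the power cycle. Below is a minimal Python sketch under some assumptions: it reads the standard Linux /proc/meminfo and /proc/loadavg files, and the log path, the 5-second interval, and all function names are arbitrary choices for illustration.

```python
import os
import time


def parse_meminfo(text: str) -> dict:
    """Parse the content of /proc/meminfo into {field: integer value}."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def sample_line(meminfo_text: str, loadavg_text: str, now: float) -> str:
    """Format one log line from raw /proc contents and a Unix timestamp."""
    mem = parse_meminfo(meminfo_text)
    load1 = loadavg_text.split()[0]  # 1-minute load average
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return (
        f"{stamp} mem_available_kb={mem.get('MemAvailable')} "
        f"swap_free_kb={mem.get('SwapFree')} load1={load1}"
    )


def run(log_path: str = "/var/tmp/freeze-monitor.log", interval: float = 5.0) -> None:
    """Append one sample every `interval` seconds until interrupted."""
    with open(log_path, "a") as log:
        while True:
            with open("/proc/meminfo") as mem, open("/proc/loadavg") as load:
                line = sample_line(mem.read(), load.read(), time.time())
            log.write(line + "\n")
            log.flush()
            os.fsync(log.fileno())  # ask the kernel to write the sample to disk now
            time.sleep(interval)
```

Calling `run()` in a terminal multiplexer or from a systemd unit would start the sampling. The fsync call is there because buffered log data appears to be lost at the moment of the freeze.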
Thank you for your reply.
There is no XFS on the storage; the filesystem types are devtmpfs, tmpfs, ext4, squashfs and vfat.
I have now been running htop and monitoring memory and CPU. The server has not crashed for more than 4 days. Memory usage is about 625 MB of 8 GB, and swap usage is 0 of 2 GB.
The VM itself has been working for more than 3 years without any problems.
It seems the only way forward for now is to keep monitoring memory and CPU usage.
Did you maybe get a kernel or virtualization update recently? Any events happening outside the virtual machine (VM migrations, snapshots, etc.) that could be the reason for that?
No, this phenomenon appeared after the Mattermost upgrade.
It occurs mostly on Saturdays!
Today I faced it again! It happened 4 times between 11:30 and 17:00 (local time), and now it has been more than 9 hours without any problem. Last Saturday I had it between 14:00 and 18:00, and then the server worked non-stop for 7 days.
The latest entries in the journalctl output look like this:
Apr 15 13:45:01 mattermost CRON[1740]: pam_unix(cron:session): session opened for user root by (uid=0)
Apr 15 13:45:01 mattermost CRON[1741]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr 15 13:45:01 mattermost CRON[1740]: pam_unix(cron:session): session closed for user root
Apr 15 13:55:01 mattermost CRON[1766]: pam_unix(cron:session): session opened for user root by (uid=0)
Apr 15 13:55:01 mattermost CRON[1767]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr 15 13:55:01 mattermost CRON[1766]: pam_unix(cron:session): session closed for user root
Apr 15 13:59:48 mattermost mattermost[1075]: {"timestamp":"2023-04-15 13:59:48.820 +03:30","level":…
Apr 15 13:59:48 mattermost mattermost[1075]: {"timestamp":"2023-04-15 13:59:48.820 +03:30","level":…
Apr 15 13:59:50 mattermost mattermost[1075]: {"timestamp":"2023-04-15 13:59:50.961 +03:30","level":…
Apr 15 13:59:51 mattermost mattermost[1075]: {"timestamp":"2023-04-15 13:59:51.503 +03:30","level":…
-- Reboot --
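To check the Saturday pattern against the logs, the last timestamps seen before each freeze can be tallied by weekday. The following Python sketch uses only the three timestamps quoted in the excerpts in this thread; the list and function names are chosen here for illustration, and more entries would be needed before drawing any conclusion.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Last log timestamps before a freeze, copied from the excerpts in this thread.
LAST_SEEN = [
    "2023-04-08 13:32:05",
    "2023-04-08 21:15:04",
    "2023-04-15 13:59:51",
]


def tally_by_weekday(stamps):
    """Count how many of the given timestamps fall on each weekday."""
    days = [
        DAY_NAMES[datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").weekday()]
        for stamp in stamps
    ]
    return Counter(days)


print(tally_by_weekday(LAST_SEEN))  # Counter({'Saturday': 3})
```

All three quoted timestamps fall on a Saturday, which matches the observation above. A regular Saturday event (scheduled job, backup, snapshot, or a weekly usage peak) would be worth ruling out.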
These logs do not contain any relevant information, unfortunately. I guess you will need to enable debug logging in Mattermost; then maybe we can see the last thing it did that might have caused the crash (if it’s actually related to Mattermost at all).
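As a rough illustration of what enabling debug logging changes, the Python sketch below sets the file log level to DEBUG in a dictionary shaped like the `LogSettings` section of Mattermost's config.json. The helper name and the example dictionary are made up for illustration. On a real server the same setting can be changed in the System Console or by editing config.json (typically under the install directory, which would be /opt/mattermost/config/ here, an assumption), followed by a config reload or a restart.

```python
import json


def with_debug_logging(config: dict) -> dict:
    """Return a copy of config with file logging enabled at DEBUG level."""
    updated = dict(config)
    log_settings = dict(updated.get("LogSettings", {}))
    log_settings["EnableFile"] = True
    log_settings["FileLevel"] = "DEBUG"
    updated["LogSettings"] = log_settings
    return updated


# Illustrative fragment only, not a complete Mattermost configuration.
example = {"LogSettings": {"EnableFile": True, "FileLevel": "INFO"}}
print(json.dumps(with_debug_logging(example), indent=2))
```

Back up config.json before changing it, since a malformed file can prevent the server from starting.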
Very strange - so you’re saying that the crashes seem to be related to having push notifications turned on? It would be interesting to see whether enabling them again causes the system to crash again.
That is exactly what happened!
I enabled push notifications 2 days ago. This morning, after a push notification was sent, the server froze: no services, not even ping!
I don’t know how to find the root cause.