Constant websocket errors

I did not say anything about images or CSS not being loaded; that must have been another user.
Basically, no real Mattermost malfunction has been reported, just a load of i/o timeout errors in the logfile.
There must be a reason for them; simply ignoring them feels like a band-aid.

I copied the nginx config from the documentation, so it includes keepalive 32.

We have a maximum of 1500 users, but only around 300 are active per day.

Yes, you’re right, sorry - I mixed that up with the information from the OP here.
My logs are also full of messages like these, roughly one every two minutes.

80k occurrences over the past five weeks or so…
My setup does not include nginx, so that is probably not the culprit here, but to diagnose the issue further I wrote all currently open connections to 127.0.0.1:8065 into a file:

netstat -anop | grep :8065 > /tmp/conn.txt

Then I watched the tail -f mattermost.log output for the next i/o timeout error, which happened a few seconds later, and saw that the connection that got killed was an idle keepalive connection:

{"timestamp":"2022-07-15 16:18:27.045 +02:00","level":"error","msg":"websocket.NextReader: closing websocket","caller":"app/web_conn.go:834","user_id":"7zzzd5sngpf6bb3nmaugxossma","error":"read tcp 127.0.0.1:8065->127.0.0.1:51178: i/o timeout"}

$ grep 51178 /tmp/conn.txt
conn.txt:tcp        0      0 127.0.0.1:8065          127.0.0.1:51178         ESTABLISHED 2702/mattermost      keepalive (1.10/0/0)
conn.txt:tcp        0      0 127.0.0.1:51178         127.0.0.1:8065          ESTABLISHED 85313/haproxy        off (0.00/0/0)

After digging a bit deeper, it looks like the websockets are getting killed when there is no pong in response to the ping (per the constants below, a ping is sent every pingInterval (60s) and the read side times out if no pong or other data arrives within pongWaitTime (100s)), but that's probably not the whole picture; I'm still trying to understand how this fits together.

The corresponding code is in app/web_conn.go:

const (
        sendQueueSize          = 256
        sendSlowWarn           = (sendQueueSize * 50) / 100
        sendFullWarn           = (sendQueueSize * 95) / 100
        writeWaitTime          = 30 * time.Second
        pongWaitTime           = 100 * time.Second
        pingInterval           = (pongWaitTime * 6) / 10
        authCheckInterval      = 5 * time.Second
        webConnMemberCacheTime = 1000 * 60 * 30 // 30 minutes
        deadQueueSize          = 128            // Approximated from /proc/sys/net/core/wmem_default / 2048 (avg msg size)
)

[...]

func (wc *WebConn) readPump() {
        defer func() {
                wc.WebSocket.Close()
        }()
        wc.WebSocket.SetReadLimit(model.SocketMaxMessageSizeKb)
        wc.WebSocket.SetReadDeadline(time.Now().Add(pongWaitTime))
        wc.WebSocket.SetPongHandler(func(string) error {
                if err := wc.WebSocket.SetReadDeadline(time.Now().Add(pongWaitTime)); err != nil {
                        return err
                }
                if wc.IsAuthenticated() {
                        wc.App.Srv().Go(func() {
                                wc.App.SetStatusAwayIfNeeded(wc.UserId, false)
                        })
                }
                return nil
        })

        for {
                msgType, rd, err := wc.WebSocket.NextReader()
                if err != nil {
                        wc.logSocketErr("websocket.NextReader", err)
                        return
                }
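
The excerpt above only covers the read side. For context, the ping-sending side typically looks roughly like the sketch below (a generic gorilla/websocket ping loop, not the actual Mattermost writePump; conn and done are placeholders, and it assumes the time and github.com/gorilla/websocket imports plus the constants quoted above). A ticker fires every pingInterval; if the peer never answers, the read deadline set in readPump expires after pongWaitTime and NextReader returns the i/o timeout seen in the logs.

func pingLoop(conn *websocket.Conn, done <-chan struct{}) {
        ticker := time.NewTicker(pingInterval)
        defer ticker.Stop()
        for {
                select {
                case <-ticker.C:
                        // Send a ping control frame; the peer's pong resets the
                        // read deadline via the pong handler in readPump.
                        deadline := time.Now().Add(writeWaitTime)
                        if err := conn.WriteControl(websocket.PingMessage, nil, deadline); err != nil {
                                return
                        }
                case <-done:
                        return
                }
        }
}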

It would be interesting to see whether a custom build with increased or decreased ping/pong timers improves the situation.
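
For illustration, such an experiment would only need to touch the const block quoted above, e.g. (the values below are an example for a test build, not a recommendation):

const (
        writeWaitTime = 30 * time.Second
        pongWaitTime  = 200 * time.Second       // doubled from 100s for the experiment
        pingInterval  = (pongWaitTime * 6) / 10 // scales with pongWaitTime: 120s
)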

@TI-Admin - As per your post here: Constant websocket errors - #13 by TI-Admin, you mentioned you are not using load balancers, but it seems you are indeed using one.

I have outlined the problem (why it’s happening) and the impact in this comment: Constant websocket errors - #2 by agnivade.

You are not noticing any user impact because the connection gets reconnected right away. But there is an issue nonetheless, otherwise the code wouldn’t log it as an error.

As a first step, I’d suggest adjusting the idle timeouts to what is suggested here: Constant websocket errors - #7 by agnivade, to see if that helps.
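
For reference, the idle-timeout knobs in a typical nginx websocket proxy block are proxy_read_timeout and proxy_send_timeout. The sketch below only shows where they live; the values are placeholders, not the ones from the linked post:

location ~ /api/v4/websocket {
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 300s;   # idle time allowed between reads from the upstream
    proxy_send_timeout 300s;   # idle time allowed between writes to the upstream
    proxy_pass http://backend;
}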

Do you consider an nginx proxy a load balancer? If not, then I’m using NO load balancer at all.
So which idle timeout settings are you referring to regarding my setup?

@agnivade my log is also full of these errors, and this is self-hosted (no AWS ELB). I do have haproxy as a reverse proxy/load balancer in front of Mattermost and have tried to tune all possible timeout values, but the situation did not change; I’m still getting lots of these messages, so I’d like to understand which timeouts to tune on which end to get rid of them.

Which timeout values can be fine-tuned directly on the Mattermost server in this regard? The code above suggests that most of them are hardcoded; am I missing some options?

@TI-Admin - Sorry for any confusion, I’m just trying to help. What I meant was: do you have anything in front of your Mattermost instance? And nginx can be used as a load balancer as well.

Anyway, coming to the issue: I don’t know what the exact solution is myself. All we know is that nginx is disconnecting one half of the connection (client side) and keeping the other half (server side) alive for some specific reason. When this happens, the server side no longer receives any data and eventually disconnects the connection.

The blog post I linked to earlier mentions that the idle timeout of the proxy/load-balancer should be half of the idle timeout of the server. But @agriesser, if even that doesn’t help, we’ll need to dig further into this.
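
For a haproxy-based setup like @agriesser's, the knobs in play would be roughly the following; note that once a connection is upgraded to a websocket, haproxy applies timeout tunnel instead of timeout client/server. The values are placeholders only, not a recommendation:

defaults
    timeout connect 5s
    timeout client  50s
    timeout server  50s
    timeout tunnel  50s   # idle timeout for upgraded (websocket) connections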

Somehow, some nginx setting is causing this to happen.

I’d love to dig further into this and will try to set up a dedicated demo environment where I can reproduce the problem. I’ll let you know if I succeed; if I can’t, I’ll replicate the settings from my production environment step by step until the issue starts happening (if at all…).


Wonderful, thank you @agriesser

On my installation, all users are affected, but not all of them to the same extent.

/opt/mattermost/logs# sed -n 's/^.*websocket.NextReader: closing websocket.*user_id":"\(.*\)","error.*/\1/p' mattermost.log  | sort | uniq -c | sort -rn
  17378 mh8cpr6
   6361 ohi4em7e
   2380 rci5uaf8m
   1946 8eqkwgzb
   1583 h3g135wfj
   1577 5if8s5mef
   1563 9mouf11z
   1380 kzgzrbmji
   1317 fix4z134a
   1135 eyworaz4
   1115 pbn935bf
    955 zgfsxq7x
    953 xyz7jwdf
    925 i98me8y
    907 t6xjdqop3f

My user account had only around 150 such entries in the log, so I looked into those, and this is what I see for the last few days:

{"timestamp":"2022-07-11 06:15:31.866 +02:00",
{"timestamp":"2022-07-11 14:55:54.248 +02:00",
{"timestamp":"2022-07-11 14:59:25.462 +02:00",
{"timestamp":"2022-07-11 15:04:32.605 +02:00",
{"timestamp":"2022-07-11 15:04:33.636 +02:00",
{"timestamp":"2022-07-11 15:06:19.389 +02:00",
{"timestamp":"2022-07-11 17:46:13.546 +02:00",
{"timestamp":"2022-07-12 05:53:38.584 +02:00",
{"timestamp":"2022-07-13 06:25:25.948 +02:00",
{"timestamp":"2022-07-13 06:43:48.905 +02:00",
{"timestamp":"2022-07-13 06:44:27.865 +02:00",
{"timestamp":"2022-07-13 21:01:40.897 +02:00",
{"timestamp":"2022-07-14 05:51:45.557 +02:00",
{"timestamp":"2022-07-14 06:28:15.350 +02:00",
{"timestamp":"2022-07-14 17:17:50.615 +02:00",
{"timestamp":"2022-07-14 20:03:41.561 +02:00",
{"timestamp":"2022-07-14 20:35:31.844 +02:00",
{"timestamp":"2022-07-15 00:24:29.425 +02:00",
{"timestamp":"2022-07-15 00:31:13.750 +02:00",
{"timestamp":"2022-07-15 07:33:10.310 +02:00",
{"timestamp":"2022-07-15 17:19:57.594 +02:00",
{"timestamp":"2022-07-15 17:35:54.330 +02:00",
{"timestamp":"2022-07-15 18:53:51.199 +02:00",
{"timestamp":"2022-07-15 22:34:50.884 +02:00",
{"timestamp":"2022-07-16 07:07:49.520 +02:00",
{"timestamp":"2022-07-16 07:46:27.055 +02:00",
{"timestamp":"2022-07-16 15:48:36.481 +02:00",
{"timestamp":"2022-07-18 05:41:34.132 +02:00",
{"timestamp":"2022-07-18 06:52:01.837 +02:00",
{"timestamp":"2022-07-19 05:59:29.288 +02:00",

I compared that with my calendar, and all of these log entries were caused by the mobile client on my iPhone when it lost its connection or when I was roaming (leaving the office, switching between offices and devices, etc.). There is not a single log entry during the day while I’m connected via WiFi or LAN, as long as I don’t move and the client doesn’t lose its connection. One entry always shows up when I put my workstation into hibernation, so the timeouts on the reverse proxy seem to be fine for me; I do need to talk to the guy with the 17k disconnects though, not sure what’s going on there :slight_smile:

Anyway, with this data I was able to reproduce the log entry.
I spun up a demo installation (7.1.1) on my server, added the account in my mobile 2.0 client, connected, and everything was fine. Then I turned off WiFi, waited for about 3 minutes, and bam:

{"timestamp":"2022-07-19 10:33:00.810 +02:00","level":"error","msg":"websocket.NextReader: closing websocket","caller":"app/web_conn.go:830","user_id":"u8ggjxjp17dejxfhe9fsue5ssh","error":"read tcp 127.0.0.1:8065->127.0.0.1:51704: i/o timeout"}

Just to clarify, your initial command would capture normal websocket disconnects as well, which might skew the results. We are only interested in the timeouts.
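
For example, filtering on the timeout first would give a cleaner per-user count (a sketch based on the one-liner above, assuming the same log format):

grep "i/o timeout" mattermost.log \
  | sed -n 's/^.*websocket.NextReader: closing websocket.*user_id":"\(.*\)","error.*/\1/p' \
  | sort | uniq -c | sort -rn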

Yes, turning off WiFi would cause this, because there is no graceful TCP shutdown; both sides of the connection (client and server) still think there’s an active connection, but no data gets sent.

In that case, this is expected and not an error.

I am wondering if all customers are running into this scenario, and not something else. I had assumed that this was an edge-case, but it seems to be pretty common?

Yes, but the entries that are not i/o timeout related are negligible:

# grep "websocket.NextReader: closing websocket.*" mattermost.log  | grep -c "i/o timeout"
87221
# grep "websocket.NextReader: closing websocket.*" mattermost.log  | grep -vc "i/o timeout"
29

I can confirm what others in this thread have said: if the problems existed in the past, we did not see them because they were not visible at log level info, so it’s hard to tell whether something really changed or whether an underlying problem has simply become more visible.
Also, if this happens at such a frequency and is normal behaviour, I think the log level should be changed again to hide this information from users.

@agriesser - I want to test something. Can you show the output of sysctl -a | grep tcp_keepalive on the host where nginx is set up?

Sure, here you go (but please remember, I’m not using nginx here; my setup uses haproxy, but that obviously shouldn’t matter):

# sysctl -a | grep tcp_keepalive
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200

Yep, understood. Thanks for the clarification.

So I can see why it’s happening. The keepalive timeout of the kernel is higher than the websocket idle timeout of Mattermost. So the kernel keeps the connection alive in case of a wifi disconnect, but the server will disconnect it sooner.
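
To put numbers on it: with the defaults above, the kernel only starts probing an idle connection after tcp_keepalive_time = 7200 s and then sends up to 9 probes 75 s apart, so it only gives up after roughly 7200 + 9 × 75 ≈ 7875 s (about 2 h 11 m). Mattermost’s read deadline, by contrast, fires after pongWaitTime = 100 s, so the server side times out long before the kernel would notice a dead peer.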

As an experiment, could you set the values to the following and test it out again:

net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 1
net.ipv4.tcp_keepalive_time = 30

Feel free to reset the values to their originals afterwards.

EDIT: You’ll need to do a sysctl -p to apply the changes.

/etc/sysctl.d# sysctl -p 98-mattermost.conf
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 1
net.ipv4.tcp_keepalive_time = 30

Last i/o timeout log entry at 07:45:34.263; I will monitor for additional logs now and report back whether that helped. Thanks!


Unfortunately the changes did not help, unless I would have to restart the server for them to take effect?

{"timestamp":"2022-07-20 09:18:12.585 +02:00","level":"error","msg":"websocket.NextReader: closing websocket","caller":"app/web_conn.go:830","user_id":"asfbtto11irrxetocup7iyn4pr","error":"read tcp 127.0.0.1:8065->127.0.0.1:51644: i/o timeout"}

A business day with these new settings shows no change in the numbers (date plus number of occurrences over the last 10 days; one way to produce such a count is sketched after the list):

2022-07-10: 1806
2022-07-11: 3758
2022-07-12: 2929
2022-07-13: 2398
2022-07-14: 2331
2022-07-15: 1768
2022-07-16: 1407
2022-07-17: 664
2022-07-18: 2251
2022-07-19: 1817
2022-07-20: 1214
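
For reference, a per-day count like the one above can be produced with something along these lines (assuming the log format shown earlier; output formatting aside):

grep "i/o timeout" mattermost.log \
  | sed -n 's/^{"timestamp":"\([0-9-]*\) .*/\1/p' \
  | sort | uniq -c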

Interesting. Do you have haproxy and Mattermost on the same host?

Yes, everything is on one server: haproxy, Mattermost, PostgreSQL, Elasticsearch, the data directory, etc.

I’ve now restarted Mattermost, just in case the settings were still inherited by some process holding the old values (you never know). Will check back tomorrow.

No change (as expected, to be honest).

2022-07-20: 1851
2022-07-21: 575

We could also sync up on the community chat for shorter turnaround times if you want me to test-drive things; just let me know.