Constant websocket errors

Summary
Constant websocket errors on version 7.0.1.

Steps to reproduce
Since updating to version 7.0.0 about two weeks ago, our server has constantly been getting this websocket error. I’ve checked the logs prior to the update, and there was no instance of this error before. Over the weekend we updated to version 7.0.1 in an attempt to fix it, but with no luck. We were previously on version 6.3 and upgraded straight to 7.0 following the recommended guide, with nothing special and without changing any config manually. We have since only added the hostname for the new Calls plugin.

I’ve tried looking for solutions, but couldn’t find anything of relevance. The latest report I found was related to a config setting that has since been deprecated (reliable websockets).

Here is a sample of the error:
{"timestamp":"2022-06-27 18:59:36.184 Z","level":"error","msg":"websocket.NextReader: closing websocket","caller":"app/web_conn.go:834","user_id":"rrwhycr46fgb5m143xcpqdoxma","error":"read tcp 127.0.0.1:8080->127.0.0.1:56834: i/o timeout"}

Expected behavior
Since getting those websocket errors, our users have begun complaining that Mattermost doesn’t work as expected, with images of teams and users not loading, and even channels and threads not loading at all.

Here is a sample screenshot that a user sent us, with none of the channels loading.

I’m also attaching the last 2k lines of the Mattermost log (I can provide more if necessary). As can be seen, the error is pretty constant, with the exception of weekends, when almost no one is using the server.
mattermost.log

Mattermost is hosted on AWS on a fairly beefy machine for our use case (t2.large) with only 50 users, so it never goes beyond 2 GB of RAM usage or 10% CPU.

Hi @Podesta - were your log levels set to Info? We made a change in 6.6 (Use error log level for websocket timeouts (#19609) · mattermost/mattermost-server@9534efe · GitHub) which logs timeout errors at the error level. They were previously always logged at the debug level, so nothing really changed except the log level.

This log line appears whenever the server fails to read any message within a given timeout, at which point it disconnects the websocket connection. Normally a constant stream of ping/pong messages keeps the connection alive and prevents the timeout from being hit, and there is no reason for the ping/pong to stop. So this is indeed not expected to happen, hence the error level.
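To illustrate the mechanism, here is a rough sketch using the gorilla/websocket API (this is not the actual Mattermost server code; pongWait is a hypothetical stand-in for the server’s timeout):

package example

import (
    "time"

    "github.com/gorilla/websocket"
)

// readLoop illustrates how a read deadline combined with a pong handler
// produces the "i/o timeout" error when the ping/pong stream stops.
func readLoop(conn *websocket.Conn, pongWait time.Duration) error {
    conn.SetReadDeadline(time.Now().Add(pongWait))
    conn.SetPongHandler(func(string) error {
        // Every pong from the client pushes the deadline forward.
        return conn.SetReadDeadline(time.Now().Add(pongWait))
    })
    for {
        // If nothing (not even a pong control frame) arrives before the
        // deadline, NextReader returns a net error like
        // "read tcp ...: i/o timeout" and the caller closes the websocket.
        if _, _, err := conn.NextReader(); err != nil {
            return err
        }
    }
}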

I am wondering whether you are using an AWS CLB/ALB in front? We have seen issues with these categories of load balancers: the idle timeout settings on the AWS side and on the Mattermost server need to be set appropriately, otherwise the load balancer will keep disconnecting the connection: https://sigopt.com/blog/the-case-of-the-mysterious-aws-elb-504-errors/.

When this happens, the server’s timeout gets triggered because nothing is sending any data to it, so it disconnects the connection from its side as well and logs this error.

I’d suggest checking the load balancer settings and verifying that everything is in order.
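If it turns out an ALB is in the path, its idle timeout can be inspected and adjusted via the AWS CLI; a rough sketch (the load balancer ARN below is a placeholder):

# Inspect the ALB's current idle timeout (ARN is a placeholder).
aws elbv2 describe-load-balancer-attributes \
    --load-balancer-arn arn:aws:elasticloadbalancing:<region>:<account>:loadbalancer/app/<name>/<id>

# Set it so it is not longer than the idle timeout on the Mattermost side.
aws elbv2 modify-load-balancer-attributes \
    --load-balancer-arn arn:aws:elasticloadbalancing:<region>:<account>:loadbalancer/app/<name>/<id> \
    --attributes Key=idle_timeout.timeout_seconds,Value=60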

I have the same issue, but with a personal server setup with an Apache HTTPD router in front of my sites (one of them being Mattermost). In my case I see this happening between my HTTPD frontend server (which also does the SSL termination) and the Mattermost server.

websocket.NextReader: closing websocket caller="app/web_conn.go:818" user_id=USERID error="read tcp mattermostInternalIp:8065->HTTPDFrontingWebserver:40181: i/o timeout"

This also started happening with a recent upgrade, although it might have been before the 7.0 one. I’m running:

Mattermost Version: 6.7.0
Database Schema Version: 82
Database: mysql
Build Hash: 714d065986fa9f6163b417d9760bf707cc237c71
EE Build Hash: ba596678562c7f9ef1a51bfa7b4dcbb8e5c0e937
Webapp Build Hash: cf2606b483a1ed484fcb64175392cace95014cc5
Build Date: Thu May 12 12:44:55 UTC 2022

This issue only happens on mobile and desktop; web browsers don’t have the same problem with the slow response and the channels sidebar not loading. I’ll review the timeout settings, but I’m also not sure why this would have changed recently without those two server configs changing (although obviously Mattermost was upgraded).

That’s a long article. I checked my idle timeout settings, and both Mattermost and the AWS ALB are set to 60 seconds. The article seems to suggest the AWS ALB timeout should be less than your target’s (Mattermost in this case). Is that how you understand it? For example, should I set the AWS ELB to 59 and leave Mattermost at 60? Or maybe 50 and 60 to be safe?

Correct. I think it should be safe to double it: keep the ELB at 60 and Mattermost at 120 and see what happens.
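Assuming your config.json has the standard ServiceSettings.IdleTimeout key (the value is in seconds), that would look roughly like:

"ServiceSettings": {
    "IdleTimeout": 120
}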


Thanks for the detailed reply.

Indeed, our log level was set to INFO, which could explain the lack of errors prior to the update.

Unfortunately, we are not using load-balancers, so that shouldn’t be our issue.

Users used to occasionally report, and continue to do so, failures to send messages (with the retry/cancel option next to them), which we used to dismiss. But if the issue we are having is not new and was just hidden by the log level, they might be related.

Users also reported a few unexpected disconnects, with "session expired" and "limit exceeded" messages.

We have now enabled logging at DEBUG level, in hopes of better pinpointing what might be the issue.
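For reference, this is roughly what we changed in config.json (assuming the default LogSettings keys):

"LogSettings": {
    "EnableConsole": true,
    "ConsoleLevel": "DEBUG",
    "EnableFile": true,
    "FileLevel": "DEBUG"
}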

(screenshots of the error messages are attached)

There are a lot of unrelated issues here, and we shouldn’t confuse them with the websocket issue.

The limit exceeded error is due to rate limiting and is unrelated. The failure to send messages is also not related to the websocket, as that’s an API call.
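If the rate-limit messages themselves become a problem for your users, the relevant knobs live under RateLimitSettings in config.json. A sketch with example values, assuming a standard config:

"RateLimitSettings": {
    "Enable": true,
    "PerSec": 10,
    "MaxBurst": 100,
    "MemoryStoreSize": 10000,
    "VaryByRemoteAddr": true,
    "VaryByUser": false
}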

If you are not using load balancers, can you think of something that could stop websocket messages from reaching the server? Because that’s the only thing that would trigger this.

I also have lots of these i/o timeout error messages in the log file with self-hosted Mattermost E10 v7.0.1.

error [2022-07-13 10:59:37.016 +02:00] websocket.NextReader: closing websocket caller="app/web_conn.go:834" user_id=ji86zwegt7rbfqgfxbs1niqd6y error="read tcp 127.0.0.1:8065->127.0.0.1:51470: i/o timeout"

I contacted support and hope this problem will be fixed soon.

Hello @TI-Admin - as noted before, the timeout error was always being logged at the debug level; only a recent change in 6.6 started logging it at the error level. There have been no code changes that would cause this.

Something in your infrastructure is causing this - please have a read through Constant websocket errors - #2 by agnivade.

@Podesta - You mentioned that you are not using any load balancers. However, from the log message I see 127.0.0.1:8080->127.0.0.1:56834, so something on the same host is connecting to Mattermost on port 8080. Can you clarify what this is?

I’ve read this, but we do not use a load balancer, and since the infrastructure has not changed since we started using Mattermost almost two years ago, I have no clue how to fix this problem.

@TI-Admin - Your log line says 127.0.0.1:8065->127.0.0.1:51470: i/o timeout. So it seems like something at 127.0.0.1 is connecting to Mattermost? Do you know what that is?

If it was a direct client connection, I’d expect the client IP to be something other than 127.0.0.1.

I thought you could tell me. The log is full of those errors, and the ports range from 127.0.0.1:32768 up to 127.0.0.1:60998. Is it possibly related to nginx, which is used as a proxy?

Yes, please check your nginx logs (/var/log/nginx/error.log). Do you see any "upstream connection failed" messages or something similar in there?
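For example, assuming the default log location, something like this should surface them:

grep -iE 'failed|upstream' /var/log/nginx/error.log | tail -n 50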

Yes, there are lots of GET and POST errors like these:

2022/07/15 04:00:28 [error] 853417#853417: *43108 connect() failed (111: Connection refused) while connecting to upstream, client: 94.134.231.35, server: wetalk.informatik.uni-kiel.de, request: "GET /api/v4/websocket?connection_id=kcu6trax1pb7pk8m3pusrfszar&sequence_number=21 HTTP/1.1", upstream: "http://127.0.0.1:8065/api/v4/websocket?connection_id=kcu6trax1pb7pk8m3pusrfszar&sequence_number=21", host: "wetalk.informatik.uni-kiel.de"

or

2022/07/15 04:00:30 [error] 853417#853417: *42470 connect() failed (111: Connection refused) while connecting to upstream, client: 134.245.253.153, server: wetalk.informatik.uni-kiel.de, request: "POST /api/v4/users/status/ids HTTP/2.0", upstream: "http://127.0.0.1:8065/api/v4/users/status/ids", host: "wetalk.informatik.uni-kiel.de"

Yeah, that’s the problem here. The upstream (the server behind your nginx, i.e. Mattermost) is refusing the connection for whatever reason.

Can you share the upstream configuration of your nginx? Maybe there’s a connection limit or maximum there which is getting reached, so that further connections are refused; it’s hard to tell without seeing the configuration.
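You could also check whether the Mattermost process is still listening on port 8065 at the moment the errors appear; for example:

# Is anything listening on the Mattermost port right now?
sudo ss -ltnp | grep ':8065'

# Is the service itself up? (the unit name may differ on your install)
systemctl status mattermost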

Sadly this is not the real problem.
All these errors happen at 4:00, when I shut down Mattermost to back up the DB.

The Mattermost i/o websocket timeouts appear at different times, and for those there are no corresponding nginx errors logged.

I posted my nginx configuration to support, but their answer was: "So in this case, unless you are having an issue in your Mattermost, otherwise you can actually ignore this error message."

Basically it says: ignore error logs in your Mattermost logfile, which is not what I expected. Why are they logged at all, then?

Ah, OK - but still: could you share the relevant nginx configuration?

Thanks for trying to help!

upstream backend {
server 127.0.0.1:8065;
keepalive 32;
}

proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=mattermost_cache:10m max_size=3g inactive=120m use_temp_path=off;

server {
server_name xxx…uni-kiel.de;

http2_push_preload on; # Enable HTTP/2 Server Push

location ~ /api/v[0-9]+/(users/)?websocket$ {
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    client_max_body_size 50M;
    proxy_set_header Host $http_host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header X-Frame-Options SAMEORIGIN;
    proxy_buffers 256 16k;
    proxy_buffer_size 16k;
    client_body_timeout 60;
    send_timeout 300;
    lingering_timeout 5;
    proxy_connect_timeout 90;
    proxy_send_timeout 300;
    proxy_read_timeout 90s;
    proxy_http_version 1.1;
    proxy_pass http://backend;
}

location / {
    if (-f $document_root/maintenance.on) {
        return 503;
    }

    client_max_body_size 50M;
    proxy_set_header Connection "";
    proxy_set_header Host $http_host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header X-Frame-Options SAMEORIGIN;
    proxy_buffers 256 16k;
    proxy_buffer_size 16k;
    proxy_read_timeout 600s;
    proxy_cache mattermost_cache;
    proxy_cache_revalidate on;
    proxy_cache_min_uses 2;
    proxy_cache_use_stale timeout;
    proxy_cache_lock on;
    proxy_http_version 1.1;
    proxy_pass http://backend;
}

error_page 503 @maintenance_page;

location @maintenance_page {
    rewrite ^(.*)$ /maintenance_page.html break;
}

listen 443 ssl http2; # managed by Certbot
ssl_certificate /etc/letsencrypt/live/aaa.uni-kiel.de/fullchain.pem; # managed by Certbot
ssl_certificate_key /etc/letsencrypt/live/aaa.uni-kiel.de/privkey.pem; # managed by Certbot
# include /etc/letsencrypt/options-ssl-nginx.conf; # managed by Certbot
ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem; # managed by Certbot

ssl_session_timeout 1d;

# Enable TLS versions (TLSv1.3 is required upcoming HTTP/3 QUIC).
ssl_protocols TLSv1.2 TLSv1.3;

# Enable TLSv1.3's 0-RTT. Use $ssl_early_data when reverse proxying to
# prevent replay attacks.
#
# @see: https://nginx.org/en/docs/http/ngx_http_ssl_module.html#ssl_early_data
ssl_early_data on;

ssl_ciphers 'ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256';
ssl_prefer_server_ciphers on;
ssl_session_cache shared:SSL:50m;
# HSTS (ngx_http_headers_module is required) (15768000 seconds = 6 months)
add_header Strict-Transport-Security max-age=15768000;
# OCSP Stapling ---
# fetch OCSP records from URL in ssl_certificate and cache them
ssl_stapling on;
ssl_stapling_verify on;

}

server {
if ($host = xxx.uni-kiel.de) {
return 301 https://$host$request_uri;
} # managed by Certbot

listen 80 default_server;
server_name xxx.uni-kiel.de;
return 404; # managed by Certbot

}

How many users do you have and how busy is your server? Limiting the backend to 32 keepalive connections shouldn’t be necessary here, I think - can you try removing this limitation, or is there a specific reason for it? I just saw that this is also part of the official documentation, and in the nginx configuration of the latest Docker containers it even defaults to 64. Besides that, this is the default nginx config from the documentation, so I think it should be fine.
I’m not sure, though, whether nginx still counts established websockets against the keepalive limit - I’d have to check on that.
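If you want to experiment, bumping it to the Docker default would look like this (just a suggestion, not a guaranteed fix):

upstream backend {
    server 127.0.0.1:8065;
    keepalive 64;
}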

Since you also described that images and CSS are not being loaded: these should be served from the nginx cache here and will usually not be forwarded to Mattermost. Do you also experience the same problems with the webapp, and if so, can you capture the output of the developer console (network view) when it happens?