I spent the last three hours going through the logs and noticed that I had somehow managed to miss what could plausibly be the root cause of this issue. The same chain of events repeats in the two occurrences of the issue we already knew about, and I actually found a third instance that went unnoticed because it happened outside normal hours. Here is a summary of what I found in the logs:
These appear first, and the message repeats for every user (only three shown). We've agreed to ignore them, but I'm including them to give as much information as possible:
```
{"timestamp":"2022-10-25 14:50:00.043 -04:00","level":"error","msg":"websocket.NextReader: closing websocket","caller":"app/web_conn.go:830","user_id":" ","error":"read tcp 10.29.19.87:443->66.220.203.181:55854: i/o timeout"}
{"timestamp":"2022-10-25 14:50:00.044 -04:00","level":"error","msg":"websocket.NextReader: closing websocket","caller":"app/web_conn.go:830","user_id":" ","error":"read tcp 10.29.19.87:443->192.168.1.14:65094: i/o timeout"}
{"timestamp":"2022-10-25 14:50:00.044 -04:00","level":"error","msg":"websocket.NextReader: closing websocket","caller":"app/web_conn.go:830","user_id":" ","error":"read tcp 10.29.19.87:443->10.29.19.146:55068: i/o timeout"}
```
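For completeness on why we agreed to ignore these: as far as I can tell, the "i/o timeout" part is just what Go's net package returns when a read deadline expires before the peer sends anything, i.e. an idle or vanished websocket client rather than a server-side fault. A tiny standalone sketch (my own illustration, not Mattermost code) that produces the same error text:

```go
package main

// My own toy illustration (not Mattermost code): a read deadline that expires
// before the peer sends anything produces the same
// "read tcp <local>-><remote>: i/o timeout" error text seen above.

import (
	"fmt"
	"log"
	"net"
	"time"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	defer ln.Close()

	// A "client" that connects and then stays silent, like an idle websocket peer.
	go func() {
		c, err := net.Dial("tcp", ln.Addr().String())
		if err != nil {
			return
		}
		defer c.Close()
		time.Sleep(2 * time.Second)
	}()

	conn, err := ln.Accept()
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Give the peer 500ms to send something; the read then fails.
	conn.SetReadDeadline(time.Now().Add(500 * time.Millisecond))
	buf := make([]byte, 1)
	if _, err := conn.Read(buf); err != nil {
		fmt.Println(err) // e.g. read tcp 127.0.0.1:...->127.0.0.1:...: i/o timeout
	}
}
```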
Here is the interesting part: these start to show up (multiple times) right when the service goes down:
```
{"timestamp":"2022-10-25 14:50:00.152 -04:00","level":"error","msg":"http2: received GOAWAY [FrameHeader GOAWAY len=20], starting graceful shutdown","caller":"log/log.go:195","source":"httpserver"}
{"timestamp":"2022-10-25 14:50:00.153 -04:00","level":"error","msg":"http2: received GOAWAY [FrameHeader GOAWAY len=20], starting graceful shutdown","caller":"log/log.go:195","source":"httpserver"}
{"timestamp":"2022-10-25 14:50:00.155 -04:00","level":"error","msg":"http2: received GOAWAY [FrameHeader GOAWAY len=20], starting graceful shutdown","caller":"log/log.go:195","source":"httpserver"}
{"timestamp":"2022-10-25 14:50:00.161 -04:00","level":"error","msg":"http2: received GOAWAY [FrameHeader GOAWAY len=20], starting graceful shutdown","caller":"log/log.go:195","source":"httpserver"}
```
Followed by these (also multiple times, from different client IPs):
```
{"timestamp":"2022-10-25 14:50:00.442 -04:00","level":"error","msg":"http: TLS handshake error from clientIP:49336: EOF","caller":"log/log.go:195","source":"httpserver"}
{"timestamp":"2022-10-25 14:50:00.443 -04:00","level":"error","msg":"http: TLS handshake error from clientIP:58159: write tcp Mattermost_IP:443->IP:58159: write: broken pipe","caller":"log/log.go:195","source":"httpserver"}
{"timestamp":"2022-10-25 14:50:00.443 -04:00","level":"error","msg":"http: TLS handshake error from clientIP:58160: write tcp Mattermost_IP:443->clientIP:58160: write: broken pipe","caller":"log/log.go:195","source":"httpserver"}
{"timestamp":"2022-10-25 14:50:00.444 -04:00","level":"error","msg":"http: TLS handshake error from clientIP:56652: write tcp Mattermost_IP->clientIP:56652: write: broken pipe","caller":"log/log.go:195","source":"httpserver"}
```
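Out of curiosity I tried to reproduce that last class of messages locally. The throwaway Go snippet below (entirely my own test, nothing Mattermost-specific) just opens a TCP connection to an HTTPS port and hangs up before finishing the handshake, which is enough to make Go's net/http server print the same kind of "http: TLS handshake error ..." line. To me that reads like clients (or whatever sits in front of them) giving up on connections mid-handshake while the service is unresponsive, but I may be wrong:

```go
package main

// My own throwaway reproduction (not Mattermost code): a client that opens a
// TCP connection to an HTTPS port and hangs up before completing the TLS
// handshake makes Go's net/http server log an
// "http: TLS handshake error from <ip>:<port>: ..." line on stderr.

import (
	"log"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// Throwaway HTTPS server with a self-signed certificate.
	srv := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {}))
	defer srv.Close()

	// Connect at the TCP level and hang up without ever completing a ClientHello.
	conn, err := net.Dial("tcp", srv.Listener.Addr().String())
	if err != nil {
		log.Fatal(err)
	}
	conn.Close()

	// Give the server goroutine a moment to attempt the handshake and log the error.
	time.Sleep(500 * time.Millisecond)
}
```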
After a few minutes these messages stop appearing and the service becomes available again, but no other log message shows the Mattermost server actually stopping or re-initializing. In other words, there is no sign in the logs of the service stopping or starting back up after the issue "stops".
What could possibly trigger these http2 GOAWAY frames? Are they coming from the outside or from Mattermost itself?
I have many questions, but the ones above are probably the most relevant. I tried to make sense of RFC 7540 in hopes of better understanding HTTP/2, but I must confess the language used there is a bit too advanced for me.
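One thing I did piece together while reading around golang.org/x/net/http2 (which, as far as I understand, is what backs HTTP/2 in Go's standard net/http server): the "received GOAWAY ... starting graceful shutdown" text looks like it is logged by the HTTP/2 server when the peer on the other end of the connection sends a GOAWAY frame. Below is a small local experiment, entirely my own toy setup and nothing Mattermost-specific, that reproduces that log line by sending a GOAWAY from the client side. If I'm reading this right, it would mean the frames come from whatever sits on the client side of those connections (browsers, or a proxy/load balancer in front of Mattermost) rather than from Mattermost itself, but I'd appreciate confirmation from someone who knows that code better.

```go
package main

// A local toy experiment, entirely my own setup (no Mattermost code): start an
// HTTP/2 server with golang.org/x/net/http2, then hand-craft a client that
// sends it a GOAWAY frame. The server logs an
// "http2: received GOAWAY ..., starting graceful shutdown" line and stops
// accepting new streams on that connection.

import (
	"log"
	"net"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	defer ln.Close()

	// Serve HTTP/2 without TLS (h2c-style) on every accepted connection,
	// just to keep the toy setup small.
	h2srv := &http2.Server{}
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return
			}
			go h2srv.ServeConn(c, &http2.ServeConnOpts{})
		}
	}()

	// Client side: raw HTTP/2 framing by hand.
	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Every HTTP/2 client connection starts with this fixed preface,
	// followed by a SETTINGS frame.
	if _, err := conn.Write([]byte(http2.ClientPreface)); err != nil {
		log.Fatal(err)
	}
	fr := http2.NewFramer(conn, conn)
	if err := fr.WriteSettings(); err != nil {
		log.Fatal(err)
	}

	// Send GOAWAY with a non-zero error code; on receipt the server logs the
	// "received GOAWAY ... starting graceful shutdown" message.
	if err := fr.WriteGoAway(0, http2.ErrCodeEnhanceYourCalm, []byte("client going away")); err != nil {
		log.Fatal(err)
	}

	time.Sleep(500 * time.Millisecond) // let the server goroutine process and log
}
```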