I spent the last 3 hours looking through the logs and noticed that I had somehow missed what could be the root cause of this issue. The same chain of events repeats in both of the occurrences we experienced; in fact, I found a third instance that went unnoticed because it happened outside of normal hours. Here is a summary of what I found in the logs:
These appear first, and the message repeats for every user (only showing 3 messages). We've agreed to ignore them, so I'm including them only to give as much info as possible:
{"timestamp":"2022-10-25 14:50:00.043 -04:00","level":"error","msg":"websocket.NextReader: closing websocket","caller":"app/web_conn.go:830","user_id":" ","error":"read tcp 10.29.19.87:443->66.220.203.181:55854: i/o timeout"}
{"timestamp":"2022-10-25 14:50:00.044 -04:00","level":"error","msg":"websocket.NextReader: closing websocket","caller":"app/web_conn.go:830","user_id":" ","error":"read tcp 10.29.19.87:443->192.168.1.14:65094: i/o timeout"}
{"timestamp":"2022-10-25 14:50:00.044 -04:00","level":"error","msg":"websocket.NextReader: closing websocket","caller":"app/web_conn.go:830","user_id":" ","error":"read tcp 10.29.19.87:443->10.29.19.146:55068: i/o timeout"}
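(Side note, mostly for my own understanding: I believe these come from a read deadline expiring on the underlying TCP connection. A minimal client-side sketch with gorilla/websocket, which is what app/web_conn.go appears to be built on, produces errors with exactly that "read tcp ...: i/o timeout" shape. The URL below is just a placeholder, not our real server.)

```go
package main

import (
	"log"
	"time"

	"github.com/gorilla/websocket"
)

func main() {
	// Placeholder endpoint; any reachable websocket server will do.
	conn, _, err := websocket.DefaultDialer.Dial("wss://mattermost.example.com/api/v4/websocket", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// If the peer sends nothing before the deadline, NextReader returns a
	// net error whose message ends in "i/o timeout", like the log lines above.
	if err := conn.SetReadDeadline(time.Now().Add(2 * time.Second)); err != nil {
		log.Fatal(err)
	}
	if _, _, err := conn.NextReader(); err != nil {
		log.Printf("websocket.NextReader: %v", err)
	}
}
```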
Here is the interesting part: these start to appear (multiple times), and their timing matches when the service goes down:
{"timestamp":"2022-10-25 14:50:00.152 -04:00","level":"error","msg":"http2: received GOAWAY [FrameHeader GOAWAY len=20], starting graceful shutdown","caller":"log/log.go:195","source":"httpserver"}
{"timestamp":"2022-10-25 14:50:00.153 -04:00","level":"error","msg":"http2: received GOAWAY [FrameHeader GOAWAY len=20], starting graceful shutdown","caller":"log/log.go:195","source":"httpserver"}
{"timestamp":"2022-10-25 14:50:00.155 -04:00","level":"error","msg":"http2: received GOAWAY [FrameHeader GOAWAY len=20], starting graceful shutdown","caller":"log/log.go:195","source":"httpserver"}
{"timestamp":"2022-10-25 14:50:00.161 -04:00","level":"error","msg":"http2: received GOAWAY [FrameHeader GOAWAY len=20], starting graceful shutdown","caller":"log/log.go:195","source":"httpserver"}
Followed by (also multiple times, from different client IPs):
{"timestamp":"2022-10-25 14:50:00.442 -04:00","level":"error","msg":"http: TLS handshake error from clientIP:49336: EOF","caller":"log/log.go:195","source":"httpserver"}
{"timestamp":"2022-10-25 14:50:00.443 -04:00","level":"error","msg":"http: TLS handshake error from clientIP:58159: write tcp Mattermost_IP:443->IP:58159: write: broken pipe","caller":"log/log.go:195","source":"httpserver"}
{"timestamp":"2022-10-25 14:50:00.443 -04:00","level":"error","msg":"http: TLS handshake error from clientIP:58160: write tcp Mattermost_IP:443->clientIP:58160: write: broken pipe","caller":"log/log.go:195","source":"httpserver"}
{"timestamp":"2022-10-25 14:50:00.444 -04:00","level":"error","msg":"http: TLS handshake error from clientIP:56652: write tcp Mattermost_IP->clientIP:56652: write: broken pipe","caller":"log/log.go:195","source":"httpserver"}
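For what it's worth, I believe this last group comes from Go's standard http.Server, which logs "http: TLS handshake error from ..." whenever a TLS handshake fails, for example when a client opens the TCP connection and drops it before finishing the handshake. A rough sketch that should produce the "EOF" variant on the server side (the hostname is a placeholder):

```go
package main

import (
	"log"
	"net"
)

func main() {
	// Open a raw TCP connection to the TLS port but never send a ClientHello.
	conn, err := net.Dial("tcp", "mattermost.example.com:443")
	if err != nil {
		log.Fatal(err)
	}
	// Closing before the handshake completes should make the server log
	// "http: TLS handshake error from <addr>: EOF".
	if err := conn.Close(); err != nil {
		log.Fatal(err)
	}
}
```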
After a few minutes all of these messages stop appearing in the logs and the service is available again, but no other log messages show that the Mattermost server actually stopped or re-initialized. In other words, the logs show no sign of the service stopping or initializing after the issue "stops".
What could possibly trigger these http2 FrameHeader GOAWAY frames? Are they coming from the outside or from Mattermost itself?
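My current (possibly wrong) reading of the golang.org/x/net/http2 package is that the server logs "received GOAWAY ..., starting graceful shutdown" when a GOAWAY frame arrives from the peer on that connection, i.e. from a client or whatever sits in front of Mattermost (reverse proxy, load balancer), rather than from Mattermost itself. The sketch below just illustrates that direction by sending such a frame from a bare HTTP/2 client; the hostname, error code, and debug data are placeholders I picked, not anything from our setup.

```go
package main

import (
	"crypto/tls"
	"log"

	"golang.org/x/net/http2"
)

func main() {
	// Placeholder address; ALPN "h2" negotiates HTTP/2.
	conn, err := tls.Dial("tcp", "mattermost.example.com:443", &tls.Config{
		NextProtos: []string{"h2"},
	})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Every HTTP/2 connection starts with the client preface and a SETTINGS frame.
	if _, err := conn.Write([]byte(http2.ClientPreface)); err != nil {
		log.Fatal(err)
	}
	framer := http2.NewFramer(conn, conn)
	if err := framer.WriteSettings(); err != nil {
		log.Fatal(err)
	}

	// GOAWAY tells the peer we intend to stop using this connection.
	// A Go-based HTTP/2 server receiving it should log a line like the
	// "http2: received GOAWAY ..., starting graceful shutdown" messages above.
	if err := framer.WriteGoAway(0, http2.ErrCodeEnhanceYourCalm, []byte("client going away")); err != nil {
		log.Fatal(err)
	}
}
```

I'd appreciate a sanity check on whether that reading is correct.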
I have many questions, but the ones above are probably the most relevant. I tried to make sense of RFC 7540 in hopes of better understanding HTTP/2, but I must confess the language used there is a bit too advanced for me.