Background
We are developing a service that communicates over WebSocket, using Go 1.23.1, and have run into a problem while doing so.
On some Windows 10 machines the WebSocket connection is never established. The failure is consistent on those machines, while most Windows 10 machines are unaffected.
Phenomenon
To reproduce the issue, I wrote the minimal program below. It uses the Gin framework to handle the HTTP route and, once a client request arrives, upgrades it to a WebSocket connection with github.com/gorilla/websocket. The problem is not specific to this library: I tried other WebSocket libraries as well, and none of them avoided it.
package main

import (
	"log"
	"net/http"

	"github.com/gin-gonic/gin"
	"github.com/gorilla/websocket"
)

// upgrader accepts every origin so the reproduction works from any page.
var upgrader = websocket.Upgrader{
	CheckOrigin: func(r *http.Request) bool {
		return true
	},
}

// LogWriter forwards Gin's output to the standard logger.
type LogWriter struct{}

func (w *LogWriter) Write(data []byte) (int, error) {
	log.Printf("%s", data)
	return len(data), nil
}

// wsHandle only performs the upgrade; the connection is not used afterwards.
func wsHandle(c *gin.Context) {
	_, err := upgrader.Upgrade(c.Writer, c.Request, nil)
	if err != nil {
		log.Printf("upgrade error: %s", err)
		return
	}
}

func main() {
	gin.DefaultWriter = &LogWriter{}
	gin.DefaultErrorWriter = &LogWriter{}
	server := gin.Default()
	server.GET("/ws", wsHandle)
	if err := server.Run("127.0.0.1:10020"); err != nil {
		panic(err)
	}
}
I ran this program on an affected machine. Once the service was listening on port 10020, I opened a connection from the client with the following code.
ws = new WebSocket("ws://127.0.0.1:10020/ws")
ws.readyState then stayed at 0 (CONNECTING) and never changed to 1 (OPEN), which means the opening handshake never completed.
I used the Wireshark tool to trace the entire process of establishing the connection. The whole process of the communication is shown in the figure below.
Under normal circumstances, the server's 101 Switching Protocols response should come next. It is absent in the affected environment, which indicates that the server blocks while the connection is being established.
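The missing 101 response can also be checked without a browser or Wireshark. The following is a minimal sketch using only the standard library; the helper name probeUpgrade and the three-second timeout are my own choices for illustration, and the address and path are the ones from the reproduction above. It sends the opening handshake by hand and reports the status code, or an error if no response arrives before the deadline.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"net/http"
	"time"
)

// probeUpgrade (hypothetical helper) sends a WebSocket opening handshake to
// addr and returns the HTTP status code of the reply, or an error if no
// reply arrives within timeout.
func probeUpgrade(addr, path string, timeout time.Duration) (int, error) {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return 0, err
	}
	defer conn.Close()

	// One deadline covers both writing the request and reading the reply.
	if err := conn.SetDeadline(time.Now().Add(timeout)); err != nil {
		return 0, err
	}

	// The key is the sample nonce from RFC 6455; any base64 16-byte value works.
	req := "GET " + path + " HTTP/1.1\r\n" +
		"Host: " + addr + "\r\n" +
		"Upgrade: websocket\r\n" +
		"Connection: Upgrade\r\n" +
		"Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==\r\n" +
		"Sec-WebSocket-Version: 13\r\n\r\n"
	if _, err := conn.Write([]byte(req)); err != nil {
		return 0, err
	}

	resp, err := http.ReadResponse(bufio.NewReader(conn), nil)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	status, err := probeUpgrade("127.0.0.1:10020", "/ws", 3*time.Second)
	if err != nil {
		fmt.Println("no handshake response:", err)
		return
	}
	fmt.Println("server replied with status", status)
}
```

On a working machine I would expect this to report status 101; on an affected machine it should instead fail with a timeout error, matching what the browser and Wireshark show.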
Problem location
I then set breakpoints to locate the problem and found where the blocking occurs, in server.go of net/http (https://cs.opensource.google/go/go/+/master:src/net/http/server.go;l=689). The relevant code is shown below; the line numbers in the listing are the ones referred to in the discussion that follows.
1. func (cr *connReader) backgroundRead() {
2. n, err := cr.conn.rwc.Read(cr.byteBuf[:])
3. cr.lock()
4. if n == 1 {
5. cr.hasByte = true
6. // We were past the end of the previous request's body already
7. // (since we wouldn't be in a background read otherwise), so
8. // this is a pipelined HTTP request. Prior to Go 1.11 we used to
9. // send on the CloseNotify channel and cancel the context here,
10. // but the behavior was documented as only "may", and we only
11. // did that because that's how CloseNotify accidentally behaved
12. // in very early Go releases prior to context support. Once we
13. // added context support, people used a Handler's
14. // Request.Context() and passed it along. Having that context
15. // cancel on pipelined HTTP requests caused problems.
16. // Fortunately, almost nothing uses HTTP/1.x pipelining.
17. // Unfortunately, apt-get does, or sometimes does.
18. // New Go 1.11 behavior: don't fire CloseNotify or cancel
19. // contexts on pipelined requests. Shouldn't affect people, but
20. // fixes cases like Issue 23921. This does mean that a client
21. // closing their TCP connection after sending a pipelined
22. // request won't cancel the context, but we'll catch that on any
23. // write failure (in checkConnErrorWriter.Write).
24. // If the server never writes, yes, there are still contrived
25. // server & client behaviors where this fails to ever cancel the
26. // context, but that's kinda why HTTP/1.x pipelining died
27. // anyway.
28. }
29. if ne, ok := err.(net.Error); ok && cr.aborted && ne.Timeout() {
30. // Ignore this error. It's the expected error from
31. // another goroutine calling abortPendingRead.
32. } else if err != nil {
33. cr.handleReadError(err)
34. }
35. cr.aborted = false
36. cr.inRead = false
37. cr.unlock()
38. cr.cond.Broadcast()
39. }
40. func (cr *connReader) abortPendingRead() {
41. cr.lock()
42. defer cr.unlock()
43. if !cr.inRead {
44. return
45. }
46. cr.aborted = true
47. cr.conn.rwc.SetReadDeadline(aLongTimeAgo)
48. for cr.inRead {
49. cr.cond.Wait()
50. }
51. cr.conn.rwc.SetReadDeadline(time.Time{})
52. }
Execution blocks at line 2 and at line 49. abortPendingRead runs when the upgrader hijacks the connection. Line 47 sets a read deadline in the past on rwc, which should make the pending Read on line 2 return with a timeout error, yet on the affected machines that Read stays blocked. Because inRead is only cleared on line 36, after Read returns, the loop on lines 48 to 50 then waits forever. Tracing the same code on a working machine, I found that as soon as line 47 completes, the Read on line 2 returns. By comparison, this is what prevents the WebSocket connection from being established, and it suggests a problem in net/http or in the layers beneath it.
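The mechanism that lines 2 and 47 rely on can be exercised in isolation. The sketch below is my own test program, not code from net/http: the function name abortBlockedRead, the 100 ms pause, and the 5 s watchdog are assumptions chosen for illustration. It parks a goroutine in Read on an idle loopback TCP connection, then sets a read deadline in the past from another goroutine and checks that the Read returns with a deadline error.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"time"
)

// abortBlockedRead (hypothetical helper) blocks a Read on an idle loopback
// TCP connection, then moves the read deadline into the past from another
// goroutine. It reports how long the Read took to return after that call.
func abortBlockedRead() (time.Duration, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer ln.Close()

	// The client never writes, so the server-side Read has nothing to return.
	client, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return 0, err
	}
	defer client.Close()

	server, err := ln.Accept()
	if err != nil {
		return 0, err
	}
	defer server.Close()

	readErr := make(chan error, 1)
	go func() {
		var buf [1]byte
		_, err := server.Read(buf[:]) // blocks, like line 2 of backgroundRead
		readErr <- err
	}()

	// Give the reader time to park inside Read before aborting it.
	time.Sleep(100 * time.Millisecond)

	start := time.Now()
	// Same idea as line 47: a deadline in the past should wake the pending Read.
	if err := server.SetReadDeadline(time.Unix(1, 0)); err != nil {
		return 0, err
	}

	select {
	case err := <-readErr:
		if !errors.Is(err, os.ErrDeadlineExceeded) {
			return 0, fmt.Errorf("read returned %v, want a deadline error", err)
		}
		return time.Since(start), nil
	case <-time.After(5 * time.Second):
		return 0, errors.New("Read is still blocked 5s after SetReadDeadline")
	}
}

func main() {
	elapsed, err := abortBlockedRead()
	if err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("Read unblocked after", elapsed)
}
```

If this program reports a failure on an affected machine, the problem would seem to lie below net/http, in the net package or the operating system's networking stack. If it succeeds there, shortening or removing the 100 ms pause may be worth trying, since in the real handshake the hijack can follow the start of the background read almost immediately.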
Solution
I originally intended to dig further into the net package to look for a lower-level bug. Before doing so, I tried adding a short sleep before line 2 of the code above (as shown below), and this immediately resolved the problem. The WebSocket connection is now established normally even on the affected machines. It's really amazing!
1. func (cr *connReader) backgroundRead() {
2. time.Sleep(time.Microsecond)
3. n, err := cr.conn.rwc.Read(cr.byteBuf[:])
4. cr.lock()
What I expect
Although this workaround makes the problem go away, I still do not understand the root cause, and I wonder whether there is a better way to address it. I would also like to know whether this points to a bug in the net package that needs to be fixed and, if so, how it should be fixed. It is hard to trust code whose behaviour changes when a one-microsecond sleep is added before a single line.
I hope someone can identify the root cause and suggest a more conventional, reliable, and safe way to make the WebSocket connection succeed on the affected machines.