I’m trying to close a listener then open it again, and I get this error:
bind: address already in use
With this code:
// ln is listening on :8080
err = ln.Close() // succeeds, no error
if err != nil {
	log.Fatal(err)
}
ln2, err := net.Listen("tcp", ":8080")
if err != nil {
	log.Fatal(err) // bind: address already in use
}
I was wondering if SO_REUSEADDR had something to do with this, but as far as I know, that is already being used under the hood in the Go standard library when creating a new tcp listener.
Any ideas how I can re-bind to that address without delay?
Interestingly, both on my Mac and on the Go playground, if you use -addr="" or change defaultAddr to "" in the source (which I think just means it will bind to any open port, yeah?), it never rebinds to the same port; it binds to the previous attempt’s port + 1 instead. Not sure if this is significant; I don’t know precisely what binding to "" is specced to do.
This only happens for me when my program has restarted itself using exec.Command(os.Args[0], ...) and, in that command, it sets ExtraFiles to a list of file descriptors for listeners. (Similar to this method: http://grisha.org/blog/2014/06/03/graceful-restart-in-golang/) This lets the child process (itself) use the existing listeners without downtime.
In the “restarted” process, then: I close the listeners, immediately create new ones on the same addresses again, and it fails with “address already in use”. But if I pause 5 seconds after closing the listener (before creating the new listener), it succeeds.
The original process, where the listeners were created, doesn’t have this problem. In other words, if I don’t “restart” the process, I can close and create the listeners immediately, like @jdh’s program does. But if I do that same thing in a restart, it doesn’t work.
Here’s a program that reproduces what I’m describing:
You must close the os.Files themselves; I assume they have their own file descriptor, so closing the net.Listener doesn’t close them, and if those are left open, I assume the OS leaves that port bound, so you cannot rebind. This issue exists in both the parent and child processes.
There is a race condition in the handoff. Basically, if the child attempts to rebind before the parent closes its shared FD, the child will fail to rebind because the port is still bound.
My example program has flags to expose both of these failures.
Solutions
Close the files.
os.Signal the child process once the parent has closed its stuff, and only then let the child attempt to rebind.
I had discovered (concurrently with you) that calling .File() on a *net.TCPListener returns a duplicated file descriptor. So I started playing with closing those too, but couldn’t get the combination of the placement and ordering of the .Close() calls right. Seems you have, as both parent and child pass on my machine.
This makes sense, though. Both parent and child have two file descriptors for the socket that gets transferred over, so you have to close both in each case.
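To convince myself, here is a small single-process sketch of the duplicated-descriptor behaviour (rebindAfterClose is just a name I picked for the demo). It closes the listener first, tries to rebind while the duplicate from File() is still open, then closes the duplicate and tries again.

```go
package main

import (
	"fmt"
	"log"
	"net"
)

// rebindAfterClose returns the error from rebinding while the duplicated
// descriptor is still open, and the error from rebinding after it is closed.
func rebindAfterClose() (before, after error, setup error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, nil, err
	}
	addr := ln.Addr().String()

	// File() returns a duplicate; closing ln does not close it.
	f, err := ln.(*net.TCPListener).File()
	if err != nil {
		ln.Close()
		return nil, nil, err
	}
	if err := ln.Close(); err != nil {
		f.Close()
		return nil, nil, err
	}

	// The duplicate keeps the socket in LISTEN, so this bind should fail.
	if ln2, err := net.Listen("tcp", addr); err != nil {
		before = err
	} else {
		ln2.Close()
	}

	// Once the duplicate is closed too, the address is free again.
	if err := f.Close(); err != nil {
		return before, nil, err
	}
	if ln3, err := net.Listen("tcp", addr); err != nil {
		after = err
	} else {
		ln3.Close()
	}
	return before, after, nil
}

func main() {
	before, after, err := rebindAfterClose()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("rebind with duplicate still open:", before)
	fmt.Println("rebind after closing duplicate:  ", after)
}
```

On my understanding, the first rebind reports "address already in use" and the second succeeds, which matches the behaviour described above.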
Let me get over being really excited about this and work your methodology into my program… huge thank you! I couldn’t find any other explanation for this.
Update: I rewired my program and it’s working. Hours of debugging have come to an end.
I think I found the solution to your issue. I’m definitely not a socket expert, but I was able to get your example working. Short version: the kernel puts a TCP connection into the TIME_WAIT state when it is closed gracefully, meaning you can’t reuse the address until that is finished. Normally in C, an easy way to “bypass” this “problem” is to set the SO_REUSEADDR socket option to true, which you won’t have access to in the net package. Luckily, the net package is kind enough to set this for us by default on a TCPListener, as in your example application. Obviously, you still get the bind: address already in use error; this SO answer explains in depth why that still happens even though SO_REUSEADDR is enabled on the socket.
TL;DR: either use straight syscalls so you can enable SO_REUSEPORT on your socket, which isn’t portable to all systems, or (the better solution) bind the parent process to 0.0.0.0:1234 and the child to 192.168.0.1:1234 (or whatever the machine IP is).
Hey Austin, thanks for the answer! I didn’t know much about TIME_WAIT (or CLOSE_WAIT) until this bug, in researching it. I definitely spent some time with this code:
but it was a dead end; like you said, Go already does this. So then I tried migrating as much to syscall as I could, (syscall.Close(), etc) just to see if I could bypass any Go wrapping, but this failed too so I reverted.
Dave Cheney suggested (on Slack) that I use netstat -anp tcp to check the socket state. I saw that it was actually still in LISTEN state when it should have been in CLOSE_WAIT. So I started playing with closing more files like Joe was talking about above, which did lead to the final solution.