Why did a program that interacts with device drivers stop working in Go 1.20?

A colleague and I run a large project that has to control and receive data from a special-purpose PCI-express card, running custom firmware. This is a data acquisition system for high-speed x-ray and gamma-ray detectors, arrays of superconducting sensors (just for context). I can’t say whether Go was the optimal choice for the project–my colleague still argues that Rust would have been better–but I know it’s worked extremely well since we launched this in 2017 (as a replacement for a C++ monstrosity).

We talk to the PCIe card by opening/closing and reading/writing device-special files provided (obviously) by a device driver. There are control registers for configuration and a scatter-gather DMA for transferring the high-speed data (typically 20 to 200 MB/second, depending on one’s instrument configuration) to the computer RAM.

I find that Go 1.16, 1.17, 1.18, and 1.19 all run our program just fine. When I build the program with Go 1.20, however, the program hangs. The build succeeds, and the configuration steps (appear to) work correctly at run time, setting up the scatter-gather DMA cycle. When we try to read from the DMA buffer the first time, the program hangs. And this happens only in Go 1.20!

I’m afraid I cannot provide a minimal reproducible example, because you’d need our specific hardware (running our specific firmware) and the corresponding device drivers. I understand that I can’t give enough information to solve the problem.

Still, maybe someone can offer ideas? Is there something special I should know about Go 1.20 that might help me track down the problem? I’ve read the 1.20 release notes a dozen times, but maybe I’m missing the significance of the key point in there?

Some system facts: Ubuntu 22.04, Go 1.20.3, 16 GB RAM. (The same problem has also been noted on a different PC running Ubuntu 20.04.)

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.2 LTS
Release:	22.04
Codename:	jammy
$ go version
go version go1.20.3 linux/amd64
$ free
               total        used        free      shared  buff/cache   available
Mem:        16331116     6773036      173288       50804     9384792     9178968
Swap:        2097148       39936     2057212

$ lspci
00:00.0 Host bridge: Intel Corporation 4th Gen Core Processor DRAM Controller (rev 06)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller (rev 06)
00:01.1 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor PCI Express x8 Controller (rev 06)
00:14.0 USB controller: Intel Corporation 9 Series Chipset Family USB xHCI Controller
00:16.0 Communication controller: Intel Corporation 9 Series Chipset Family ME Interface #1
00:1a.0 USB controller: Intel Corporation 9 Series Chipset Family USB EHCI Controller #2
00:1b.0 Audio device: Intel Corporation 9 Series Chipset Family HD Audio Controller
00:1c.0 PCI bridge: Intel Corporation 9 Series Chipset Family PCI Express Root Port 1 (rev d0)
00:1c.3 PCI bridge: Intel Corporation 9 Series Chipset Family PCI Express Root Port 4 (rev d0)
00:1d.0 USB controller: Intel Corporation 9 Series Chipset Family USB EHCI Controller #1
00:1f.0 ISA bridge: Intel Corporation Z97 Chipset LPC Controller
00:1f.2 SATA controller: Intel Corporation 9 Series Chipset Family SATA Controller [AHCI Mode]
00:1f.3 SMBus: Intel Corporation 9 Series Chipset Family SMBus Controller
01:00.0 Unassigned class [ff00]: Altera Corporation Device 0004 (rev 01)
02:00.0 VGA compatible controller: NVIDIA Corporation TU117GL [T400 4GB] (rev a1)
02:00.1 Audio device: NVIDIA Corporation Device 10fa (rev a1)
04:00.0 Ethernet controller: Qualcomm Atheros Killer E220x Gigabit Ethernet Controller (rev 13)

The Altera (pci 01:00.0) is the PCIe device in question.

I have tried a few steps that seemed potentially relevant:

  • removing the deprecated syscall package, replacing it with golang.org/x/sys/unix
  • calling a function such as C.posix_memalign(...) directly from Go, versus calling a handwritten cgo wrapper function that in turn calls posix_memalign.
  • calling C.read(...) directly versus calling unix.Read(fd, buffer) with buffer being the result of a C.GoBytes(...) call on a previously allocated C pointer.

They all leave the Go 1.16-1.19 versions working, and the 1.20 version hanging.

For now, the workaround is to panic when the user builds with Go 1.20 and tries to use this particular data source. That hardly seems like a long-term solution, though.
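
A minimal sketch of that kind of version guard, assuming the check is done at run time against `runtime.Version()`. The helper name `builtWithGo120` is hypothetical and not taken from the actual program, and the sketch prints a message where the real workaround panics:

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// builtWithGo120 reports whether version (in the format returned by
// runtime.Version, e.g. "go1.20.3") names a Go 1.20 release.
// Hypothetical helper for illustration only.
func builtWithGo120(version string) bool {
	const prefix = "go1.20"
	if !strings.HasPrefix(version, prefix) {
		return false
	}
	rest := version[len(prefix):]
	// The character after the prefix must not be another digit, so that a
	// string such as "go1.200" is not mistaken for a 1.20 release.
	return rest == "" || rest[0] < '0' || rest[0] > '9'
}

func main() {
	if builtWithGo120(runtime.Version()) {
		fmt.Println("built with Go 1.20: refusing to open the PCIe data source")
		return
	}
	fmt.Println("Go version OK:", runtime.Version())
}
```

A build constraint such as `//go:build go1.20 && !go1.21` on a separate file would be another way to express the same guard at compile time.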

If you haven’t already, I would recommend you file an issue on the Go issue tracker. I suspect they might have some advice on what to check.

EDIT: I thought I had already asked this, but perhaps my post didn’t save: can you clarify what you mean by “hang”? Does the process just halt (0% CPU usage), or does it seem to get stuck in some sort of loop (100% CPU usage)? If it’s the latter, you might be able to find something by running your code in Delve to see where it’s getting “stuck.”

Good question. I have to check on that…

I can verify that “hang” here means halting with 0% CPU usage. With the VS Code Golang debugger (I assume it’s delve under the hood), I find that the hang is on the second line in the following:

	gobuffer := C.GoBytes(unsafe.Pointer(buffer), C.int(bufferLength))
	n, err := unix.Read(int(fd), gobuffer)

In this case, bufferLength is 33554432 (i.e., 2^25), the size of the data pointed to by buffer *C.char, which was allocated by a call to posix_memalign(...), and fd is the file descriptor of the open device-special file from which we read data. Or we would read it, if that call ever returned.

Thanks for the reply. I’ll see if I can come up with a valid issue to file. It’s not ideal when you cannot offer any reproducible problem, but I’m completely out of other ideas!

I did file an issue (#60211). I don’t know if it was the right thing to do, given the opaque nature of my problem. Perhaps it will lead somewhere useful, though.

It looks like Ian was able to figure it out! :+1:

I have to admit: I’m impressed that you found an actual bug. This is one of the more head-scratching posts I’ve seen here in a while and I’m glad it looks like you will get a resolution.

I am blown away to learn that this was an actual bug in [a library of] Go 1.20!!

For those not reading Golang issue #60211 and links therein, a summary: it seems that Go 1.20 was not correctly setting the O_NONBLOCK flag if the user opened a file like this:

file, err := os.OpenFile(myFileName, os.O_RDWR|syscall.O_NONBLOCK, 0666)

I’ve been using open-source software for a quarter of a century, and I don’t know that I’ve ever been tripped up by a true bug in a major package. Certainly not in something as major as a widely used programming language! I agree, @Dean_Davidson: I’m impressed with myself, too.

Many thanks to the Go Forum, especially @skillian for urging me to file an issue. I was sure the problem was with me–maybe I was relying on some undocumented behavior that changed? It’s not like I deeply understand this device or its driver, after all.

And thanks to the Golang developers for a quick catch and patch once they realized there was an actual bug. As I read the activity, the bug seems on track to be repaired in Go 1.20.5 (if that release materializes) and surely for 1.21.

Is the program blocked in I/O?
You can try reading one byte at a time to determine whether the problem is in I/O, based on whether the program hangs when reading a particular byte.
posix_memalign is a memory allocation function, so the sanitizer flags listed in go help build may also be relevant:

	-race
		enable data race detection.
		Supported only on linux/amd64, freebsd/amd64, darwin/amd64, darwin/arm64, windows/amd64,
		linux/ppc64le and linux/arm64 (only for 48-bit VMA).
	-msan
		enable interoperation with memory sanitizer.
		Supported only on linux/amd64, linux/arm64, freebsd/amd64
		and only with Clang/LLVM as the host C compiler.
		PIE build mode will be used on all platforms except linux/amd64.
	-asan
		enable interoperation with address sanitizer.
		Supported only on linux/arm64, linux/amd64.
		Supported only on linux/amd64 or linux/arm64 and only with GCC 7 and higher
		or Clang/LLVM 9 and higher.

You can try to detect if there are any memory issues.
