Golang GC performance difference between platforms

Hi,
I am trying to understand why the Go GC is slower on a platform with more resources. Our stack runs on two platforms:
Platform A: 48 CPUs and 384 GB of memory
Platform B: 96 CPUs and 768 GB of memory

Both platforms have exactly the same CPU type.

Essentially, B has double the CPU and memory. Yet when we were doing performance testing on our stack, Platform B had lower throughput. To prove my point, I wrote a synthetic benchmark that also shows the performance difference.

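package main

import (
        "bytes"
        "testing"
)

// readSkinny copies data into a freshly allocated slice (one allocation per call).
func readSkinny(data []byte) []byte {
        var dest []byte
        dest = append(dest, data...)
        return dest
}

func BenchmarkCopySkinny(b *testing.B) {
        b.ReportAllocs()
        // 4 KiB of input, built before the timer is reset.
        textData := bytes.Repeat([]byte{0x10}, 4096)
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
                readSkinny(textData)
        }
        // Benchmark ends here. Stop timer.
        b.StopTimer()
}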

The above benchmark gives the following results:

Platform A: BenchmarkCopySkinny-48    	  100000	       782.6 ns/op
Platform B: BenchmarkCopySkinny-96    	  100000	      1741 ns/op

I have tested with Go 1.15 and Go 1.16, and I see the performance hit on both versions.

I would appreciate any insight as to why the Go GC seems to take longer on the more powerful platform, and possibly a workaround to make Platform B perform at least as well as Platform A.
I have tried changing GOMAXPROCS, but that does not seem to make any difference.

I have pasted a few screenshots from the trace profile. It looks like the GC pause does take longer.
I can also see from the flame graph in the CPU profile that the sweep phase takes longer on Platform B.
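For anyone who wants numbers rather than screenshots, here is a minimal sketch of one way to compare GC pauses between the two platforms using runtime/debug.ReadGCStats. The helper name measureGC, the sink variable, and the loop count are made up for illustration. Note that this only captures stop-the-world pause time, so it would not show sweep work that runs concurrently with the program.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/debug"
	"time"
)

// sink keeps the copied slices reachable so the allocations are not optimized away.
var sink []byte

// measureGC runs fn and returns how many GC cycles completed while it ran,
// together with the total stop-the-world pause time of those cycles.
func measureGC(fn func()) (cycles int64, pause time.Duration) {
	var before, after debug.GCStats
	debug.ReadGCStats(&before)
	fn()
	debug.ReadGCStats(&after)
	return after.NumGC - before.NumGC, after.PauseTotal - before.PauseTotal
}

func main() {
	data := bytes.Repeat([]byte{0x10}, 4096)
	cycles, pause := measureGC(func() {
		// Same allocation pattern as the benchmark: one 4 KiB copy per iteration.
		for i := 0; i < 200000; i++ {
			sink = append([]byte(nil), data...)
		}
	})
	fmt.Printf("GOMAXPROCS=%d cycles=%d total pause=%v\n",
		runtime.GOMAXPROCS(0), cycles, pause)
}
```

Running the same binary on both platforms and comparing the cycle count and total pause should show whether the difference is in the pauses themselves or elsewhere.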

[Screenshots: trace profile, Platform A vs. Platform B]


My first guess is that the memory subsystem may be the bottleneck. If you constrain the number of goroutines on Platform B to 48, do the times get closer to Platform A?

By constraining the number of goroutines, you mean setting GOMAXPROCS, yes? The times do get a little closer. I am pasting the results below.

Platform A: BenchmarkCopySkinny-48    	  100000	       792.4 ns/op	    4096 B/op	       1 allocs/op
Platform B: BenchmarkCopySkinny-48    	  100000	      1567 ns/op	    4096 B/op	       1 allocs/op
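As a rough sketch of how this comparison could be repeated at several GOMAXPROCS values in one run, the benchmark body can be driven through testing.Benchmark from a plain program. The helper name benchAt and the sink variable are made up for illustration; with a regular _test.go file, the -cpu flag of go test (for example -cpu 48,96) serves the same purpose.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"testing"
)

// sink keeps the result alive so the copy is not optimized away.
var sink []byte

// readSkinny is the function under test from the original post.
func readSkinny(data []byte) []byte {
	var dest []byte
	dest = append(dest, data...)
	return dest
}

// benchAt runs the copy benchmark with GOMAXPROCS temporarily set to procs
// and restores the previous value before returning.
func benchAt(procs int) testing.BenchmarkResult {
	prev := runtime.GOMAXPROCS(procs)
	defer runtime.GOMAXPROCS(prev)

	data := bytes.Repeat([]byte{0x10}, 4096)
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			sink = readSkinny(data)
		}
	})
}

func main() {
	// Compare a single P against every CPU the runtime can see.
	for _, procs := range []int{1, runtime.NumCPU()} {
		r := benchAt(procs)
		fmt.Printf("GOMAXPROCS=%d\t%s\t%s\n", procs, r.String(), r.MemString())
	}
}
```

Keep in mind that GOMAXPROCS only limits how many threads run Go code at once. It does not control which CPUs or memory banks the OS places them on.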

I was hoping to see it within the same order of magnitude. I suggest you open an issue here: https://github.com/golang/go/issues

Are you running directly on the hardware or in a VM?

Both platforms are VMs.

Are the VMs just for isolation or are multiple VMs running on the hosts? If the latter, the other VMs will affect your performance. If they’re doing anything memory intensive, that will steal resources from the VM from which you’re running the benchmark.

Just an update: I did resolve this issue. I had asked the same question on gophers.slack.com, and one angel replied, asking me to check the NUMA setup on the two platforms.

The problem had to do with the NUMA setup on Platform B. Platform B had 2 NUMA nodes (output pasted below), whereas Platform A had only one NUMA node.

The Go scheduler is not NUMA-aware; this document describes a proposal for making the scheduler NUMA-aware.

When I bound the process to one of the NUMA nodes using numactl --cpunodebind 0 XXX and ran the test, both platforms showed the exact same performance.

Platform B numactl output

> numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 0 size: 382819 MB
node 0 free: 333743 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 1 size: 382904 MB
node 1 free: 369935 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
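If it helps anyone check their own machines, below is a small sketch that pulls the per-node CPU lists out of text in the format shown above. The function name parseNodeCPUs is hypothetical, and the sample string is a shortened, made-up input rather than real output; the parser only assumes the "node N cpus: ..." line shape seen in this post.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseNodeCPUs scans text in the format printed by `numactl --hardware`
// and returns the CPU IDs listed on each "node N cpus: ..." line.
func parseNodeCPUs(output string) (map[int][]int, error) {
	nodes := make(map[int][]int)
	for _, line := range strings.Split(output, "\n") {
		fields := strings.Fields(line)
		// Only lines shaped like: node <id> cpus: <cpu> <cpu> ...
		if len(fields) < 3 || fields[0] != "node" || fields[2] != "cpus:" {
			continue
		}
		id, err := strconv.Atoi(fields[1])
		if err != nil {
			return nil, fmt.Errorf("bad node id %q: %w", fields[1], err)
		}
		cpus := make([]int, 0, len(fields)-3)
		for _, f := range fields[3:] {
			cpu, err := strconv.Atoi(f)
			if err != nil {
				return nil, fmt.Errorf("bad cpu id %q: %w", f, err)
			}
			cpus = append(cpus, cpu)
		}
		nodes[id] = cpus
	}
	return nodes, nil
}

func main() {
	// Shortened, made-up sample in the same format as the output above.
	sample := "available: 2 nodes (0-1)\n" +
		"node 0 cpus: 0 1 2 3\n" +
		"node 0 size: 382819 MB\n" +
		"node 1 cpus: 4 5 6 7\n" +
		"node distances:\n" +
		"node   0   1\n"

	nodes, err := parseNodeCPUs(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("NUMA nodes found:", len(nodes))
}
```

More than one node in the result is a hint that a process spread across all CPUs may be paying for cross-node memory access, which is what the node distances of 21 versus 10 above indicate.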

Very interesting. Thanks for posting the solution back!