Unexpected performance degradation with multiple goroutines

Hi,
I’m wondering why my data-generation algorithm doesn’t perform better when two goroutines are started to generate the data.
I ran the test on a multi-core machine, so the two goroutines can run in parallel. Each of the two goroutines generates only half of the total data volume. I therefore assumed that the total running time should be about half as long as with a single goroutine.
It would be great if someone could review the attached code sample below. There might be something wrong with my approach, but I couldn’t identify it.
Thank you!

package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// gendata generates numElements random ints in [0, 10), splitting the work
// evenly across numGoRoutines goroutines that each fill their own sub-slice.
func gendata(numGoRoutines, numElements int) []int {
	var wg sync.WaitGroup
	wg.Add(numGoRoutines)
	rand.Seed(time.Now().UnixNano()) // seed the shared global source

	ssi := make([][]int, numGoRoutines)
	for i := 0; i < numGoRoutines; i++ {
		ssi[i] = make([]int, 0, numElements/numGoRoutines)
		go func(i int) {
			defer wg.Done()
			for j := 0; j < numElements/numGoRoutines; j++ {
				ssi[i] = append(ssi[i], rand.Intn(10))
			}
		}(i)
	}
	wg.Wait()

	// Create the result by appending all the sub-slices into one result slice
	var res []int
	for i := 0; i < numGoRoutines; i++ {
		res = append(res, ssi[i]...)
	}
	return res
}

func main() {
	n := 100000000

	// One Go routine
	g := 1
	t1 := time.Now()
	gendata(g, n)
	t2 := time.Now()
	fmt.Println("Data generation of", n, "elements with", g, "goroutines took", t2.Sub(t1))

	// Two Go routines
	g = 2
	t1 = time.Now()
	gendata(g, n)
	t2 = time.Now()
	fmt.Println("Data generation of", n, "elements with", g, "goroutines took", t2.Sub(t1))
}

math/rand produces a deterministic sequence of pseudo-random values. To keep that sequence consistent when it is called from multiple goroutines, the global source is protected with a mutex, so adding goroutines won’t help: they all serialize on the same lock.

Instead, create a separate source with math/rand.NewSource and pass it to rand.New in each goroutine. A rand.Source must not be shared between goroutines; only the global source is safe for concurrent use. That way multiple cores can generate values independently.
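
A minimal sketch of that suggestion (function names and sizes are my own, not from the original post): each goroutine generates its part with its own rand.Rand built from a dedicated source.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// genPart fills count ints in [0, 10) using a dedicated source, so no
// goroutine ever touches the mutex-protected global source.
func genPart(seed int64, count int) []int {
	r := rand.New(rand.NewSource(seed)) // per-goroutine source, not shared
	out := make([]int, count)
	for i := range out {
		out[i] = r.Intn(10)
	}
	return out
}

func main() {
	var wg sync.WaitGroup
	parts := make([][]int, 2)
	for i := range parts {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Offset the seed per goroutine so the parts differ.
			parts[i] = genPart(time.Now().UnixNano()+int64(i), 4)
		}(i)
	}
	wg.Wait()
	fmt.Println(parts)
}
```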


That’s a good point, and it improved the performance. Now the runtime for data generation with one goroutine and with two goroutines is almost equal.

Nevertheless, I expected the runtime to be roughly halved with the two-goroutine approach.
I also tested with a fixed value for the data to be appended to the slice, e.g.

ssi[i] = append(ssi[i], 1)

The result is the same: the runtime is identical for both variants. It seems there is something else influencing the runtime of the two goroutines. I still wonder why this happens…

I recommend looking into the runtime/pprof package to investigate where the time is being spent.
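
For example (a minimal sketch; the file name and the workload are placeholders for the real gendata call):

```go
package main

import (
	"fmt"
	"log"
	"os"
	"runtime/pprof"
)

// workload stands in for the code being profiled, e.g. gendata(2, n).
func workload() int {
	sum := 0
	for i := 0; i < 10_000_000; i++ {
		sum += i
	}
	return sum
}

func main() {
	// Write a CPU profile while the workload runs; inspect it afterwards
	// with: go tool pprof cpu.prof
	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	fmt.Println("sum:", workload())
}
```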


I did more research and was able to further improve the runtime of the function.
In addition to the change regarding the random source (thanks for that hint) I have changed the following two things:

  1. The loop bound of the for-loop is computed once before entering the loop instead of on every iteration
  2. New elements are written directly at their slice indices instead of being appended

func gendata(numGoRoutines, numElements int) []int {
	var wg sync.WaitGroup
	wg.Add(numGoRoutines)

	ssi := make([][]int, numGoRoutines)
	for i := 0; i < numGoRoutines; i++ {
		ssi[i] = make([]int, numElements/numGoRoutines)
		go func(i int) {
			defer wg.Done()
			s := rand.NewSource(time.Now().UnixNano())
			r := rand.New(s)
			cond := numElements / numGoRoutines
			for j := 0; j < cond; j++ {
				ssi[i][j] = r.Intn(10)
			}
		}(i)
	}
	wg.Wait()

	// Create the result by appending all the sub-slices into one result slice
	var res []int
	for i := 0; i < numGoRoutines; i++ {
		res = append(res, ssi[i]...)
	}
	return res
}

Of course! You’re right; I’m frustrated at myself for not noticing this! Every time you append to the slice, you’re rewriting the new length back into ssi[i], which is adjacent in memory to ssi[i±1]; this likely results in false sharing, which can nullify any benefit of parallelism.


So every single access to ssi causes a cache miss. How does this nullify the benefits of parallelism, when one goroutine or many goroutines all incur the same number of cache misses?

@sabitor the variable name ssi is very poorly chosen. Give it a significant name, and it would improve readability by a lot.

No, not a cache miss: the ssi[0] and ssi[1] slices are adjacent in memory, and every time you append to a slice, its length field is written, which “clobbers” the cache line that both slice headers live on, so other cores need to keep reloading their copies of the slice from main memory.

If you preallocate the slices so that you’re writing into their backing arrays but not into the actual slice headers, each core writes to its own memory in different cache lines, so you don’t run into the false-sharing situation.


Why would the other goroutines keep reloading? How does that work in practice?
Assume goroutine A modifies a slice X and writes that to slow memory. That slice’s len and cap will also change.

Then another goroutine B wants to do something with another slice Y in the same cache line. How does that work? The whole machine will need to read the cache line once B wants to change Y’s len or cap (before it modifies it). All goroutines share the same cache.

What do you mean by reload? The word “reload” seems confusing. Once goroutine B reads the cap/len/pointer of slice Y, that is it. It will not re-read those (and the cache line) until the next time it executes another instruction reading anything from that cache line. That is a simple cache miss (the cache line that B holds was invalidated precisely when A modified X).

There is no mechanism pushing information to B (a “reload”); B simply causes the machine to pull the cache line again (a cache miss).

Am I correct? What am I missing? It seems I do not understand how invalidating a cache line works with respect to the goroutines sharing a single copy of it; my assumption is that they will only read it again (once) when they next want information from it.

Disclaimer: I am not a performance expert; this stuff interests me, but I recommend consulting an actual expert for anything important. I’m also open to (polite) criticism. We’re all learners here.

I’m not talking about goroutines as much as about the hardware CPU cores that execute OS-level threads in parallel. L3 caches seem to be shared between all of a processor’s cores, but L1 caches seem to never be shared. L2 cache on my legacy Core i3 seems to be per-core, but on modern Core i9 (and perhaps i7, i5, and i3) CPUs it looks like L2 is shared between pairs of cores. Intel processor caches are “exclusive,” so data loaded from main memory is fetched into L1 and doesn’t get flushed back to main memory until the affected cache lines are evicted from L1 → L2 → L3 → main memory, or a locked operation forces the cores to synchronize.

Reads from/writes to the same memory location must be synchronized with hardware interlocks (in the case of x86, a special lock prefix to the relevant CPU instructions) to guarantee observed sequential consistency. Reads from/writes to different memory locations, even if they’re on the same cache line, do not need explicit locks. If the locations are on the same cache line, the memory system will handle the locking so that different cores’ writes to different addresses don’t clobber each other.

But when goroutine A writes to a memory location that resides on the same cache line as goroutine B’s data (e.g. goroutine A writes its own slice’s updated length after an append), A’s write invalidates B’s copy of the cache line, so B will re-read its slice (and the cache line).

Because the fundamental transfer unit that the memory system deals with between caches and main memory is the cache line, it needs to handle different threads writing to different memory addresses in the same cache line by somehow merging different cores’ modifications to the same cache lines.

When we first learn about the gotchas of parallel programming/multi-threading, we often see an example like this:

  1. Thread A loads your bank balance of $0 from address 0x100,
  2. Thread B also loads your bank balance of $0 from address 0x100,
  3. Thread A adds $100 to its copy of the $0 balance it loaded
  4. Thread B adds $10 to its copy of the $0 balance it loaded
  5. Thread A writes its result back to address 0x100 so address 0x100 = $100
  6. Thread B (over)writes its result back to address 0x100 so address 0x100 = $10

But because the memory system is dealing with 64-byte cache lines, if there were no implicit locking on the cache lines, it would be like this:

  1. Thread A loads the cache line containing your bank balance of $0 from address 0x100,
  2. Thread B loads the same cache line containing someone else’s bank balance of $0 from address 0x108,
  3. Thread A adds $100 to its copy of your $0 balance it loaded
  4. Thread B adds $10 to its copy of someone else’s $0 balance it loaded
  5. Thread A writes its result back to the entire cache line of address 0x100 so address 0x100 = $100 and address 0x108 = $0
  6. Thread B (over)writes its result back to the entire cache line of address 0x108 so address 0x100 = $0 and address 0x108 = $10.
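
The clobbering described above can be made visible with a small experiment of my own (not from this thread): two goroutines each increment their own counter, first with both counters on the same cache line, then padded onto separate lines. Timings vary by CPU, so no exact numbers are claimed.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// packed puts both counters on one 64-byte cache line (false sharing).
type packed struct {
	a, b int64
}

// padded pushes b onto a different cache line than a.
type padded struct {
	a int64
	_ [56]byte // padding so a and b do not share a 64-byte cache line
	b int64
}

// bump runs two goroutines that each increment their own counter n times.
func bump(n int, pa, pb *int64) time.Duration {
	start := time.Now()
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for i := 0; i < n; i++ {
			*pa++
		}
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < n; i++ {
			*pb++
		}
	}()
	wg.Wait()
	return time.Since(start)
}

func main() {
	const n = 50_000_000
	var p packed
	var q padded
	// On a typical multi-core machine the padded variant tends to be
	// noticeably faster; the difference is the cost of false sharing.
	fmt.Println("same cache line:", bump(n, &p.a, &p.b))
	fmt.Println("padded apart:   ", bump(n, &q.a, &q.b))
}
```
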

I was not aware that the L1 cache is per core. That indeed creates a situation where more cores mean more cache invalidation (one goroutine modifies a cache line, so all the other cores that have a local copy of that cache line need to reload it).

I am sure that the second scenario would never happen, precisely because of the implicit locking on cache lines (that is what coherent shared memory means).

I was able to optimize it even further by allocating the result slice up front and assigning each participating goroutine its own dedicated partition of that slice. This way I could get rid of the last step, which assembled the result slice from the goroutines’ intermediate results.

func gendata(numGoRoutines, numElements int) []int {
	var wg sync.WaitGroup
	wg.Add(numGoRoutines)

	res := make([]int, numElements)                  // result slice
	partitionElements := numElements / numGoRoutines // number of elements per goroutine

	for i := 0; i < numGoRoutines; i++ {
		go func(i int) {
			defer wg.Done()
			s := rand.NewSource(time.Now().UnixNano())
			r := rand.New(s)
			offset := i * partitionElements // offset to calculate the index for the data insertion
			for j := 0; j < partitionElements; j++ {
				res[j+offset] = r.Intn(10)
			}
		}(i)
	}
	wg.Wait()

	return res
}
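
If it helps anyone trying this at home, here is a self-contained way to sanity-check and time the final version in one file (the gendata body is copied from above; the timing loop over goroutine counts is my own sketch):

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// gendata: the final version from above; each goroutine writes into its
// own partition of the preallocated result slice.
func gendata(numGoRoutines, numElements int) []int {
	var wg sync.WaitGroup
	wg.Add(numGoRoutines)

	res := make([]int, numElements)                  // result slice
	partitionElements := numElements / numGoRoutines // elements per goroutine

	for i := 0; i < numGoRoutines; i++ {
		go func(i int) {
			defer wg.Done()
			r := rand.New(rand.NewSource(time.Now().UnixNano() + int64(i)))
			offset := i * partitionElements // start index of this partition
			for j := 0; j < partitionElements; j++ {
				res[j+offset] = r.Intn(10)
			}
		}(i)
	}
	wg.Wait()

	return res
}

func main() {
	// Time the same workload with increasing goroutine counts.
	for _, g := range []int{1, 2, 4} {
		start := time.Now()
		res := gendata(g, 10_000_000)
		fmt.Printf("%d goroutine(s): %d elements in %v\n", g, len(res), time.Since(start))
	}
}
```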