Goroutine slowdown

This is a learning exercise with goroutines.
I have a slice of structs and a func “doClear()” that works on each of those structs.
When I run the code snippet below it completes in about 355ms. If I add a “go” before the doClear call - to make it a goroutine - it takes about 10 times longer (3.588s). When I run the goroutine version with “go run -race .” it doesn’t report any race conditions. Both versions produce the same result.

func (b *Bitmap) ClearPrime(prime uint64) {
  var wg sync.WaitGroup
  for _, s := range b.segments {
    wg.Add(1)
    go s.doClear(&wg, prime) // doClear calls wg.Done() when finished
  }
  wg.Wait()
}

Any clues as to how to track down what is going on here?

Jon

Without more context (for example: what exactly does doClear() do? How big or complex is the computation inside it? How many segments are you testing this on?), I can only speculate.

The overhead of launching a goroutine is minimal, but not zero. If the computation inside doClear only takes a few nanoseconds, then the overhead of launching a goroutine, which is likely on the order of a microsecond, dominates the total running time.

You could use pprof to pin down the exact functions that are causing this slowdown. You could also try using a fixed set of goroutines that take work to be done over a channel (this SO post has some ideas for this); a sketch of that pattern follows.
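For illustration, here is a minimal sketch of that worker-pool pattern - the segment type, its clear method, and all names here are stand-ins, not your actual code:

package main

import (
  "runtime"
  "sync"
)

// segment is a stand-in for one piece of the bitmap.
type segment struct{ bits []uint64 }

// clear is a placeholder for the real doClear work on one segment.
func (s *segment) clear(prime uint64) {
  for i := range s.bits {
    s.bits[i] &^= 1 << (prime % 64) // dummy stand-in work
  }
}

// clearPrime fans segments out to a fixed pool of workers rather than
// spawning one goroutine per segment.
func clearPrime(segments []*segment, prime uint64) {
  work := make(chan *segment)
  var wg sync.WaitGroup

  // One worker per CPU is a sensible default for CPU-bound work.
  for i := 0; i < runtime.NumCPU(); i++ {
    wg.Add(1)
    go func() {
      defer wg.Done()
      for s := range work {
        s.clear(prime)
      }
    }()
  }

  for _, s := range segments {
    work <- s
  }
  close(work) // lets each worker's range loop terminate
  wg.Wait()
}

func main() {
  segs := make([]*segment, 64)
  for i := range segs {
    segs[i] = &segment{bits: make([]uint64, 1024)}
  }
  clearPrime(segs, 7)
}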

Hi Samyak,

thanks. It’s a Sieve of Eratosthenes. I’m trying to get a working solution for the SPOJ PRIME problem. The doClear is clearing out all the entries in a bitmap for the current prime. I split the sieve bitmap into segments to enable clearing using multiple goroutines. The bitmap is large - the problem spec said up to 1,000,000,000.
I think you’re right … I have to set up a goroutine for each segment that waits for the next prime to clear - something like the sketch below.
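Roughly what I have in mind - just a sketch, with the channel plumbing and names made up: one long-lived goroutine per segment, reused for every prime, so the spawn cost is paid once up front instead of once per prime per segment.

package main

import "sync"

type segment struct{ bits []uint64 }

func (s *segment) doClear(prime uint64) {
  // placeholder for clearing multiples of prime in this segment
}

type sieve struct {
  feeds []chan uint64  // one prime feed per segment worker
  wg    sync.WaitGroup // barrier: all segments done with the current prime
}

func newSieve(segments []*segment) *sieve {
  sv := &sieve{}
  for _, s := range segments {
    ch := make(chan uint64)
    sv.feeds = append(sv.feeds, ch)
    go func(s *segment, ch <-chan uint64) {
      for p := range ch {
        s.doClear(p)
        sv.wg.Done()
      }
    }(s, ch)
  }
  return sv
}

// ClearPrime hands one prime to every segment worker and waits until
// all of them have finished clearing it.
func (sv *sieve) ClearPrime(prime uint64) {
  sv.wg.Add(len(sv.feeds))
  for _, ch := range sv.feeds {
    ch <- prime
  }
  sv.wg.Wait()
}

func main() {
  segs := make([]*segment, 8)
  for i := range segs {
    segs[i] = &segment{bits: make([]uint64, 1024)}
  }
  sv := newSieve(segs)
  for _, p := range []uint64{2, 3, 5, 7} {
    sv.ClearPrime(p)
  }
}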

Jon

If doClear is very fast, then the cost of spawning a goroutine is much higher than the cost of doClear itself, and that is why you see the 10 times increase in running time.

You are spawning one goroutine per segment. Check how large len(b.segments) is. A goroutine seems to be about 3 times cheaper to create than a Linux thread, and it does not require an OS context switch, but the cost of many goroutines does add up.
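If you want a concrete number for your machine, here is a minimal benchmark sketch - save it in a _test.go file and run “go test -bench .”:

package main

import (
  "sync"
  "testing"
)

// BenchmarkSpawn measures the cost of creating and scheduling a goroutine
// that does no work; on typical hardware this lands somewhere between
// hundreds of nanoseconds and a few microseconds per spawn.
func BenchmarkSpawn(b *testing.B) {
  var wg sync.WaitGroup
  for i := 0; i < b.N; i++ {
    wg.Add(1)
    go func() {
      wg.Done()
    }()
  }
  wg.Wait()
}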

Another possible cause besides goroutine startup overhead could be cache invalidation (false sharing), especially in situations where many goroutines write to different parts of the same array.

In a nutshell, all CPU cores have local caches. If the same data is loaded into multiple caches, and one of the cores changes a part of that data, then the cache line holding that data is invalidated on all other cores and needs to be refreshed. That’s rather expensive and can slow down concurrent code considerably.

A more detailed explanation is here. Since your code manipulates parts of a single slice from multiple goroutines, cache invalidation might apply to your scenario; the toy program below shows the effect.
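As a toy illustration - assuming 64-byte cache lines, which is typical on x86-64, and with all names made up - compare two goroutines hammering counters that share a cache line against counters placed on separate lines:

package main

import (
  "fmt"
  "sync"
  "sync/atomic"
  "time"
)

const iters = 20_000_000

// run has two goroutines increment the two counters concurrently
// and returns the elapsed wall-clock time.
func run(a, b *uint64) time.Duration {
  start := time.Now()
  var wg sync.WaitGroup
  for _, p := range []*uint64{a, b} {
    wg.Add(1)
    go func(p *uint64) {
      defer wg.Done()
      for i := 0; i < iters; i++ {
        atomic.AddUint64(p, 1)
      }
    }(p)
  }
  wg.Wait()
  return time.Since(start)
}

func main() {
  var adjacent [2]uint64 // both counters sit on the same cache line
  var spread [9]uint64   // elements 0 and 8 are 64 bytes apart

  fmt.Println("same cache line:", run(&adjacent[0], &adjacent[1]))
  fmt.Println("separate lines: ", run(&spread[0], &spread[8]))
}

On most machines the padded version runs noticeably faster, even though both do exactly the same number of increments.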
