We have just released pargo, a library for parallel programming in Go.

While Go is primarily designed for concurrency, it is also usable to some extent for parallel programming. The Pargo library provides convenience functions for turning otherwise sequential algorithms into parallel algorithms, with the goal of improving performance.

See https://github.com/exascience/pargo for the library and http://godoc.org/github.com/exascience/pargo for the documentation.

How should I choose the threshold for parallel.Range?

Running:

package main

import (
	"fmt"
	"log"
	"runtime"
	"sync"

	"github.com/ExaScience/pargo/parallel"
)

func main() {
	log.SetFlags(log.Lshortfile)

	fmt.Println("GOMAXPROCS", runtime.GOMAXPROCS(0))

	total := 1000000

	run(total)
	run(total - 1)
}

func run(total int) {
	fmt.Println("total:", total)
	for threshold := -5; threshold < 6; threshold++ {
		mu := &sync.Mutex{}
		count := 0

		// Count how many subranges parallel.Range creates for this threshold.
		parallel.Range(0, total, threshold, func(low, high int) {
			mu.Lock()
			defer mu.Unlock()

			count++
		})

		fmt.Printf("threshold\t%v\tcount\t%v\n", threshold, count)
	}
}

produces:

GOMAXPROCS 2
total: 1000000
threshold	-5	count	262144
threshold	-4	count	262144
threshold	-3	count	475712
threshold	-2	count	524288
threshold	-1	count	1000000
threshold	0	count	1000000
threshold	1	count	2
threshold	2	count	4
threshold	3	count	8
threshold	4	count	8
threshold	5	count	16
total: 999999
threshold	-5	count	262144
threshold	-4	count	262144
threshold	-3	count	475711
threshold	-2	count	524288
threshold	-1	count	999999
threshold	0	count	999999
threshold	1	count	3
threshold	2	count	7
threshold	3	count	8
threshold	4	count	15
threshold	5	count	16

I don’t understand what it is intended to do.

Thanks a lot for your interest. It’s in the documentation, although maybe a bit hard to find. See the introduction at http://godoc.org/github.com/ExaScience/pargo/parallel where it says the following:

"All of these functions divide up the input ranges according to a threshold parameter. See ComputeEffectiveThreshold in package pargo for more details. Useful threshold parameter values are 1 to evenly divide up the range across the available logical CPUs (as determined by runtime.GOMAXPROCS(0)); or 2 or higher to additionally divide that number by the threshold parameter. Use 1 if you expect no load imbalance, 2 if you expect some load imbalance, or 3 or more if you expect even more load imbalance.

A threshold parameter value of 0 divides up the input range into subranges of size 1 and yields the most fine-grained parallelism. Fine-grained parallelism (with a threshold parameter of 0, or 2 or higher) only pays off if the work per subrange is sufficiently large to compensate for the scheduling overhead.

A threshold parameter value below zero can be used to specify the subrange size directly, which becomes the absolute value of the threshold parameter value."

Apologies for the documentation reading more like a specification than a tutorial. If you have a suggestion for how to explain this better, or where to place the explanation, I’d be interested to know.

Thanks,
Pascal

Missed that in the introduction. I guess I stopped reading when I saw the Range functions, which were what I was looking for. It would be helpful if that explanation were included in the docs for pargo.ComputeEffectiveThreshold.


For the threshold of 1 in my example above, the number of ranges chosen when total is 999999 is three. To “evenly divide up the range across the available logical CPUs”, I would expect the most efficient choice to result in a count of two, with the number of items handled by each subrange differing by one. Instead, three subranges are used, which will result in context switching as the three goroutines fight to share two cores.


What exactly do you mean by “load imbalance”? How should I pick the threshold, and why is this approach better than just stating the number of subranges to use?

You are right: ComputeEffectiveThreshold rounded down instead of up, which leads to the oversubscription you are describing. I just committed a change to the GitHub repository that should fix this. The documentation should also be better now, with some more reasonable recommendations.

With regard to load imbalances: evenly dividing up work over the available CPUs only works well if it is more or less guaranteed that each piece of work (each task) takes the same amount of time to finish. If the tasks have different execution times, which is more common, an even division results in having to wait for the slowest task, leaving the other CPUs idle for a potentially significant amount of time. This is what “load imbalance” means.

In case of such load imbalances, it’s better to split up the work into smaller tasks, so that a CPU that becomes idle can pick up one of the remaining tasks to keep itself busy. The more fine-grained the tasks, the more the CPUs will remain busy and help the overall work finish faster. On the other hand, fine-grained tasks also increase the scheduling overhead, so you need to find the right balance between fine-grained and coarse-grained tasks. It’s difficult to come up with good recommendations here that work well for all cases, so you have to experiment.

I have added links to a Cilk paper and to an algorithms book that uses Cilk (Pargo is loosely based on Cilk), in case there is interest in learning more about the underlying concepts.

I hope this helps. Thanks a lot for the feedback, this was very useful.

Pascal

I just committed a change to the GitHub repository which hopefully fixes this.

Seems to work:

GOMAXPROCS 2
total: 1000000
threshold	-5	count	262144
threshold	-4	count	262144
threshold	-3	count	475712
threshold	-2	count	524288
threshold	-1	count	1000000
threshold	0	count	1000000
threshold	1	count	2
threshold	2	count	4
threshold	3	count	8
threshold	4	count	8
threshold	5	count	16
total: 999999
threshold	-5	count	262144
threshold	-4	count	262144
threshold	-3	count	475711
threshold	-2	count	524288
threshold	-1	count	999999
threshold	0	count	999999
threshold	1	count	2
threshold	2	count	4
threshold	3	count	8
threshold	4	count	8
threshold	5	count	16

Your description of load imbalance corresponds with mine.

I took a quick pass through the paper. Go’s scheduler is also work-stealing.

I also understand controlling the coarseness of tasks to tune efficiency. When I parallelize code, as I did for this tutorial, I find the minimal scalable unit of work so I can tune the coarseness to match the machine.

What I don’t understand is how pargo.ComputeEffectiveThreshold helps. In the end I “have to experiment”, but the only thing I can do is change the threshold, which produces large changes (e.g., from 2 to 3) or no change at all (e.g., from 3 to 4). Some work I did today seemed to perform best with 3 goroutines (on my two-core system). There might be some way to get that out of the threshold, but it certainly wouldn’t be easy enough for useful experimentation. Setting the number of goroutines used seems so much easier to understand and gives finer control for managing load imbalances.

I accept that I may be missing something about threshold. How is it better than directly controlling the number of goroutines used?


You can always also use parallel.Do to spawn an exact number of goroutines, or use negative threshold values to specify exact subrange sizes (for example -1 * (len(slice) / 3)).

In the general case, though, it’s better to use recursive divide-and-conquer to create a tree of tasks. This adapts best to different configurations (different CPU counts, and so on). It may not always give the best performance, but it typically gives reasonably good performance.

When I first saw the announcement for pargo, I thought it could be useful.

When I saw there weren’t any examples, I thought, “I know just the thing!” I had all sorts of things going for me:

  • it is embarrassingly parallel
  • processing is the bottleneck
  • I have written and improved several parallel versions
  • I have tests to make sure it still works after using the package
  • I have benchmarks that will tell me how well it performs
  • It’s a good excuse to update my article, maybe even using go1.9beta2.
  • I saw this neat parallel.Range function that looked like it would do just what I needed.

Then I tried to figure out how to use threshold. Since I missed the comments in the introduction that gave an idea of how to use it, I was left with the descriptions of how threshold was calculated in the comments for parallel.Range and pargo.ComputeEffectiveThreshold. These didn’t help me understand how to use it, so I read the code for both functions. I ignored the seemingly inefficient recursive divide-and-conquer implementation of parallel.Range.

Finally, I wrote a program to try to figure things out. Its output wasn’t helpful for understanding how to use threshold, though it did uncover a bug (now fixed).

I asked questions and got some good answers.

However, I still don’t know why you think threshold is a better way to decide how many parts the range should be split into. Your advice was:

It’s difficult to come up with good recommendations here that work well for all cases, so you have to experiment.

The docs now say to start with a threshold of 4. From the ranges in my example above, that would use 8 goroutines.

Now to experiment a bit. Maybe change the threshold to 3. Oh. It still uses 8 goroutines. Even though the threshold changed, the change had no effect on the split, and so none on performance.

OK, then. I’ll try again. Maybe 5 this time. Now there are twice as many goroutines. For my embarrassingly parallel example with a computational bottleneck, I expect this to drastically reduce performance.

So, the smallest possible change in either direction from the starting recommendation of 4 will either have no effect or double the number of sub-ranges the work is split into.

I have expended time and effort trying to figure out how threshold is better than directly setting the number of subranges/goroutines.

Asking several times hasn’t produced arguments or evidence for threshold. Do you have any?

OK, thanks a lot for your patience. :wink:

I now see that ComputeEffectiveThreshold is probably trying too hard to be clever, and it seems more confusing than helpful. I will simplify the API, so it hopefully becomes less confusing. I will look into this tomorrow. (I still believe there are good cases for determining thresholds like this, but you might as well do this externally, and not through some funky parameters where you then have to hope for the right outcome.)

Again, thanks for the feedback.


I have just committed a new version of the library. I hope this is closer to how you would expect the API to work. More feedback is always welcome. :wink:

Best,
Pascal

Thanks.

Here’s a version of lottoNumbers (the example from my article) using pargo:

func lottoNumbers(n int) [][]int {
	total := 7 * n
	all := make([]int, total)
	list := make([][]int, n)
	seed := time.Now().UnixNano()
	workers := runtime.GOMAXPROCS(-1) // one for each proc

	parallel.Range(0, total, workers, func(low, high int) {
		r := rand.New(rand.NewSource(seed + int64(low)))

		for i := low; i < high; i++ {
			all[i] = r.Intn(49)
		}
	})

	for i := range list {
		list[i] = all[i*7 : i*7+7] // non-overlapping group of 7 per row (all[i:i+7] would overlap)
	}

	return list
}
$ benchstat reduce-allocs.txt pargo.txt
name            old time/op    new time/op    delta
LottoNumbers-2     336ms ± 8%     326ms ±29%     ~     (p=0.114 n=8+9)

name            old alloc/op   new alloc/op   delta
LottoNumbers-2    80.0MB ± 0%    80.0MB ± 0%     ~     (p=0.115 n=10+10)

name            old allocs/op  new allocs/op  delta
LottoNumbers-2      5.00 ± 0%      9.00 ± 0%  +80.00%  (p=0.000 n=9+10)
