Poor deque performance vs Rust. Is this solvable?

scotpip · October 7, 2021, 10:22pm

Hi folks. New to Go, and very much liking what I see. I value simplicity and think I would feel at home here.

But I’ve run into an early perormance problem with my project and am hoping for advice.

The use-case

This is an event-based financial backtesting app that will be streaming millions to billions of ticks and bars and needs to keep track of hundreds of indicator calculations through computationally intensive optimisations.

So I need a performant moving window of the most recent x indicator results, where x ranges from around 10 to around 1000.

Benchmark in Rust: 290 ms

Playing with Rust, the standard lib offers the VecDeque structure, with push_back, pop_front and index access to the internal values of the queue.

For 100,000,000 iterations on a VecDeque of 10 ints, this runs @ 290 ms.

Benchmark in Go: 1.2 secs

I don’t see a similar data structure in Go, so I benchmarked a naive approach using the slice:

p := []int{1,1,1,1,1,1,1,1,1,1,}

for i := 0; i < 100_000_000; i++ {
	p = p[1:]
	p = append(p, 10)
}

This runs in 1.2 secs, or a full 4x slower than Rust.

For some reason using a fixed-length slice is slower, @ 1.4 secs.

On the plus side, Go is almost 5x faster than my benchmark in C#!

I tried a couple of specialised deque libraries in Go to get a sense of what is possible, even though they didn’t support indexing to enable me to access internal values.

The 500 star oleiade/lane library took a woeful 14 seconds.

Another more obscure library that claimed to be highly optimised gave me 1.2 seconds - no better than my slice implementation.

Do I really have to settle for a 4x slowdown?

I realised that with a GC language I’d take a performance hit in return for ease of development. But this seems like a low-allocation operation, so I’m surprised at the width of the performance gap.

Do I really have to settle for a 4x slowdown in my app if I build it in Go?

This is literally the first code I’ve written in the language and my understanding of the internals is still hazy, so I’m hoping I’m overlooking something that would give me a speedup.

christophberger · October 8, 2021, 6:00am

Hi @scotpip ,

Welcome to Go and the Go Forum.

You’re probably right, it’s the GC. This…

…repeatedly shortens the slice on one end and extends it on the other end. append() thus allocates new memory more often than not.

A ring buffer certainly works much faster, as it would need zero allocations inside the loop.

scotpip · October 8, 2021, 9:21am

Like you I assumed a ring buffer would be faster, but as I said I tried a couple of libraries, and one was egregiously slow while the other was no better than the slice implementation.

I was hoping that the slice would be working inside of a larger block of allocated memory. Are you saying that it allocates on every append(). That seems very inefficient?

There is surely a better way to do this? Unfortunately I’m a high-level line-of-business style programmer so I don’t really have the background to roll my own algos. Can anyone give me some pointers (pun intended)?

It’s kind of mission-critical for me - if I can’t find a speedup I may have to reconsider moving to Go. And that would be a pity, because I love the philosophy behind the language.

Patricio_Whittingslo · October 8, 2021, 10:55am

I am no CS major, but have you looked at ring package - container/ring - pkg.go.dev ?

Patricio_Whittingslo · October 8, 2021, 11:03am


import (
	"container/ring"
	"testing"
)

const rLen = 10

func BenchmarkRing(b *testing.B) {
	r := ring.New(rLen)
	for i := 0; i < b.N; i++ {
		r.Move(1)
		r.Value = 10
	}
}

func BenchmarkSliceDeque(b *testing.B) {
	p := make([]int, rLen)
	for i := 0; i < b.N; i++ {
		p = p[1:]
		p = append(p, 10)
	}
}

I believe this does the same thing, no?

Results:

goos: linux
goarch: amd64
pkg: silly
cpu: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
BenchmarkRing-4         	751359330	         1.502 ns/op	       0 B/op	       0 allocs/op
BenchmarkSliceDeque-4   	187221034	         6.445 ns/op	      16 B/op	       0 allocs/op
PASS```

scotpip · October 8, 2021, 12:36pm

Patricio

Thanks for the suggestion! I missed the ring package because of the rather unconventional name.

The performance is impressive - matching the Rust VecDeque.

The problem is that unlike the VecDeque like most of these ring structures it doesn’t give access to the internal values, only to the first and last value. As I said in my post, access to internal values is a requirement. I did run a couple of other ring deques, but only to see how much difference it made.

Is there any way to write a similarly performance ring in Go that would enable me to access the other values? D and Nim also have fast VecDeque like structures so it must be a well understood algo.

Alternatively, I have figured out a workaround. It’s crude but effective - I simply allocate all the required memory at the start. Obviously this is profligate with memory, but I’m running on a well provisioned workstation and a proof-of-concept barely registered on the memory monitor. It also gives me pretty efficient access to the internal values.

This runs at just 1.3 slower than Rust, which is a far more acceptable speed differential than the 4x slowdown of my initial attempt.

Simply filling the slice from left to right by index is no faster, and less convenient to use.

I can live with this, but would still appreciate any pointers to a more elegant solution.

Benchmark: 390 ms in Go vs 290 ms in Rust

// To maintain a moving window 10 items long

p := make([]int, 100_000_010)

for i := 0; i < 100_000_000; i++ {
	p[10] = i
	p = p[1:]
}

christophberger · October 8, 2021, 1:51pm

Were any of these implementing a ring buffer?

Why not implementing your own ring buffer? It’s not too complicated and you could build in the features you need.

Edited to add: Sample queue code - Queue/Definition - Rosetta Code

Edited again to add: The code behind the above link is actually a ring buffer that can grow. A fixed-size ring buffer would be even simpler.

No, append() is actually quite smart about allocations. The amount of allocated memory doubles at each allocation, so when a slice grows linearly, the need for new allocations shrinks exponentially. Still, each allocation includes copying over the existing data from the old to the new slice, and I guess this adds up over time.

scotpip · October 8, 2021, 5:02pm

Hi Christoph

Been playing around.

My first idea was to port VecDeque to Go, but it turns out that it’s thousands of lines of code over 8 files, so that’s not going to happen.

Then I dug into the container/ring library, which is a sort of linked list rather than a real ring buffer. Sadly, we were using it wrong - properly used it’s actually rather slow.

Then I played with a couple of other ring buffers that claim to be efficient, and they were slow as well.

So I think I’ll just use my braindead idea of allocating all the memory in advance. Sometimes simple is best, and I can’t see any significant downside - I have memory to spare.

If I wrap it behind an API I’ll be able to forget about the ugliness within and get on with writing my app!

christophberger · October 9, 2021, 8:31am

Oh wow, seems that VecDeque thing is more complicated than it seemed in the first place. After all, those thousands of lines of code must be there for a reason!

I see your point, and yes, if you have the memory to pre-allocate then why not.

skillian · October 10, 2021, 3:19pm

I’m no comp-sci expert, but I wanted to take a stab at this. When you say:

If you have a window of, e.g., 10 elements, and then an 11th element is added, does the size of the collection grow to 11 or does the oldest element get overwritten with the new element so the size remains at 10?

scotpip · October 10, 2021, 5:21pm

Sean

A performant ring buffer with fast access to internal values

I’m glad you asked, because you sparked off an idea.

Most deques and ring buffers seem to use complex algos.

For example, Rust’s VecDeque is around 3k lines of code, at a quick glance - though to be fair it’s a very powerful data structure and much faster than anything I found in the Go standard lib or on Github.

But what if you did the simplest thing possible?

Let’s say you want to keep the last 10 values on a stream of 100,000,000 results, pushing the most recent value onto the tail and popping the oldest value off the head:

The simplest approach would be to set up a 10 slot array, and wrap past the end.

The position of the internal values can be trivially calculated from the position of the tail.

00 01 02 03 04 05 06 07 08 09 | tail = [9]
10 01 02 03 04 05 06 07 08 09 | tail = [0]
10 11 02 03 04 05 06 07 08 09 | tail = [1]

This windowing data structure is the critical performance bottleneck for my financial backtesting app, which will be very data intensive. It will be hit billions of times per run.

This idea is memory efficient and zero allocation, so I though I’d give it a try.

Turns out that it’s north of 2x the speed of the Rust VecDeque, and almost 9x the speed of my initial naive slice solution.

(No idea why this isn’t the common solution for any buffer where you know the length in advance. Perhaps a comp-sci degree tempts you to use your math smarts and over-complicate things?)

It does show that, within reason, design trumps languages and compilers…

Benchmark: 135 ms for 100,000,000 items vs 290 ms for Rust VecDeque

Here’s a simple proof of concept. In an implementation designed for pure functions you’d probably store the length and tail positions in slots at the end of the array so the data would be self-contained.

package main

import (
	"fmt"
	"time"
)

func main() {

	//====================================
	start := time.Now()
	//====================================

	// Array-based ring with the newest
	// value at the tail

	// VERY FAST: 135 ms for 100,000,000 items
	// Rust VecDeque is 290 ms

	const l = 10 // length
	var d[l]int  // data
	t := l -1    // tail

	for i := 0; i < 100_000_027; i++ {

		// Increment tail
		if t == l - 1 {
			t = 0
		} else {
			t++
		}

		// Overwrite oldest value
		d[t] = i
	}

	//====================================
	elapsed := time.Since(start)
	fmt.Printf("Elapsed: %s\n", elapsed)
	//====================================

	fmt.Printf("Data: %d\n", d)
	fmt.Printf("Tail: %d", t)
}

Output:

Elapsed: 132.3204ms
Data: [100000020 100000021 100000022 100000023 100000024 100000025 100000026 100000017 100000018 100000019]
Tail index: 6

system · January 8, 2022, 5:22pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.