Out of curiosity I wrote a program that uses channels to distribute data across different go-routine consumers (pretty much a classical N producers M consumers problem.
I’d love to get some feedback on wether or not this is smart and if I could improve this.
Also interestingly if I run this with
go run -race it will complain because there is the possibility of a data race (which I think should not be the case)
The code can be found here:
(Or if you prefer the vanilla playground here: https://play.golang.org/p/XQEkegTPB-)
The idea here is that there is one
Insert channel that takes a Key/Value pair, and checks a hashmap if the key has already been seen before. If not it will create a new entry in the hashmap with a channel. It will then start a new goroutine that consumes data from that channel and insert the incoming KV-Pair into the consumer channel.
So in my example I have 100 producers, 1 “main” distributor and 5 consumer goroutines. And it works
I am also not 100% confident this is the most efficient way to do this as there is the potential bottleneck of the main distributor goroutine, but it works without any locking which I find neat.
I was thinking about replacing the distributor goroutine with a double-checked mutex that runs in the producer goroutines.
Any feedback or suggestions?
It’s nice code and it’s pretty general, so I am not sure I can give you any suggestion on efficiency. It really depends on how and why you pass data via channels, how you consume the data, how slow the consumers are, how fast the producers are, etc.
Two notes, however:
- There is a data race. Access to a map is not safe. Any addition/removal to a map might reshuffle the buckets (the data structures used transparently inside the map.) Thus, accessing by key as you do in line 28 is unsafe and the race detector correctly report it.
Solution: just pass the channel, not the map key to the function.
- To be precise, your code is not “lock-free” as channels use locks internally. Lock-free structures are structures that only use atomic operations. Remember that the choice between channels and locks is not a performance one but rather a semantic one: channels orchestrate, locks only serialize. But this is nitpicking
Thank you so much for taking the time and looking into it.
Yes the data race I noticed once I was trying to run some simple benchmarks with the program and it did break on the concurrent map access. I am thinking about moving the
All channel to a local variable inside the Observe func to make sure nobody outside the main distribution goroutine can access this - seems like the safer design:
As for the performance, the consumers would be also async. Meaning for each batch of N items I’ll dispatch a new goroutine to process the items without blocking the channel observer goroutine.
Again - thank you very much. I am still trying to grasp what good go code looks like and I was worried I that I am being too clever with this.
Now the next obvious question here would be: How would I test this properly?