Is access to individual bytes in []byte concurrency safe?

I’m trying to understand the limits of atomic access to bytes in a []byte.

I modeled the following in the code below:

  1. I have a data buffer of []byte where the low order nibble of each byte is an owner id obviously ranging from 0 - 15
  2. I have 16 go routines given ids 0 - 15. Imagine each go routine owns the bytes that have a low order nibble == their id.
  1. The routines constantly and concurrently scan all the data looking for their bytes, and when they find one, they change the high order nibble of the byte in the buffer.

Importantly, 15 routines can access the same byte concurrently while one routine can write to the byte, but since one and only one routine will ever write to a given byte, there is no race condition. Just a question of atomic loading and setting, I believe.

It seems to work fine running for a long time on 16+ core AMD x64 hardware (both Ryzen and Threadripper). If this does actually work, it will save us a huge synchronization cost.

Is this actually safe, or is it hardware-dependent, or am I just getting lucky in this test?


package main

import (

func changeMyData(myID byte, data []byte, running *int64, wg *sync.WaitGroup) {
	// each worker will only change bytes that match
	// their id in the low order nibble
	var passNum byte
	for {

		// All workers are concurrently reading all of data, and all workers
		// are concurrently writing to data, but just to the bytes they own.
		for i, b := range data {
			// if id is in the low order nibble, then this byte is owned by this routine.
			if b&0x0F == myID {
				// put passNum in the high order nibble and myID back in the low
				data[i] = passNum<<4 | myID

		// we only check for a stop between complete passes through the data
		// so all all bytes with this id have the same passNum in the high nibble.
		if atomic.LoadInt64(running) == 0 {

func main() {
	// Make the length shorter (min 32) to increse the velocity of changes at every byte.
	// Make longer to have a better chance catching write errors with wrong passNum.
	data := make([]byte, 64)
	var id byte

	for i := range data {
		data[i] = byte(i & 0x0F) // just set the low order nibble to incrementing 0-15

	// turn on the running flag
	running := new(int64)
	*running = 1
	var wg sync.WaitGroup

	for id = 0; id < 0x0F; id++ {
		go changeMyData(id, data, running, &wg)

	// Make this longer to increase the chance of a conflict
	time.Sleep(10 * time.Second)
	atomic.StoreInt64(running, 0)

	// Wait for the workers to finish

	// Now make sure the data is what we expect...

	// make sure the low order nibble is still correct for every byte.
	for bNum, b := range data {
		// expected low order nibble
		id = b & 0x0F
		if byte(bNum&0x0F) != id {
			fmt.Printf("Error: owning ID of bNum %v=%X changed to %X\n", bNum, bNum, id)

	// make sure the high order nibble is the same for every byte owned by a given id.
	// This just means all bytes for an id are the same: same high and same low nibble.
	for id = 0; id < 0x0F; id++ {
		expectedByte := data[int(id)]

		// loop over each byte owned by this id.
		for bNum := int(id); bNum < len(data); bNum += 0x10 {
			if data[bNum] != expectedByte {
				fmt.Printf("Error: byte at bNum=%X got %X but expected %X\n", bNum, data[bNum], expectedByte)

	for bNum, _ := range data {
		fmt.Printf("%3X", bNum)

	fmt.Printf("\n  Value:")
	for _, b := range data {
		fmt.Printf("%3X", b)


I wouldn’t feel comfortable that this is safe. The Go Memory Model - The Go Programming Language

Yeah, I totally hear you. That is why I’m asking the question.

Basically, the question boils down to this:

If a bunch of routines are reading a byte while one routine is writing the byte with no synchronization, will the readers ever see any value other than either a) the old byte or b) the new byte.

If the answer is no, as it appears, this is very good to know.

I can’t give you anything definitive, but I expect you would never see anything other than the old or new value, and I would not be surprised if you lost writes. It would certainly help your situation if there were an api for manipulating bytes in the sync/atomic package. There seems to be 8-bit operations in the underlying runtime asm code, though. Maybe there’s a way that you to call it: src/runtime/internal/atomic/asm_amd64.s - go - Git at Google

ok, cool. Thanks for the help and for the link. I think we are on the same page. You are right, if we had byte atomic, that would address this issue.

Obviously, it would be really nice to be able to use this for some concurrency-safe “lock-free programming”.

I’m curious: What problem are you solving where you’re going to be reading and writing unsynchronized to nibbles in shared memory?

I think this is OK, but I suspect that you actually are paying for a performance hit, but I recommend you measure! Reading and writing single bytes is atomic on x86 platforms, but the underlying memory system uses cache lines as its smallest units moved in and out of main memory; not individual bytes, so I suspect that your []byte slice’s underlying [16]byte array is being “falsely shared.” Depending on Ryzen and Threadripper’s cache layout, that might not be a problem because, for example, they may all have access to the same L3 cache so they can keep modifying the cached memory and not need to “commit” to main memory, so the performance might still be fine (measure!).

If you’d be OK with throwing out a bit more memory for this process, you could try something like this to eliminate all false sharing so that each goroutine owns its own cache line and there’d be no (potential) contention over the cache lines, however, now you’re moving multiple cache in memory where you used to have just one, so that might actually negatively affect your performance (measure!).

1 Like

Hi Sean, Thanks for the concrete tips.

To answer your question: The nibbles were really just an example where I could use the high nibble as test data to see if I was dropping writes.

The actual problem is a discrete optimization problem on a huge collection of data points. More specifically, each whole byte in the vector indicates which one of several searcher threads (<=32) owns a particular row of data. Searchers traverse the data in funky non-linear ways, and for each row they see, they need to know if they own it. If they do, they can do some stuff, then assign it to another owner.

Our particular problem is not big data relative to clustered problems, but the data is big relative to running on one machine. It is thousands of feature columns each with a million float32 data points plus a bunch of related graph, binning, and interconnection data (around 32 Gig total), so we have to be super memory efficient with any structure we add, and obviously, we can’t do anything like reallocating slices or relying on garbage collection for items tied to the data.

I could probably avoid the false sharing if I put the ownership byte in another structure associated with each row, but that would mess up the other structure’s alignment, and take away my use of the GoLang memclear optimization for quickly resetting ownership back to zero across the whole byte slice which happens a lot.

I’m going to check that out and see which one is cheaper overall.

Thanks again for taking the time to help!