Hash code for an array

I am looking for help with generating a hash key for a []int. Every time I get a []int, I need to do a lookup in the DB to see whether this list already exists. If it exists, I will return the index of the list; otherwise I will create a new index for it. In simple terms, how do we build a key for a mutable list so it can be stored in a Go map? Any help is appreciated.

Since a hash function takes a string as input (or rather, a []byte, since the hash functions in crypto expect []byte), you need to write a function that creates an unambiguous string representation of the list. (For example, the two lists {1, 23} and {12, 3} must not have the same string representation.) Then run the hash function on the string to get the hash.
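
As an illustration of the ambiguity problem, here is a minimal sketch of such a representation; the helper name joinInts is made up for this example, and it assumes the strconv and strings packages are imported:

	// Sketch only: build an unambiguous string representation by
	// separating the elements with a delimiter that cannot appear
	// inside a number.
	func joinInts(list []int) string {
		parts := make([]string, len(list))
		for i, v := range list {
			parts[i] = strconv.Itoa(v)
		}
		return strings.Join(parts, ",")
	}

	// joinInts([]int{1, 23}) == "1,23"
	// joinInts([]int{12, 3}) == "12,3"
	// Plain concatenation without a delimiter would yield "123" for both.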


Hi Christopher,

Very helpful tip. I got what I wanted:

func computeHashForList(list []int, delim string) [32]byte {
	var buffer bytes.Buffer
	for _, v := range list {
		buffer.WriteString(strconv.Itoa(v))
		buffer.WriteString(delim)
	}
	return sha256.Sum256(buffer.Bytes())
}

I wonder: if we are able to produce an unambiguous string to feed to Sum256, wouldn't it be simpler to just use that unambiguous string as the map key? Will the key size play a role in the performance of the map lookup? The above function returns 32 bytes, whereas an unambiguous string used as a key can be of arbitrary size.

I would guess so, although this depends on how long the strings can become and on how many lists you need to store in the map. Also consider the additional space required if all lists are kept as plain, uncompressed string representations.

On the other hand, the downside of hashes is that they can have collisions (that is, the same hash is generated for two different lists), although these should be very, very rare. Your code would need to check for that situation and resolve the collision somehow.
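
A minimal sketch of such a check, assuming the map stores the original list alongside its index; the entry type and the sameList helper are made up for this example:

	// Sketch only: keep the original list next to the index so a
	// colliding hash can be detected by comparing the lists.
	type entry struct {
		list  []int
		index int
	}

	func sameList(a, b []int) bool {
		if len(a) != len(b) {
			return false
		}
		for i := range a {
			if a[i] != b[i] {
				return false
			}
		}
		return true
	}

	// When storing under a hash key:
	//	if old, ok := m[key]; ok && !sameList(old.list, list) {
	//		// same hash, different list: a real collision, handle it here
	//	}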

A middle way between using the string representations directly and using hashes could be to compress the string representations (The stdlib has some packages for that).
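
For example, a sketch using compress/zlib, with the compressed bytes used directly as the map key; whether compression pays off for strings of this size is something to measure:

	// Sketch only: compress the unambiguous string and use the
	// compressed bytes (as a string) as the map key.
	func compressedKey(s string) string {
		var buf bytes.Buffer
		w := zlib.NewWriter(&buf)
		w.Write([]byte(s)) // error ignored for brevity
		w.Close()          // Close flushes the remaining compressed data
		return buf.String()
	}

Note that for short inputs the zlib framing overhead may make the key longer than the original string.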

I think a healthy approach is to follow the advice “implement now, optimize later”.


Thank you for the insights. I foresee the strings growing up to 512 bytes in length and the number of lists growing up to 1024. You brought up the point that SHA-256 hashes can collide, however unlikely that is, but I wonder: if I can produce a unique string and hash it down to 32 bytes, can there still be collisions? I will test this out and get a better understanding of maps and hashing. Thank you for your advice.

func computeHashKeyForList(list []int, delim string) [32]byte {
	var buffer bytes.Buffer
	sort.Ints(list) // note: this sorts the caller's slice in place
	for _, v := range list {
		buffer.WriteString(strconv.Itoa(v))
		buffer.WriteString(delim)
	}
	return sha256.Sum256(buffer.Bytes())
}

func main() {
	A := []int{1, 23}
	B := []int{12, 3}
	X := make(map[string][]int)
	// A non-digit delimiter such as "," would keep the encoding unambiguous.
	C := computeHashKeyForList(A, "0")
	D := computeHashKeyForList(B, "0")
	X[string(C[:])] = A
	X[string(D[:])] = B
}
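
To tie this back to the original goal of returning an existing index or creating a new one, here is a minimal sketch built on top of computeHashKeyForList; the indexes map and nextIndex counter are assumptions made for this example:

	// Sketch only: return the existing index for a list, or assign
	// the next free one if the list has not been seen before.
	var (
		indexes   = make(map[[32]byte]int)
		nextIndex int
	)

	func indexFor(list []int, delim string) int {
		key := computeHashKeyForList(list, delim)
		if idx, ok := indexes[key]; ok {
			return idx // list already known
		}
		idx := nextIndex
		indexes[key] = idx
		nextIndex++
		return idx
	}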

Consider that the number of different strings of up to 512 bytes far outweighs the number of different 32-byte hash values. So the risk of collisions does exist, although it is very low. If two strings produce the same hash value, chances are that one of them contains just random bytes and would therefore never be generated by your list-to-string conversion code.

Note that finding SHA-256 collisions is something that would make you a world-famous mathematician overnight. People are working toward that end. It's not likely you'll do it by mistake while hashing short strings of integers. :wink:

Given all the collision concerns with this approach, is there a different way to attack the problem? I was thinking about using a bitmap and setting one bit per integer to produce unique bit strings as keys, but the memory a bitmap would need per list outweighs any performance I would gain from this approach.

Now this is turning into an algorithm question :slight_smile:

Another approach I am considering is to compress the unique string into a []byte key and use a Patricia trie to limit the lookup to the size of the prefix. Some fundamental questions remain the same, though; I can't prove that my unique 512-byte strings are indeed unique.

You’re applying a level of worry to this that the rest of the world doesn’t apply to TLS security, online banking, cryptocoins, data deduplication, you name it. What is it you’re doing exactly?

I just need a deterministic way to do fast lookups for the integer slice. These integers fall in the 32-bit range and the list can be of arbitrary length. They are nothing but oiflists (outgoing interface lists) for multicast routes. I am just thinking about scale and about not losing the oiflists I maintain in the cache for fear of overwriting them, which can happen on collisions.

You don't need to worry about SHA-256 collisions.

