Algorithm: map[string][]string -> reversed map[string][]string

telo_tade · October 5, 2023, 1:05pm

You have a dataset that represents the relationships between dogs and cats. The dataset is structured as a map, where each dog is associated with a list of cat names. The cat names for a given dog are unique. Overall, all cat names and dog names are unique, although 2 dogs may belong the same cat, and two cats may belong to the same dog.

Problem: Your goal is to generate a new dataset that reverses the relationships, where each cat is associated with a list of unique dog names.

How would you solve this efficiently ? What do you do with large data ? Does Go offer any support/library data structures for this scenario ? What is the idiomatic way to approach this problem ?

PS: I’ve already experimented with the “iterate through the input map, for each value add it to the new map, at its corresponding key, if not already there”. Looking for something more efficient than this. Similarly, the “use an auxiliary map to check if a dog already belongs to this cat”.

mje · October 5, 2023, 3:46pm

  dogscats := make(map[string][]string)  // map each dog to a catlist.
  for cat, doglist := range catsdogs {
    for _, dog := range doglist {
       dogscats[dog] = append(dogscats[dog], cat)
    }
  }

For many cats per dog, you may be reallocating catlists. If that’s a performance problem,
loop through catsdogs once to count the number of cats per dog so you can allocate the catlists to the correct initial size. Another performance problem may come from growing the map. You can count the unique dogs to pre-allocate the size of dogscats, but doing that you’ll need a set of dogs (as a map) that would need to grow similarly to dogscats, so it may not be worthwhile.

duckduck · October 5, 2023, 3:56pm

Just for fun, here’s @mje 's algorithm in a runnable environment: https://goplay.tools/snippet/6UoNk3saTo8

I’m curious what more efficient stuff people come up with. Maybe even something with memory arena’s or what not?

telo_tade · October 5, 2023, 4:16pm

in case there are multiple duplicates, have an auxiliary data structure, dogCats is a map[string]map[string]interface{}. After you build this auxiliary data structure, you build your result from it.

Yet it is highly dependent on how many duplicates, you allocate more memory than needed, etc.

duckduck · October 5, 2023, 4:39pm

Here’s another version but this time with the dummy data initialised in stead of built with append: https://goplay.tools/snippet/r5lLSEkJTRW

Underlying data structure is map[string][]string.

duckduck · October 6, 2023, 2:27pm

The most efficient solution is probably encoding the dog-cats relationship in a [][]bool-matrix. No need to alter the matrix since you can just iterate over it with columns and rows switched (in effect transposing the matrix).

Running example:

 dogs := [3]string{"dog1", "dog2", "dog3"}
cats := [4]string{"cat1", "cat2", "cat3", "cat4"}
                                                                           
dogm := [3][4]bool{
	[4]bool{true, true, true, true},    // dog1 has cat1, cat2, cat3, cat4
	[4]bool{true, false, true, false},  // dog2 has cat1, cat3
	[4]bool{true, false, false, false}, // dog3 has cat1
}
                                                                           
// print source data
printmatrix(dogs[:], cats[:], dogm)

Currently I’m still trying to write a (perhaps generic) printmatrix-function in stead of the 2 separate ones that avoids copying the matrix (with hard-coded dimensions) to slices.

telo_tade · October 6, 2023, 2:36pm

hi @duckduck

Your idea is a clear example of thinking out of the box, congratulations to you sir. Having said that, I do not think I can use it: if you have N millions of dogs and M millions of cats, or even more, you need to consider the costs of storing a N*M bool matrix in memory, in this case it seems too much.

Not directly related, if a dog entity is just a simple small string, like “dog1”, the difference between storing a bool or the string itself, is not big enough to merit memoization. A bool is a byte, while a string is something like 3 ints (one of them being a pointer to the immutable rune slice itself).

Keep coming with innovative ideas, I am sure this problem can be solved more efficiently than a pair of maps.

mje · October 9, 2023, 3:14pm

The dog/cat example is likely to very sparse. If your real problem is also very sparse, consider a sparse matrix implementation like CSR. Sparse matrix implementations are optimized for (relatively) fast matrix/matrix operations, so you’ll pay a runtime penalty for random element access.

You can also reduce space by using an int64 to hold 64 matrix elements, but this could work against, and certainly complicate, a sparse implementation.

system · January 7, 2024, 3:15pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.