Unicode collation using golang.org/x/text/collate


(prataprc) #1

I am implementing a library for data representation.
Refer: https://github.com/bnclabs/gson

One of the features supported by this package is compiling composite
data (JSON is supported) into a binary format that can be sorted using
memcmp.

As part of this binary-collation feature, I need to support string
sorting based on the ICU collation standard. After a bit of googling
I came across this awesome package: golang.org/x/text/collate


I am using Collator.Key() to compile a string into an ICU sort key
that can be used with memcmp.

I wrote a couple of test cases for this and it works fine.

My question is:

After compiling the string value to a binary-comparable sort key, can
I get back the original string value from its sort key? If not, is that a
limitation of the ICU standard or of golang.org/x/text/collate?

Thanks,


(Aman Kishore Achpal) #2

I too have encountered the same issue. One possible solution is to store the original string alongside the sortkey (perhaps separated by a known delimiter, or as a tuple) in the “encode” phase and fetch the original string in the “decode” phase. However, this adds the overhead of extra space.

As you already mentioned, I was unable to find a reverse mapping from sort key to original string. From a quick perusal of the code, the Go text package (golang.org/x/text/collate) implements the collation algorithm in pure Go rather than calling libicu through cgo, and neither it nor ICU provides a reverse mapping function.

Ideally, someone with collation/language expertise can advise us on the best practices to follow in this scenario.


(Aman Kishore Achpal) #3

Following up on my previous comment, here are some relevant links suggesting that you do in fact need to store both the sort key and the original data. There doesn’t seem to be an inverse function that maps a sort key back to the original string.
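
To illustrate why an inverse is unlikely to exist, here is a toy sketch. `toyKey` is a made-up stand-in that only folds case. It is not the real collation algorithm, but real sort keys are lossy in a similar way: canonically equivalent strings get the same key, and at lower strengths case or accent differences are dropped. Once several inputs share one key, no function can recover which input was used.

```go
package main

import (
	"fmt"
	"strings"
)

// toyKey is a deliberately simplified stand-in for a collation key:
// it only folds case. Real sort keys are built from weight tables.
func toyKey(s string) []byte {
	return []byte(strings.ToLower(s))
}

// preimages returns every candidate whose toy key equals key.
// More than one result means the key cannot be inverted uniquely.
func preimages(key []byte, candidates []string) []string {
	var out []string
	for _, c := range candidates {
		if string(toyKey(c)) == string(key) {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	words := []string{"Go", "go", "GO", "gopher"}
	// Three different strings collapse onto the same key.
	fmt.Println(preimages(toyKey("go"), words)) // prints [Go go GO]
}
```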


Thanks,
Aman Achpal

