unicode.SimpleFold implementation in Go 1.7


(Kilyo) #1

I have a question about the unicode code currently showing for Go 1.7, and how it relates to https://github.com/golang/go/issues/13288

https://tip.golang.org/src/unicode/letter.go?s=9122:9150#L335

Why does this check use len() of the ASCII table?

Line 252, same file, just uses the const MaxASCII.

Is there a reason I am missing that would make the len() call preferred?


(Kilyo) #2

Another question, related to the SimpleFold implementation.
Why is
// SimpleFold('K') = 'k'
// SimpleFold('k') = '\u212A' (Kelvin symbol, K)
// SimpleFold('\u212A') = 'K'

considered valid? Per the spec referenced in https://tip.golang.org/src/unicode/tables.go#L6
K => k
and
\u212A => k
There’s no valid fold from k to \u212A.


(Kilyo) #3

I found the following checkin after digging through the multiple branch histories:

https://codereview.appspot.com/4571074
It appears that SimpleFold has always performed this rather odd conversion, but I am unable to find any reasoning for why it is considered valid.
ASCII characters should not convert to non-ASCII characters. According to the W3C (https://www.w3.org/International/wiki/Case_folding), folding for the purposes of comparison should be a two-step process: normalize, then case-fold. SimpleFold appears to shortcut that but, based on my understanding of the Unicode spec, doesn’t do so correctly. I believe this is a bug and, unless someone can explain what I am overlooking or point to similar functionality in another language, I will report it as such.


(Kilyo) #4

So I found the comments in main that indicate that SimpleFold is supposed to be cyclical.

Back to my original issue, why the len()?
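
For anyone following along, the cyclical behaviour can be seen by walking the fold orbit directly. This is only a minimal sketch: `foldOrbit` is a hypothetical helper written for this thread, not part of the unicode package, and it relies on SimpleFold eventually returning to the starting rune, as its documentation describes.

```go
package main

import (
	"fmt"
	"unicode"
)

// foldOrbit is a hypothetical helper: it collects every rune reached from r
// by repeatedly applying unicode.SimpleFold, stopping once the walk
// returns to r.
func foldOrbit(r rune) []rune {
	orbit := []rune{r}
	for next := unicode.SimpleFold(r); next != r; next = unicode.SimpleFold(next) {
		orbit = append(orbit, next)
	}
	return orbit
}

func main() {
	// Walk the orbit that starts at ASCII 'K'.
	for _, r := range foldOrbit('K') {
		fmt.Printf("%U %q\n", r, r)
	}
}
```

Read this way, SimpleFold does not map a rune to its case-folded form; it steps to the next member of the set of runes that are equal under simple case folding, which is why 'k' can step to \u212A.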


(Jakob Borg) #5

It looks like a shortcut via a table of foldings for ASCII characters, as that’s probably a quite common case?


(Kilyo) #6

I get that it’s a shortcut. The question is more along the lines of: if the shortcut is supposed to be for ASCII, why have the table at all? If it’s because of the Kelvin fold, why is the Kelvin fold in that table? Why isn’t the English long ‘s’ in that table too, since it decomposes to an ASCII char? I am just trying to understand why the table is so inconsistent, and why it doesn’t just use the const that’s declared earlier.
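
One way to check the long-s question empirically is to list every ASCII rune whose SimpleFold result lands outside ASCII. This is a minimal sketch; `nonASCIIFolds` is a hypothetical helper, and it only observes the exported unicode.SimpleFold behaviour rather than reading the internal table.

```go
package main

import (
	"fmt"
	"unicode"
)

// nonASCIIFolds is a hypothetical helper: for each ASCII rune whose
// unicode.SimpleFold result lies outside ASCII, it records that result.
func nonASCIIFolds() map[rune]rune {
	out := make(map[rune]rune)
	for r := rune(0); r <= unicode.MaxASCII; r++ {
		if next := unicode.SimpleFold(r); next > unicode.MaxASCII {
			out[r] = next
		}
	}
	return out
}

func main() {
	for r, next := range nonASCIIFolds() {
		fmt.Printf("%q -> %U\n", r, next)
	}
}
```

As far as I can tell, 's' shows up here next to 'k': the long s (U+017F) folds to 's' in the Unicode case folding data, so it sits in the same orbit as 'S' and 's' and gets the same treatment as the Kelvin sign.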


(Jakob Borg) #7

You were asking “why the len()”. :slight_smile:

Why the ASCII folding table looks exactly like it does I can’t answer. I would hope that the answer is that it is exactly as the Unicode spec requires.

