Hi all, the Go 1.10 release notes discuss changes to the archive/zip package. Regarding zip file header fields, at one point they say:
In Go 1.10, the Writer now sets the UTF-8 bit only when both the name and the comment field are valid UTF-8 and at least one is non-ASCII. Because non-ASCII encodings very rarely look like valid UTF-8, the new heuristic should be correct nearly all the time.
This makes no logical sense to me. If:
1. The name is valid UTF-8.
2. The comment field is valid UTF-8.
3. At least one of the above is “non-ASCII”.
What does that tell us? Moreover, what if condition 3 is not met, and nothing is “non-ASCII”? AFAIK the Unicode Basic Latin block is isomorphic with ASCII encoding. And in English-speaking countries, non-ASCII/non-Basic Latin characters are rarely used, particularly in names of files, comments, etc.
The modal case would seem to be valid UTF-8 in both fields. What does it mean for one of them to be non-ASCII? How would that make it all UTF-8? That seems like a contradiction. If they’re both UTF-8, then they’re UTF-8. What am I missing here?
It means that if both the file name and the comment are ASCII-only, the UTF-8 bit isn’t set, since the encoding is assumed to actually be ASCII.
If both are valid UTF-8 and at least one contains non-ASCII characters, we assume UTF-8. The heuristic relies on the fact that random binary data, or text in some other encoding, isn’t likely to look like valid UTF-8.
When both the name and the comment are pure ASCII, we don’t set the UTF-8 flag (ASCII bytes decode the same way as UTF-8, so the flag adds nothing).
Name: “hello”, Comment: “世界”
The name is valid UTF-8: true
The comment is valid UTF-8: true
At least one of the above is non-ASCII: true (since “世界” is not ASCII)
Thus, we set the UTF-8 flag.
Name: “invalid\xff”, Comment: “world”
The name is valid UTF-8: false (since “\xff” is invalid UTF-8)
The comment is valid UTF-8: true
At least one of the above is non-ASCII: true (since “\xff” is not ASCII)
Thus, we don’t set the UTF-8 flag.
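The cases above can be checked with a small sketch of the rule as the release notes state it. Note that shouldSetUTF8Flag and isASCII are hypothetical helpers written for this post, not the functions archive/zip uses internally, and the real implementation may differ in its details (for example, in which bytes it treats as safely ASCII).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isASCII reports whether s contains only bytes below 0x80.
func isASCII(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] >= utf8.RuneSelf {
			return false
		}
	}
	return true
}

// shouldSetUTF8Flag is a hypothetical helper that mirrors the rule from the
// release notes: both fields must be valid UTF-8, and at least one of them
// must contain a non-ASCII byte.
func shouldSetUTF8Flag(name, comment string) bool {
	bothValid := utf8.ValidString(name) && utf8.ValidString(comment)
	anyNonASCII := !isASCII(name) || !isASCII(comment)
	return bothValid && anyNonASCII
}

func main() {
	fmt.Println(shouldSetUTF8Flag("hello", "world"))       // false: pure ASCII
	fmt.Println(shouldSetUTF8Flag("hello", "世界"))          // true: valid UTF-8, non-ASCII
	fmt.Println(shouldSetUTF8Flag("invalid\xff", "world")) // false: name is not valid UTF-8
}
```

The third case is the one the heuristic exists for: the name has a non-ASCII byte, but it is not valid UTF-8, so claiming UTF-8 would be wrong.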
The reason for this complex logic is that the ZIP format is a complete mess. If the UTF-8 flag is cleared, the encoding is completely unknown (possibly Shift-JIS?). Even worse, not all ZIP readers can handle the UTF-8 flag, which is also why we avoid setting it eagerly.
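To see what the Writer actually does, one option is to write an entry to an in-memory archive, read it back, and inspect bit 11 (0x800) of the general-purpose flags, which is the UTF-8 bit. This is a rough sketch assuming Go 1.10 or later; writtenWithUTF8Flag and utf8FlagBit are names made up for this example.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
)

// utf8FlagBit is general-purpose bit 11, which marks the name and comment
// as UTF-8 encoded.
const utf8FlagBit = 0x800

// writtenWithUTF8Flag is a hypothetical helper: it writes a single empty
// entry with the given name and comment, reads the archive back, and
// reports whether the UTF-8 bit ended up set on that entry.
func writtenWithUTF8Flag(name, comment string) (bool, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	if _, err := zw.CreateHeader(&zip.FileHeader{Name: name, Comment: comment}); err != nil {
		return false, err
	}
	if err := zw.Close(); err != nil {
		return false, err
	}

	zr, err := zip.NewReader(bytes.NewReader(buf.Bytes()), int64(buf.Len()))
	if err != nil {
		return false, err
	}
	return zr.File[0].Flags&utf8FlagBit != 0, nil
}

func main() {
	cases := [][2]string{{"hello", "world"}, {"hello", "世界"}}
	for _, c := range cases {
		set, err := writtenWithUTF8Flag(c[0], c[1])
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("name=%q comment=%q utf8 flag set: %v\n", c[0], c[1], set)
	}
}
```

With a Go 1.10+ toolchain, the ASCII-only entry should come back with the bit cleared and the entry with the “世界” comment should come back with it set.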