Possible error in 1.10 archive/zip release notes

Hi all – The golang 1.10 release notes discuss changes to the archive/zip library. Regarding zip file header fields, at one point it says:

In Go 1.10, the Writer now sets the UTF-8 bit only when both the name and the comment field are valid UTF-8 and at least one is non-ASCII. Because non-ASCII encodings very rarely look like valid UTF-8, the new heuristic should be correct nearly all the time.

This makes no logical sense to me. If:

  1. The name is valid UTF-8.
  2. The comment field is valid UTF-8.
  3. At least one of the above is “non-ASCII”.

What does that tell us? Moreover, what if condition 3 is not met, and nothing is “non-ASCII”? AFAIK the Unicode Basic Latin block is isomorphic with ASCII encoding. And in English-speaking countries, non-ASCII/non-Basic Latin characters are rarely used, particularly in names of files, comments, etc.

The modal case would seem to be valid UTF-8 in both fields. What does it mean for one of them to be non-ASCII? How would that make it all UTF-8? That seems like a contradiction. If they’re both UTF-8, then they’re UTF-8. What am I missing here?

It means that if both the file name and the comment are ASCII-only, the UTF-8 bit isn’t set, since the encoding is assumed to actually be ASCII.

If one or both look like actual UTF-8, we assume UTF-8, with the heuristic being based on the fact that random binary isn’t likely to look like valid UTF-8.

(I made the change in archive/zip)

What Jakob said is correct.

Here are several examples:

  • Name: “hello”, Comment: “World”.
    • The name is valid UTF-8: true
    • The comment is valid UTF-8: true
    • At least one of the above is non-ASCII: false
    • Thus, we don’t set the UTF-8 flag (since it is compatible with ASCII).
  • Name: “hello”, Comment: “世界”
    • The name is valid UTF-8: true
    • The comment is valid UTF-8: true
    • At least one of the above is non-ASCII: true (since “世界” is not ASCII)
    • Thus, we set the UTF-8 flag.
  • Name: “invalid\xff”, Comment: “world”
    • The name is valid UTF-8: false (since “\xff” is invalid UTF-8)
    • The comment is valid UTF-8: true
    • At least one of the above is non-ASCII: true (since “\xff” is not ASCII)
    • Thus, we don’t set the UTF-8 flag.
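The three cases above can be sketched in a few lines of Go. This is a simplified illustration of the rule as described in this thread, not the actual archive/zip code; `shouldSetUTF8Flag` and `isASCII` are hypothetical helper names, and the real implementation may differ in its details.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isASCII reports whether s contains only bytes below 0x80.
func isASCII(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] >= utf8.RuneSelf {
			return false
		}
	}
	return true
}

// shouldSetUTF8Flag is a hypothetical helper (not part of archive/zip).
// It returns true only when both fields are valid UTF-8 and at least
// one of them contains a non-ASCII byte.
func shouldSetUTF8Flag(name, comment string) bool {
	bothValid := utf8.ValidString(name) && utf8.ValidString(comment)
	anyNonASCII := !isASCII(name) || !isASCII(comment)
	return bothValid && anyNonASCII
}

func main() {
	fmt.Println(shouldSetUTF8Flag("hello", "World"))       // false
	fmt.Println(shouldSetUTF8Flag("hello", "世界"))          // true
	fmt.Println(shouldSetUTF8Flag("invalid\xff", "world")) // false
}
```

When both fields are pure ASCII, the bytes are the same in ASCII and in UTF-8, so leaving the flag unset loses nothing.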

The reason for this complex logic is that the ZIP format is a complete mess. If the UTF-8 flag is cleared, then the encoding is completely unknown (possibly Shift-JIS?). Even worse, not all ZIP readers can handle the UTF-8 flag, which is also why we avoid setting it eagerly.
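To see what the Writer actually does on your Go version, you can write an entry to an in-memory archive and read its flags back. This is a rough sketch: `entryFlags` and the `utf8Flag` constant are names made up for illustration, and it assumes the UTF-8 flag is general purpose bit 11 (0x800) as stored in `FileHeader.Flags`.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
)

// utf8Flag is general purpose bit 11, which marks the name and
// comment as UTF-8. The constant name is ours, not archive/zip's.
const utf8Flag = 0x800

// entryFlags is a hypothetical helper: it writes a single empty entry
// to an in-memory archive and returns the flags read back from it.
func entryFlags(name, comment string) (uint16, error) {
	var buf bytes.Buffer
	w := zip.NewWriter(&buf)
	if _, err := w.CreateHeader(&zip.FileHeader{Name: name, Comment: comment}); err != nil {
		return 0, err
	}
	if err := w.Close(); err != nil {
		return 0, err
	}
	r, err := zip.NewReader(bytes.NewReader(buf.Bytes()), int64(buf.Len()))
	if err != nil {
		return 0, err
	}
	return r.File[0].Flags, nil
}

func main() {
	cases := [][2]string{
		{"hello", "World"},
		{"hello", "世界"},
		{"invalid\xff", "world"},
	}
	for _, c := range cases {
		flags, err := entryFlags(c[0], c[1])
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%q / %q: UTF-8 flag set = %v\n", c[0], c[1], flags&utf8Flag != 0)
	}
}
```

On Go 1.10 or later, I would expect only the second case to report the flag as set, matching the examples above.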

See https://go-review.googlesource.com/75591 and https://go-review.googlesource.com/72792 for more background.
