Theoretical Question about runes and 5/6 byte Unicode

If by some strange happenstance I encountered an old 5- or 6-byte Unicode string and loaded it into a rune variable, would Go massage that into its 4-byte (int32) format, or would I have a problem? I don’t expect this to happen, hence the word “theoretical” in the title, but it would be a good thing to know just in case the unlikely happened.

32 bits are enough to represent the full Unicode range, and in UTF-8 encoding there shouldn’t be any byte sequence representing a single code point that is longer than 4 bytes.
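A quick sketch with the standard library illustrates the first point; rune is just an alias for int32:

package main

import (
	"fmt"
	"unicode"
)

func main() {
	// rune is an alias for int32, which easily holds the largest code point.
	fmt.Printf("%T\n", 'a')                      // int32
	fmt.Println(unicode.MaxRune == '\U0010FFFF') // true
	fmt.Println(int32(unicode.MaxRune))          // 1114111, well within int32 range
}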

Go expects valid (well-formed) UTF-8 encoding. Otherwise, it is an error. Unicode code points fit in 21 bits: valid values are 0x0 to 0xD7FF and 0xE000 to 0x10FFFF, inclusive. Therefore the maximum length of a valid UTF-8 encoding is 4 bytes.
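Those limits are easy to confirm with the unicode/utf8 package (a quick sketch):

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// The surrogate range and anything above 0x10FFFF are rejected.
	fmt.Println(utf8.ValidRune(0xD7FF))   // true
	fmt.Println(utf8.ValidRune(0xD800))   // false (surrogate half)
	fmt.Println(utf8.ValidRune(0x10FFFF)) // true
	fmt.Println(utf8.ValidRune(0x110000)) // false (out of range)

	// A well-formed encoding therefore never needs more than UTFMax bytes.
	fmt.Println(utf8.UTFMax) // 4
}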

Unicode FAQ: What is the definition of UTF-8?

UTF-8 is the byte-oriented encoding form of Unicode. For details of its definition, see Section 2.5, Encoding Forms and Section 3.9, Unicode Encoding Forms in The Unicode Standard. See, in particular, Table 3-6 UTF-8 Bit Distribution and Table 3-7 Well-formed UTF-8 Byte Sequences, which give succinct summaries of the encoding form. Make sure you refer to the latest version of the Unicode Standard, as the Unicode Technical Committee has tightened the definition of UTF-8 over time to more strictly enforce unique sequences and to prohibit encoding of certain invalid characters.

In Go:

Package utf8

import "unicode/utf8"

const (
	RuneError = '\uFFFD'     // the "error" Rune or "Unicode replacement character"
	MaxRune   = '\U0010FFFF' // Maximum valid Unicode code point.
	UTFMax    = 4            // maximum number of bytes of a UTF-8 encoded Unicode character.
)

func DecodeRuneInString

func DecodeRuneInString(s string) (r rune, size int)

DecodeRuneInString is like DecodeRune but its input is a string. If s is empty it returns (RuneError, 0). Otherwise, if the encoding is invalid, it returns (RuneError, 1). Both are impossible results for correct, non-empty UTF-8.

An encoding is invalid if it is incorrect UTF-8, encodes a rune that is out of range, or is not the shortest possible UTF-8 encoding for the value. No other validation is performed.

For example,

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// invalid (too long, 5 bytes) UTF-8 style encoding
	b := []byte{0b11111010, 0b10101010, 0b10101010, 0b10101010, 0b10101010}
	fmt.Printf("%b\n", b)
	s := string(b)
	fmt.Printf("%s\n", s)
	for _, r := range s {
		fmt.Println(r, string(r), r == utf8.RuneError)
	}

	// invalid (exceeds maximum) code point
	r := rune(utf8.MaxRune + 42)
	buf := make([]byte, 8)
	n := utf8.EncodeRune(buf[:cap(buf)], r)
	buf = buf[:n]
	fmt.Println(buf, string(buf), string(buf) == string(utf8.RuneError))
}
Output:

[11111010 10101010 10101010 10101010 10101010]
�����
65533 � true
65533 � true
65533 � true
65533 � true
65533 � true
[239 191 189] � true
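For completeness, calling DecodeRuneInString directly on the same invalid 5-byte sequence returns the documented (RuneError, 1) for each byte, so the decoder resynchronizes one byte at a time (a minimal sketch):

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// The same invalid 5-byte sequence as above.
	s := string([]byte{0b11111010, 0b10101010, 0b10101010, 0b10101010, 0b10101010})
	for len(s) > 0 {
		r, size := utf8.DecodeRuneInString(s)
		fmt.Println(r, size, r == utf8.RuneError) // 65533 1 true, five times
		s = s[size:]
	}
}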

The latest versions of Unicode limit UTF-8 to 4 bytes, but that wasn’t always the case; an earlier standard allowed sequences of up to 6 bytes. The question was a theoretical one about what would happen if, by some strange happenstance, Go encountered some legacy source that contained the old 5- or 6-byte format and just happened to use that many bytes for a character at some point in a string.

I fully agree that this is very unlikely, hence the “theoretical”, but it is not impossible, and I don’t think you can call an old standard malformed; it would just be an old standard, not improperly formed. Unless you intend the full statement to be “improperly formed by the most current standard”, in which case yes, that would be true, but it still doesn’t answer the question of “What would happen if…?”

I read your question very carefully. I instantly recognized it as a placemat encoding question (the original FSS-UTF design that Ken Thompson and Rob Pike sketched out on a placemat, which allowed sequences of up to 6 bytes). I learnt the full UTF-8 history from Rob Pike and the Unicode Consortium.

I answered your “What if” question. It’s not going to work. Unicode UTF-8 encoding is a subset of placemat encoding. Rob Pike wrote the Go utf8 and unicode standard packages. Go conforms to current standards. The most recent Go implementation of the Unicode standard is Unicode 12.0.

Write your own placemat decoder.
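If such legacy data ever did turn up, a small hand-rolled decoder would do. The sketch below is one hypothetical take, assuming the original FSS-UTF / RFC 2279 bit layout (lead bytes 111110xx and 1111110x for the 5- and 6-byte forms); decodeLegacyRune is an illustrative helper, not part of any standard package.

package main

import "fmt"

// decodeLegacyRune is a hypothetical, minimal decoder that also accepts the
// obsolete 5- and 6-byte forms from the original FSS-UTF / RFC 2279 design
// (lead bytes 111110xx and 1111110x). It returns the decoded value and the
// number of bytes consumed, or (-1, 1) on malformed input. It deliberately
// skips the overlong-encoding and surrogate checks a real decoder would need.
func decodeLegacyRune(b []byte) (r rune, size int) {
	if len(b) == 0 {
		return -1, 0
	}
	c := b[0]
	var n int
	switch {
	case c < 0x80: // 0xxxxxxx: ASCII
		return rune(c), 1
	case c&0xE0 == 0xC0: // 110xxxxx: 2-byte form
		r, n = rune(c&0x1F), 2
	case c&0xF0 == 0xE0: // 1110xxxx: 3-byte form
		r, n = rune(c&0x0F), 3
	case c&0xF8 == 0xF0: // 11110xxx: 4-byte form
		r, n = rune(c&0x07), 4
	case c&0xFC == 0xF8: // 111110xx: legacy 5-byte form
		r, n = rune(c&0x03), 5
	case c&0xFE == 0xFC: // 1111110x: legacy 6-byte form
		r, n = rune(c&0x01), 6
	default:
		return -1, 1
	}
	if len(b) < n {
		return -1, 1
	}
	for _, cc := range b[1:n] {
		if cc&0xC0 != 0x80 { // every trailing byte must be 10xxxxxx
			return -1, 1
		}
		r = r<<6 | rune(cc&0x3F)
	}
	return r, n
}

func main() {
	// The invalid 5-byte sequence from the example above.
	b := []byte{0b11111010, 0b10101010, 0b10101010, 0b10101010, 0b10101010}
	r, n := decodeLegacyRune(b)
	fmt.Printf("U+%X, %d bytes\n", r, n) // U+2AAAAAA, 5 bytes
}

Running it on the 5-byte sequence from the earlier example prints U+2AAAAAA, 5 bytes, a value far outside today’s Unicode range but still comfortably inside an int32.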

The history of UTF-8

Fortunately it’s not a problem I have encountered, so I don’t have to write anything just yet, but if the situation arises I’ll know exactly where things stand and what needs to be done.

Thanks!
