I’m trying to understand Unicode handling in Go. The following code has a string with a single character. Converting the string to a byte slice and printing it shows that this character is encoded using two bytes. I then range over the string and print each rune as a number, as a character, and with its type.
package main

import (
	"fmt"
)

func main() {
	s := "Ã"
	b := []byte(s)
	fmt.Println(b)
	for _, r := range s {
		fmt.Printf("%d=%c %T", r, r, r)
	}
	fmt.Println()
}
This prints:
[195 131]
195=Ã int32
Printing using %c shows the correct character.
%d, however, shows only the first byte (195). What happened to 131? Shouldn’t that be in r as well? r is of type int32. I thought the whole point of a rune being int32 was that it could contain all the bytes for the character.
And, if r only contains 195, why does %c know how to print the correct character?
à is an unfortunate character to use to explain what’s happening. If you use any other character, it’s easier to see what’s going on, for example, ã: Go Playground - The Go Programming Language, which produces the following output:
[195 163]
227
227=ã int32
What’s happening is that the first byte, 195, is the first byte of the 2-byte UTF-8 encoding of Unicode code point 227 (see UTF-8 on Wikipedia to see how the bits work). The rune r holds the decoded code point, 227, not the encoded bytes.
It just so happens that Ã is code point 195, which has the same value as the first byte of its UTF-8 encoding, so it looks like you’re only getting the first byte of the 2-byte UTF-8 encoded rune!
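To see the coincidence side by side, here is a small sketch that prints the UTF-8 bytes and the decoded code points for both characters. The helper name describe is made up for this illustration, and the strings are written as \u escapes so the example doesn't depend on how the source file is encoded.

```go
package main

import "fmt"

// describe is a hypothetical helper: it returns the UTF-8 bytes of s
// and the code points that a for/range loop decodes from it.
func describe(s string) ([]byte, []rune) {
	var runes []rune
	for _, r := range s {
		runes = append(runes, r)
	}
	return []byte(s), runes
}

func main() {
	// "\u00c3" is Ã and "\u00e3" is ã.
	for _, s := range []string{"\u00c3", "\u00e3"} {
		b, rs := describe(s)
		fmt.Printf("%q bytes=%v code points=%v\n", s, b, rs)
	}
}
```

For Ã this should report bytes [195 131] and code point 195; for ã, bytes [195 163] and code point 227. Only in the first case does the code point happen to equal the first byte.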
The bytes in the UTF-8 representation of a string are not the same as the bytes of runes.
rune is a 4-byte entity (alias for int32).
Each rune returned by the for/range loop corresponds to 1, 2, 3 or 4 bytes of the string, according to the values of certain bits of each byte.
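As a rough illustration of that, the sketch below ranges over a string whose characters need 1, 2, 3 and 4 bytes and prints the byte offset and encoded width of each rune. The sample string and the helper name runeWidths are my own choices, not something from the original question; utf8.RuneLen is from the standard unicode/utf8 package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeWidths is a hypothetical helper: for each rune that range decodes
// from s, it records how many bytes that rune occupies in UTF-8.
func runeWidths(s string) []int {
	var widths []int
	for _, r := range s {
		widths = append(widths, utf8.RuneLen(r))
	}
	return widths
}

func main() {
	// 'a' (U+0061), 'Ã' (U+00C3), '€' (U+20AC), and U+1F600.
	s := "a\u00c3\u20ac\U0001F600"

	// The index yielded by range is the byte offset where each rune starts,
	// so it jumps by the width of the previous rune.
	for i, r := range s {
		fmt.Printf("offset %d: %U uses %d byte(s)\n", i, r, utf8.RuneLen(r))
	}
	fmt.Println("bytes:", len(s), "runes:", utf8.RuneCountInString(s))
	fmt.Println("widths:", runeWidths(s))
}
```

Note that len(s) counts bytes, not characters, which is why it differs from the rune count here.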
The Unicode code point for Ã is 0xC3 (decimal 195, binary 11000011).
According to the Wikipedia article, a code point in the range U+0080 to U+07FF is padded to 11 bits (here 00011 000011) and mapped to a 2-byte UTF-8 representation as follows:
The first UTF-8 byte has high bits 110 (=192), and its low 5 bits are the first 5 of those 11 bits (0b00011 = 3).
The second UTF-8 byte has high bits 10 (=128), and its low 6 bits are the last 6 bits of the code point (0b000011 = 3).
The resulting two-byte UTF-8 string is 0xC3 0x83: decimal bytes 195 (=192+3) and 131 (=128+3).
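The same bit arithmetic can be written out in Go and checked against the standard library. This is only a sketch: encodeTwoByte is a hypothetical helper that handles just the 2-byte case (code points U+0080 to U+07FF), whereas utf8.EncodeRune from unicode/utf8 handles every width.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeTwoByte is a hypothetical helper that applies the 2-byte UTF-8
// layout (110xxxxx 10xxxxxx) by hand. It is only meaningful for code
// points in the range U+0080..U+07FF.
func encodeTwoByte(r rune) [2]byte {
	return [2]byte{
		0xC0 | byte(r>>6),   // 110 prefix (192) plus the top 5 of the 11 payload bits
		0x80 | byte(r&0x3F), // 10 prefix (128) plus the low 6 bits
	}
}

func main() {
	r := rune(0xC3) // Ã, code point 195

	manual := encodeTwoByte(r)

	// Let the standard library encode the same rune for comparison.
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r)

	fmt.Println(manual, buf[:n])
}
```

Both values should print as [195 131], matching the bytes in the question.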