Unicode and range

MikeSolem · January 9, 2023, 9:27pm

I’m trying to understand Unicode handling in Go. The following code has a string with a single character. Converting the string to a byte slice and printing it shows that this character is encoded using two bytes. I then range over the string and print each rune’s value, character, and type.

package main

import (
	"fmt"
)

func main() {
	s := "Ã"
	b := []byte(s)
	fmt.Println(b)
	for _, r := range s {
		fmt.Printf("%d=%c  %T", r, r, r)
	}
	fmt.Println()
}

This prints:

[195 131]
195=Ã  int32

Printing using %c shows the correct character.

%d, however, shows only the first byte (195). What happened to 131? Shouldn’t that be in r as well? r is of type int32. I thought the whole point of a rune being int32 was that it could contain all the bytes for the character.

And, if r only contains 195, why does %c know how to print the correct character?

skillian · January 10, 2023, 1:59am

Ã is an unfortunate character to use to explain what’s happening. If you use any other character, it’s easier to see what’s going on. For example, ã (see Go Playground) produces the following output:

[195 163]
227
227=ã  int32

What’s happening is that the first byte, 195, is the first byte of the 2-byte UTF-8 encoding of Unicode code point 227 (U+00E3); see UTF-8 on Wikipedia for how the bits work.

It just so happens that Ã is code point 195 (U+00C3), and the first byte of its UTF-8 encoding is also 195, so it looks like you’re only getting the first byte of the 2-byte UTF-8 encoded rune!
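To see the coincidence side by side, here is a minimal sketch that prints the code point and the UTF-8 bytes for both characters. The helper name describe is made up for this illustration; utf8.DecodeRuneInString is from the standard unicode/utf8 package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is a hypothetical helper: it decodes the first rune of s and
// returns the code point together with the UTF-8 bytes that encode it.
func describe(s string) (rune, []byte) {
	r, size := utf8.DecodeRuneInString(s)
	return r, []byte(s[:size])
}

func main() {
	// "\u00c3" is Ã and "\u00e3" is ã, written as escapes to avoid
	// any dependence on the editor's encoding.
	for _, s := range []string{"\u00c3", "\u00e3"} {
		r, b := describe(s)
		fmt.Printf("%c: code point %d (%U), UTF-8 bytes %v\n", r, r, r, b)
	}
}
```

For Ã the code point and the first byte are both 195; for ã the code point is 227 while the first byte is still 195.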

alex99 · January 10, 2023, 1:13pm

The bytes in the UTF-8 representation of a string are not the same as the bytes of runes.
rune is a 4-byte entity (alias for int32).
Each rune returned by the for/range loop corresponds to 1, 2, 3 or 4 bytes of the string, according to the values of certain bits of each byte.
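As a rough illustration of that last point, the sketch below ranges over a string whose runes take 1, 2, 3 and 4 bytes. The helper runeWidths and the sample string are assumptions chosen for this example; utf8.RuneLen is from the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeWidths is a hypothetical helper: for each rune in s it records how
// many bytes that rune occupies in the UTF-8 encoded string.
func runeWidths(s string) []int {
	var widths []int
	for _, r := range s {
		widths = append(widths, utf8.RuneLen(r))
	}
	return widths
}

func main() {
	// 'a' (1 byte), U+00C3 (2 bytes), U+20AC (3 bytes), U+1F600 (4 bytes).
	s := "a\u00c3\u20ac\U0001F600"
	// The first value from range is the byte offset where the rune starts,
	// so it jumps by the width of the previous rune.
	for i, r := range s {
		fmt.Printf("byte offset %d: %U (%d bytes)\n", i, r, utf8.RuneLen(r))
	}
	fmt.Println(runeWidths(s), len(s))
}
```

The loop runs four times even though len(s) is 10, because range decodes one whole rune per iteration.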

The Unicode code point for Ã is 0xc3 (decimal 195).
According to the Wikipedia article, a code point in this range is treated as an 11-bit value (here 0b00011000011) and mapped to a 2-byte UTF-8 representation as follows:
The first UTF-8 byte has high bits 110 (=192), and its low 5 bits are the first 5 bits of the code point (0b00011 = 3).
The second UTF-8 byte has high bits 10 (=128), and its low 6 bits are the last 6 bits of the code point (0b000011 = 3).
The resulting two-byte UTF-8 sequence is 0xc3 0x83: decimal bytes 195 (=192+3) and 131 (=128+3).
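The same arithmetic can be sketched in Go. The function encodeTwoByte is a made-up helper for this post, not a library function, and it only handles code points from U+0080 to U+07FF; its result is compared with the standard utf8.EncodeRune.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeTwoByte is a hypothetical helper that hand-encodes a code point in
// the range U+0080..U+07FF into its two-byte UTF-8 form.
func encodeTwoByte(r rune) [2]byte {
	return [2]byte{
		0xC0 | byte(r>>6),   // 110xxxxx: 192 plus the top 5 of the 11 bits
		0x80 | byte(r&0x3F), // 10xxxxxx: 128 plus the low 6 bits
	}
}

func main() {
	r := rune(0xC3) // the code point for Ã

	manual := encodeTwoByte(r)

	// Reference result from the standard library.
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r)

	fmt.Println(manual[:], buf[:n])
}
```

Both slices print as [195 131], matching the byte slice in the original question.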

mje · January 10, 2023, 5:36pm

Required reading: Strings, bytes, runes and characters in Go - The Go Programming Language

MikeSolem · January 10, 2023, 6:32pm

Ah, ok. So the rune is the code point and is not UTF-8. That makes a lot of sense. Thanks guys!
