UTF-8 Rune Count with Fixed-Width Files

If you have had to work with extended characters, you have no doubt come up against len(someString) causing problems, because most of the time when you ask for the length of a string you want (more or less) the number of characters in the string, not the number of bytes. This code:

str := "Ñ世界"
fmt.Println("len =", len(str))
fmt.Println("runes =", utf8.RuneCountInString(str))

Produces the following (Ñ encodes to two bytes in UTF-8, while 世 and 界 encode to three bytes each):

len = 8
runes = 3

You might be familiar with the excellent Go blog post Strings, bytes, runes and characters in Go (https://go.dev/blog/strings), which should be required reading for every Go dev; at some point, it will come in handy.

In my case, I have been dealing with fixed-width order uploads where the width is determined by the number of runes, not bytes. Right now, I’m doing something like this:

func processRow(row string) error {
	rowLen := utf8.RuneCountInString(row)
	// For contrived testing purposes, let's just
	// pretend this has 3 fields. Each 1 char wide.
	if rowLen != 3 {
		return fmt.Errorf("invalid row length: %v", rowLen)
	}
	// Get all runes
	runes := make([]rune, rowLen)
	i := 0
	for _, v := range row {
		runes[i] = v
		i++
	}
	// Construct our columns based on widths
	col1 := string(runes[0:1])
	col2 := string(runes[1:2])
	col3 := string(runes[2:3])
	fmt.Printf("Row columns: %v, %v, %v.", col1, col2, col3)
	return nil
}

It seems kind of inefficient to take a string, convert it to a slice of runes, only to then convert those runes back to strings. I know this is premature optimization but I’m just curious if anybody else has had to tackle similar problems. If so, how did you deal with it?

Are you looking for utf8.DecodeRuneInString?

var cols [3]string

for rowIndex, colIndex := 0, 0; rowIndex < len(row); colIndex++ {
    // n is the width in bytes of the rune starting at rowIndex.
    _, n := utf8.DecodeRuneInString(row[rowIndex:])
    cols[colIndex] = row[rowIndex : rowIndex+n]
    rowIndex += n
}

I’ve used utf8.DecodeRuneInString as well, but my understanding is that it is identical to for range in this case. From that doc I linked to above:

Besides the axiomatic detail that Go source code is UTF-8, there’s really only one way that Go treats UTF-8 specially, and that is when using a for range loop on a string.

We’ve seen what happens with a regular for loop. A for range loop, by contrast, decodes one UTF-8-encoded rune on each iteration. Each time around the loop, the index of the loop is the starting position of the current rune, measured in bytes, and the code point is its value.
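
To make that concrete, here is a minimal sketch showing the byte indices a for range loop reports for my earlier test string:

for i, r := range "Ñ世界" {
	fmt.Printf("%d: %c\n", i, r)
}
// Output:
// 0: Ñ
// 2: 世
// 5: 界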

Thus given my contrived processRow function, this:

func main() {
	processRow("Ñ世界")
}

… produces the following output:

Row columns: Ñ, 世, 界.

Which is fine. I guess I was hoping I didn’t have to decode runes and could just slice my string based on rune index rather than byte index. I think the real problem is that it’s odd I’m dealing with column widths based on rune count, not byte count. Most storage engines deal in byte counts, but most users think in terms of rune/character counts (which is itself complicated, because what constitutes a “character” isn’t as cut and dried as I once thought it was!).
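
Just to make concrete what I was hoping to avoid writing, here is a hypothetical helper (the name runeSlice is mine, not from any library) that slices by rune index; note it still has to decode the entire string:

func runeSlice(s string, start, end int) string {
	// Converting to []rune is O(n) in the string length; UTF-8
	// offers no way to jump straight to the nth rune.
	return string([]rune(s)[start:end])
}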

I think I’ll stop trying to prematurely optimize this. I’ve got something that is working well enough. Just wanted to double-check that there wasn’t some far superior way of thinking about this that I wasn’t aware of.

I’m intrigued by whatever it is you’re working on, but I’m not yet sure what exactly you’re trying to optimize for. Because UTF-8 is a variable-width encoding, there is no way to index to a specific rune in a UTF-8 encoded string without scanning through and counting the variable-width code points. If you need to index to a specific rune, I’d want to know more background about the problem, because you might want to instead work with your strings as []rune slices.

I’m more just wondering how other people have approached this. This is for an international client with a B2B portal. They are trying to modernize but everything in their business runs on AS400; same with their clients. So, our first step was to create a web portal that’s deployed to one of the big-3 cloud providers.

We allow order uploads via this portal, and one of the formats is fixed-width. But the clients consider “width” to more or less mean “characters”, which is where it gets slightly complicated. I noted a simple case above, but consider this:

func main() {
	// "ä" here is the decomposed form: 'a' followed by
	// U+0308 COMBINING DIAERESIS, i.e. two code points.
	str := "a\u0308"
	for _, v := range str {
		fmt.Println(string(v))
	}
}

… produces:

a
̈

Because “rune” maps more or less to a Unicode code point, and it takes two code points to represent that character. That Go blog post I linked to has a really good explanation (bold emphasis mine):

We’ve been very careful so far in how we use the words “byte” and “character”. That’s partly because strings hold bytes, and partly because the idea of “character” is a little hard to define. The Unicode standard uses the term “code point” to refer to the item represented by a single value. The code point U+2318, with hexadecimal value 2318, represents the symbol ⌘. (For lots more information about that code point, see its Unicode page.)

To pick a more prosaic example, the Unicode code point U+0061 is the lower case Latin letter ‘A’: a.

But what about the lower case grave-accented letter ‘A’, à? That’s a character, and it’s also a code point (U+00E0), but it has other representations. For example we can use the “combining” grave accent code point, U+0300, and attach it to the lower case letter a, U+0061, to create the same character à. In general, a character may be represented by a number of different sequences of code points, and therefore different sequences of UTF-8 bytes.

The concept of character in computing is therefore ambiguous, or at least confusing, so we use it with care. To make things dependable, there are normalization techniques that guarantee that a given character is always represented by the same code points, but that subject takes us too far off the topic for now. A later blog post will explain how the Go libraries address normalization.

“Code point” is a bit of a mouthful, so Go introduces a shorter term for the concept: rune. The term appears in the libraries and source code, and means exactly the same as “code point”, with one interesting addition.

I don’t have a specific question or problem at this point; I’m just wondering what other people have done. It seems like grapheme clusters (“user-perceived characters”) in Unicode® Standard Annex #29 more or less address the question of “what is a character in the user-perceived sense?”. There’s a Go module that implements this.
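
For illustration, here is a minimal sketch using github.com/rivo/uniseg, one such module that implements UAX #29 grapheme cluster segmentation (I’m assuming it, or something like it, here):

package main

import (
	"fmt"
	"unicode/utf8"

	"github.com/rivo/uniseg"
)

func main() {
	str := "a\u0308" // 'a' + U+0308 COMBINING DIAERESIS, rendered as ä
	fmt.Println("bytes:", len(str))                              // 3
	fmt.Println("runes:", utf8.RuneCountInString(str))           // 2
	fmt.Println("graphemes:", uniseg.GraphemeClusterCount(str))  // 1
}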

And regarding my original comment about it seeming inefficient to convert to a slice of runes only to convert back to strings: I think I was being overly worried about memory allocation. I did some benchmarking of iterating over runes versus re-slicing the string; the latter (while less ergonomic) was faster, but it almost certainly doesn’t matter.
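
For reference, the benchmarks were shaped roughly like this (a sketch in a _test.go file; the function names and 3-rune test row are contrived):

package main

import (
	"testing"
	"unicode/utf8"
)

func BenchmarkRuneConvert(b *testing.B) {
	row := "Ñ世界"
	for i := 0; i < b.N; i++ {
		// Decode the whole row to []rune, then convert one column back.
		_ = string([]rune(row)[0:1])
	}
}

func BenchmarkByteReslice(b *testing.B) {
	row := "Ñ世界"
	for i := 0; i < b.N; i++ {
		// Re-slice the original string at a decoded rune boundary;
		// no intermediate []rune allocation.
		_, n := utf8.DecodeRuneInString(row)
		_ = row[:n]
	}
}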

Check out the norm package: golang.org/x/text/unicode/norm. You can use Normalization Form C or Normalization Form KC to normalize the text to a common form (both “a” + U+0308 and the precomposed “ä” normalize to the single code point “ä”).

See also: UAX #15: Unicode Normalization Forms
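
A minimal sketch of that normalization in code (assuming the golang.org/x/text module is available):

package main

import (
	"fmt"
	"unicode/utf8"

	"golang.org/x/text/unicode/norm"
)

func main() {
	decomposed := "a\u0308" // 'a' + U+0308 COMBINING DIAERESIS
	composed := norm.NFC.String(decomposed)

	fmt.Println(utf8.RuneCountInString(decomposed)) // 2
	fmt.Println(utf8.RuneCountInString(composed))   // 1 (precomposed U+00E4)
	fmt.Println(composed == "\u00e4")               // true
}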

