Golang not supporting UTF8 for some reason

k71 · November 22, 2020, 10:19pm

Hi,

I have been ripping my hair out trying to get my Golang API to read MySQL database columns through gorm, and have gotten nowhere with the entries that contain foreign-language values. There don’t seem to be any guides on this anywhere online, so I’m hoping I can get some help here.

I’ve got a database that has many tables containing Japanese characters. For example, I have this string in one of the rows:
美味しい

However, when doing a SELECT on the database for this entry that has this text value, it keeps coming back as:
ç¾Žå‘³ã—ã„

I’ve ensured the database is correctly created with utf8mb4 and even have this particular column set with CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci on the table. For my call to gorm.Open, I’ve set the following parameters at the end to ensure that it’s parsing utf8mb4: ?charset=utf8mb4&parseTime=true
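For reference, a minimal sketch of how a connection string with those two parameters could be assembled for the go-sql-driver/mysql DSN format that gorm's MySQL driver uses. The helper name `buildDSN` and every value passed to it (user, password, address, database name) are placeholders, not details from this post.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildDSN assembles a DSN of the form
// user:password@tcp(host:port)/dbname?param=value.
// Only the two query parameters come from the post; the rest is illustrative.
func buildDSN(user, password, addr, dbName string) string {
	params := url.Values{}
	params.Set("charset", "utf8mb4") // connection character set
	params.Set("parseTime", "true")  // scan DATE/DATETIME into time.Time
	return fmt.Sprintf("%s:%s@tcp(%s)/%s?%s", user, password, addr, dbName, params.Encode())
}

func main() {
	// Placeholder credentials and address.
	fmt.Println(buildDSN("app_user", "secret", "127.0.0.1:3306", "my_database"))
}
```

The resulting string is what would be handed to the MySQL driver when opening the connection. Note that `charset` only governs the connection; it does not change how the server or database defaults are configured.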

How can I get Golang to support UTF8 properly when the database is properly set?

If it helps, here’s the UTF-8 text in the database, as expected:

    mysql> select alt_text from picture_details where id=136;
    +--------------+
    | alt_text     |
    +--------------+
    | 美味しい |
    +--------------+
    1 row in set (0.00 sec)

And yes, I’ve tried to change the text to a manually encoded string of \xe7\xbe\x8e\xe5\x91\xb3\xe3\x81\x97\xe3\x81\x84 just to see if that’d resolve it (it didn’t).

Thanks!

petrus · November 23, 2020, 12:18am

It is the Go programming language: The Go Programming Language Specification.

Rob Pike is a co-creator of both Go and UTF-8. Therefore, Go properly supports UTF-8.

You don’t provide a small, reproducible code example so we don’t know what your problem is.

However, you appear to be interpreting the characters as a byte slice instead of a string.

package main

import (
	"fmt"
)

func main() {
	jp := `美味しい` // UTF-8 as string
	fmt.Println(jp)
	bs := []byte(jp) // UTF-8 as byte slice
	fmt.Printf("%c\n", bs)
	s := string(bs) // UTF-8 as string
	fmt.Println(s)
}

美味しい
[ç ¾  å  ³ ã   ã  ]
美味しい
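As a further illustration of the same distinction, here is a small sketch using only the standard library. It shows that the four Japanese characters occupy twelve bytes, and that ranging over a string decodes whole runes while `%c` applied to a byte slice treats each byte as its own code point. The helper name `describe` is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe returns the number of bytes and the number of runes in s.
func describe(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	jp := "美味しい"
	nBytes, nRunes := describe(jp)
	fmt.Println(nBytes, nRunes) // 12 4

	// "% x" prints the raw UTF-8 bytes, space separated.
	fmt.Printf("% x\n", jp) // e7 be 8e e5 91 b3 e3 81 97 e3 81 84

	// range decodes one rune per iteration; i is the byte offset.
	for i, r := range jp {
		fmt.Printf("%d:%c ", i, r)
	}
	fmt.Println()
}
```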

k71 · November 23, 2020, 2:02am

Hi petrus,

Thanks for the response. It’s definitely good to know it’s supposed to work with UTF-8. Because it’s just pulling test data from a database and printing it via fmt.Println, I didn’t have much code to share beyond a gorm operation and the models, but I’ll provide what I have below.

It’s interesting to see that it’s a byte slice, as you pointed out. However, it appears there are some extra characters or something showing up in the byte slice. I basically took the code snippet you showed as an example:


	s2 := string(pictures[0].AltText)
	fmt.Println(s2)

	jp := `美味しい` // UTF-8 as string
	fmt.Println(jp)
	bs := []byte(jp) // UTF-8 as byte slice
	fmt.Printf("%c\n", bs)
	s := string(bs) // UTF-8 as string
	fmt.Println(s)

Outputs:

ç¾Žå‘³ã—ã„
美味しい
[ç ¾  å  ³ ã   ã  
                  ]
美味しい

I decided to find out what the hex/binary were for these bytes to see if there were similarities. For the entry from the database + the hardcoded string, I just printed out the values and compared, but found nothing really similar outside the fact they were the same length:

pictures[0].AltText: c3 c2 c5 c3 e2 c2 c3 c2 e2 c3 c2 e2 
bs: e7 be 8e e5 91 b3 e3 81 97 e3 81 84
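One possible reading of the two hex lines above, offered as an assumption and not something established in the thread: if each of the twelve UTF-8 bytes in `bs` is misread as a separate single-byte (Windows-1252 style) character and then re-encoded as UTF-8, the result has twelve characters, and the first byte of each matches the first line. The sketch below reproduces that; `misdecode` and `leadBytes` are hypothetical helper names, and the mapping table covers only the bytes needed here.

```go
package main

import "fmt"

// cp1252High holds the Windows-1252 code points for the 0x80-0x9f bytes that
// occur in this example. Bytes missing from the map (such as 0x81, which
// Windows-1252 leaves undefined) fall through to the code point of the same
// value; that fallback is an assumption about the server's latin1 handling.
var cp1252High = map[byte]rune{
	0x84: '\u201e', // double low-9 quotation mark
	0x8e: '\u017d', // Ž
	0x91: '\u2018', // left single quotation mark
	0x97: '\u2014', // em dash
}

// misdecode treats every byte of s as one single-byte character and
// re-encodes the result as UTF-8, which is how double encoding arises.
func misdecode(s string) string {
	out := make([]rune, 0, len(s))
	for i := 0; i < len(s); i++ {
		b := s[i]
		if r, ok := cp1252High[b]; ok {
			out = append(out, r)
		} else {
			out = append(out, rune(b))
		}
	}
	return string(out)
}

// leadBytes returns the first UTF-8 byte of every rune in s.
func leadBytes(s string) []byte {
	var lead []byte
	for i := range s { // i is the byte offset where each rune starts
		lead = append(lead, s[i])
	}
	return lead
}

func main() {
	garbled := misdecode("美味しい")
	fmt.Println(garbled)
	fmt.Printf("% x\n", leadBytes(garbled)) // c3 c2 c5 c3 e2 c2 c3 c2 e2 c3 c2 e2
}
```

If this reading is right, the two byte sequences are related after all: the text was converted between character sets somewhere between storage and the client, rather than being mishandled by Go's string type.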

Basically, here’s what I have in regards to the model + Gorm commands I’m running (I narrowed it down to only focus on the alt_text, ignoring the other details of the model and scan):

type PictureDetails struct {
	AltText  string `json:"altText"`
	Language string `json:"language"`
	...
}

type Picture struct {
	AltText        string           `gorm:"<-:false" json:"altText,omitempty"`
	Language       string           `gorm:"<-:false" json:"language,omitempty"`
	PictureDetails []PictureDetails `gorm:"-" json:"pictureDetails,omitempty"`
	...
}

	var pictures []models.Picture
	db.Table("pictures").
		Select("pictures.*, picture_details.alt_text, picture_details.picture_language").
		Joins("left join picture_details on picture_details.picture_id = pictures.id").
		Where("picture_language = ?", language).
		Scan(&pictures)

The API returns an array for the given language as:

[
 {
  altText: "<Japanese here>"
  ...
 }, 
 ...
]

k71 · November 23, 2020, 2:20am

woooow, I got it…

I looked deeper into my database code and saw that the database itself wasn’t set to utf8mb4 for some reason. I apparently missed the most important part of this whole thing.
Since it’s a test database, I recreated it with:
CREATE DATABASE IF NOT EXISTS my_database CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Of course, my existing database will have to be altered instead, with ALTER DATABASE my_database CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;

But that’s not all. I also had to add this to my docker-compose command:
--skip-character-set-client-handshake

At various points I had tried one or the other, but never both at the same time until now. Apparently both were needed.
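A mismatch like this can be spotted from Go by reading the session's `character_set_%` variables and flagging any that differ from utf8mb4. The following is only a sketch under stated assumptions: `loadCharsetVars` needs a `*sql.DB` opened with a registered MySQL driver (not imported here), `mismatches` is a hypothetical helper, and the values in `main` are made up for illustration, not taken from this server.

```go
package main

import (
	"database/sql"
	"fmt"
)

// charsetVars are the MySQL variables that describe how text is converted
// between client, connection, and storage defaults.
var charsetVars = []string{
	"character_set_client",
	"character_set_connection",
	"character_set_results",
	"character_set_database",
	"character_set_server",
}

// loadCharsetVars reads the character_set_* variables for one connection.
// It is not called in main because this sketch imports no MySQL driver.
func loadCharsetVars(db *sql.DB) (map[string]string, error) {
	rows, err := db.Query("SHOW VARIABLES LIKE 'character_set_%'")
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	vars := make(map[string]string)
	for rows.Next() {
		var name, value string
		if err := rows.Scan(&name, &value); err != nil {
			return nil, err
		}
		vars[name] = value
	}
	return vars, rows.Err()
}

// mismatches returns the names in charsetVars whose value is not want.
func mismatches(vars map[string]string, want string) []string {
	var bad []string
	for _, name := range charsetVars {
		if vars[name] != want {
			bad = append(bad, name)
		}
	}
	return bad
}

func main() {
	// Hypothetical values for a server whose defaults were left at latin1.
	sample := map[string]string{
		"character_set_client":     "utf8mb4",
		"character_set_connection": "utf8mb4",
		"character_set_results":    "utf8mb4",
		"character_set_database":   "latin1",
		"character_set_server":     "latin1",
	}
	fmt.Println(mismatches(sample, "utf8mb4"))
}
```

With a real connection, the map returned by `loadCharsetVars(db)` would be passed to `mismatches` in place of the sample.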

Thanks for the assistance and for pointing out the fun fact about it becoming a byte slice. My API is now returning Japanese (finally)!
