Go String Encoding? UTF-8? Unicode? One Read and It All Makes Sense

Go byte rune string

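In Go, a string conventionally holds UTF-8-encoded text. Its underlying storage can be viewed at two levels: split by byte (byte) or split by character (rune). This article introduces the basic concepts of character encoding, explains how the three types relate to each other, and walks through some practical string operations.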

I. Basic Concepts

This part covers how Unicode and UTF-8 relate to each other, and the UTF-8 encoding rules.

1. Unicode

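Unicode is a character set used on computers. It assigns a single, unique number (a code point) to every character of every language, so that text can be converted and processed across languages and platforms. In essence, Unicode defines a one-to-one mapping between characters and code points, so it describes characters one at a time.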

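For a string, storing raw Unicode code points directly is problematic: different characters need different numbers of bytes, and the resulting byte stream cannot be split back into characters unambiguously. For example, the Chinese character "南" has the code point 0x5357. Those two bytes can be read as the single character "南", or as 0x53 (S) followed by 0x57 (W). Unicode code points alone therefore cannot encode a string, because the computer cannot tell where to split the bytes or which bytes belong to the same character. What is needed is an encoding layered on top of Unicode that spends some redundant bits to mark each character, so that characters can still be separated correctly once they are concatenated. UTF-8 is such an encoding.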

2. UTF-8

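UTF-8 is a variable-length character encoding for Unicode that can represent every character in the Unicode standard. UTF-8 is thus one implementation of Unicode: Unicode defines the one-to-one mapping for individual characters, while UTF-8 defines how those characters are laid out as bytes in a sequence. UTF-16 and UTF-32 are similar encodings, but they are less widely used than UTF-8.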
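To make the comparison between the encodings concrete, here is a minimal sketch that counts how many bytes the same text needs in each of them. The helper name encodedSizes is made up for this illustration, and the input is assumed to be valid UTF-8.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// encodedSizes (a hypothetical helper) reports how many bytes s occupies
// in UTF-8, UTF-16 and UTF-32. s is assumed to hold valid UTF-8.
func encodedSizes(s string) (utf8Bytes, utf16Bytes, utf32Bytes int) {
	runes := []rune(s)
	utf8Bytes = len(s)                        // a Go string is a byte sequence
	utf16Bytes = 2 * len(utf16.Encode(runes)) // each UTF-16 code unit is 2 bytes
	utf32Bytes = 4 * len(runes)               // each UTF-32 code unit is 4 bytes
	return
}

func main() {
	for _, s := range []string{"Go", "南", "😀"} {
		a, b, c := encodedSizes(s)
		fmt.Printf("%q: UTF-8=%d UTF-16=%d UTF-32=%d bytes\n", s, a, b, c)
	}
}
```

ASCII text is smallest in UTF-8, while a character such as "南" takes 3 bytes in UTF-8 but only 2 in UTF-16.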

Encoding rules

• ASCII characters (excluding the extended range 128+), 0000 0000-0000 007F (0-7 bits)

• 0xxxxxxx

• 0000 0080-0000 07FF (8-11 bits)

• 110xxxxx 10xxxxxx

• 0000 0800-0000 FFFF (12-16 bits)

• 1110xxxx 10xxxxxx 10xxxxxx

• 0001 0000-0010 FFFF (17-21 bits)

• 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

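Summary

1. For ASCII characters (excluding the extended range 128+), the UTF-8 encoding, the Unicode code point and the ASCII code are identical: a single byte whose highest bit is 0.
2. For non-ASCII characters, if the encoded character occupies n bytes, the first byte starts with n ones followed by a single 0, and every remaining byte starts with 10. The bits left over after these fixed prefixes, concatenated, form the Unicode code point.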
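The prefix rule in the summary can be sketched in a few lines of Go. The function name seqLen is hypothetical, and the sketch only inspects the prefix pattern of the first byte; it does not reject lead bytes that are otherwise invalid in UTF-8.

```go
package main

import (
	"fmt"
	"math/bits"
)

// seqLen (a hypothetical helper) returns the length of the UTF-8 sequence
// that starts with first, judged only by its leading 1 bits. It returns 0
// for a continuation byte (10xxxxxx) or a prefix longer than 11110.
func seqLen(first byte) int {
	// The leading zeros of the complement are the leading ones of first.
	switch n := bits.LeadingZeros8(^first); n {
	case 0:
		return 1 // 0xxxxxxx
	case 2, 3, 4:
		return n // 110xxxxx, 1110xxxx, 11110xxx
	default:
		return 0
	}
}

func main() {
	for _, b := range []byte("A南") {
		fmt.Printf("%08b -> %d\n", b, seqLen(b))
	}
}
```

For the input "A南" the first byte of each character reports 1 and 3 respectively, and the two continuation bytes of "南" report 0.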

Conversion (UTF-8 sequences of two or more bytes)

UTF-8 to Unicode

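Split the UTF-8 data into bytes, strip the prefix bits from the head of each byte according to the encoding rules, and concatenate the remaining bits to obtain the Unicode code point.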

Unicode to UTF-8

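Starting from the low-order end, take 6 bits at a time and prepend 10 to form a trailing byte. When fewer than 6 bits remain, prepend the matching prefix of n ones and a 0 to form the first byte, padding its unused high-order payload bits with 0. If the remaining bits do not fit beside that prefix, emit them as one more 10xxxxxx byte and apply the same rule with the next longer prefix.

(In practice: work out the byte count from the rules, write down the prefix bits of every byte first, then fill in the payload bits from the low-order end.)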

Worked example

UTF-8 to Unicode

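The character "南" has the UTF-8 hexadecimal encoding 0xe58d97, which is 11100101 10001101 10010111 in binary.

Stripping 1110 from the first byte and 10 from the second and third bytes leaves 0101 0011 0101 0111, i.e. the Unicode code point 0x5357.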
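The same steps can be written as bit masks and shifts. This is a minimal sketch for three-byte sequences only; decode3 is a name invented here, it performs no validation, and the standard library's utf8.DecodeRune is shown alongside for comparison.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decode3 (a hypothetical helper) decodes a three-byte UTF-8 sequence by
// dropping the 1110 prefix of the first byte and the 10 prefix of the other
// two, then joining the remaining 4+6+6 bits. It does not validate its input.
func decode3(b0, b1, b2 byte) rune {
	return rune(b0&0x0F)<<12 | rune(b1&0x3F)<<6 | rune(b2&0x3F)
}

func main() {
	r := decode3(0xE5, 0x8D, 0x97)
	fmt.Printf("manual: U+%04X %c\n", r, r)

	std, size := utf8.DecodeRune([]byte{0xE5, 0x8D, 0x97})
	fmt.Printf("stdlib: U+%04X %c (%d bytes)\n", std, std, size)
}
```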

Unicode to UTF-8

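The character "南" has the Unicode code point 0x5357, which is 0101 0011 0101 0111 in binary (15 significant bits). That falls in the 12-16 bit range, so the UTF-8 form takes 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx.

Filling the x positions from the low-order end gives 11100101 10001101 10010111.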
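The reverse direction is a matter of shifting and OR-ing in the prefixes. The sketch below handles only the three-byte range; encode3 is a made-up name, and utf8.EncodeRune from the standard library is used as a cross-check.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode3 (a hypothetical helper) encodes a code point in the range
// U+0800..U+FFFF as three UTF-8 bytes. Callers are assumed to pass a value
// in that range; nothing is validated here.
func encode3(r rune) [3]byte {
	return [3]byte{
		0xE0 | byte(r>>12),     // 1110xxxx: top 4 bits
		0x80 | byte(r>>6)&0x3F, // 10xxxxxx: middle 6 bits
		0x80 | byte(r)&0x3F,    // 10xxxxxx: low 6 bits
	}
}

func main() {
	manual := encode3(0x5357)
	fmt.Printf("manual: % x\n", manual[:])

	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, 0x5357)
	fmt.Printf("stdlib: % x\n", buf[:n])
}
```

Both lines should show the bytes e5 8d 97 derived by hand above.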

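3. UCA (Unicode Collation Algorithm)

UCA is the collation (ordering and comparison) algorithm for Unicode characters. At the time of writing, the latest version is 15.0.0 (2022-05-03 12:36). Taking 14.0.0 as the reference, its data files consist mainly of two parts, allkeys and decomps, which describe the ordering, case and decomposition relationships of the character set; see the official Unicode documentation for details. UCA versions differ from one another: two characters may be defined as an upper/lower case pair in 14.0.0 yet have no case relationship in 5.0.0. An application that only supports 5.0.0 may handle characters added in 14.0.0 through hard-coded special cases, depending on the implementation. As a result, in a cross-platform, multi-language system the individual services may well use different UCA versions, and for some characters the differing collation and case-conversion rules can produce inconsistent results.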

II. byte, rune and string

1. Type definitions

All three are built-in Go types, documented in the builtin package:

// byte is an alias for uint8 and is equivalent to uint8 in all ways. It is
// used, by convention, to distinguish byte values from 8-bit unsigned
// integer values.
type byte = uint8

// rune is an alias for int32 and is equivalent to int32 in all ways. It is
// used, by convention, to distinguish character values from integer values.
type rune = int32

// string is the set of all strings of 8-bit bytes, conventionally but not
// necessarily representing UTF-8-encoded text. A string may be empty, but
// not nil. Values of string type are immutable.
type string string

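byte is an alias for uint8 and conventionally represents one byte (8 bits).

rune is an alias for int32 and conventionally represents one character (32 bits).

string is a sequence of 8-bit bytes that conventionally holds UTF-8-encoded text.

According to these official definitions, a string is a sequence of bytes, each 8 bits wide, that is usually but not necessarily encoded as UTF-8. A rune is a four-byte value that represents one character; its value is the character's Unicode code point.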

str := "南"

In UTF-8 the string "南" occupies three bytes, 0xe58d97, so as a byte slice it is:

byteList := []byte{0xe5,0x8d,0x97}

The three bytes together represent a single character, so the corresponding rune is its Unicode code point, 0x5357:

runeList := []rune{0x5357}

Although str, byteList and runeList above have three different types (string, byte slice and rune slice), they all represent the same character "南".
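A short program can confirm that the three values are interchangeable, and also shows the practical consequence: len counts bytes, while ranging over a string steps through runes. The helper runeOffsets is introduced here purely for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// runeOffsets (a hypothetical helper) returns the byte offset at which each
// rune of s starts, using the fact that range over a string decodes UTF-8.
func runeOffsets(s string) []int {
	var offs []int
	for i := range s {
		offs = append(offs, i)
	}
	return offs
}

func main() {
	str := "南"
	byteList := []byte{0xe5, 0x8d, 0x97}
	runeList := []rune{0x5357}

	fmt.Println(string(byteList) == str)               // bytes -> string
	fmt.Println(string(runeList) == str)               // runes -> string
	fmt.Println(bytes.Equal([]byte(str), byteList))    // string -> bytes
	fmt.Println(len(str), utf8.RuneCountInString(str)) // byte count vs rune count

	for i, r := range "Go南" {
		fmt.Printf("offset %d: %c (U+%04X)\n", i, r, r)
	}
	fmt.Println(runeOffsets("Go南世"))
}
```

Indexing with str[i] yields a single byte, so code that needs characters should range over the string or convert it to []rune first.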

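2. Type conversions

The conversion syntax itself does not let you jump to the implementation. To find the runtime function behind a conversion, look at the Plan 9 assembly that the compiler generates.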

package main

import "fmt"

func main() {
    byteList := []byte{0xe5, 0x8d, 0x97}
    str := string(byteList) // main.go:7, the conversion traced in the assembly
    fmt.Println(str)
}

The sample code above defines a byte slice (representing the character "南"), converts it to a string and prints it.

go tool compile -S -N -l main.go

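Compiling the code from the command line with optimizations (-N) and inlining (-l) disabled produces assembly like the following (only the part relevant to the conversion is shown):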

0x0074 00116 (main.go:7)        MOVD    ZR, 8(RSP)
0x0078 00120 (main.go:7)        MOVD    R0, 16(RSP)
0x007c 00124 (main.go:7)        MOVD    R1, 24(RSP)
0x0080 00128 (main.go:7)        PCDATA  $1, ZR
0x0080 00128 (main.go:7)        CALL    runtime.slicebytetostring(SB)
0x0084 00132 (main.go:7)        MOVD    32(RSP), R0
0x0088 00136 (main.go:7)        MOVD    40(RSP), R1
0x008c 00140 (main.go:7)        MOVD    R0, "".str-80(SP)
0x0090 00144 (main.go:7)        MOVD    R1, "".str-72(SP)

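The conversion is therefore a call to the slicebytetostring function in the runtime package.

The source for every conversion among the three types can be located through the assembly in the same way; []byte -> string is just the example used here.

rune to []byte (string)

encoderune takes a rune, converts it to bytes following the UTF-8 encoding rules, writes them into p, and returns the number of bytes written.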

// encoderune writes into p (which must be large enough) the UTF-8 encoding of the rune.
// It returns the number of bytes written.
func encoderune(p []byte, r rune) int {
    // Negative values are erroneous. Making it unsigned addresses the problem.
    switch i := uint32(r); {
    case i <= rune1Max:
        p[0] = byte(r)
        return 1
    case i <= rune2Max:
        _ = p[1] // eliminate bounds checks
        p[0] = t2 | byte(r>>6)
        p[1] = tx | byte(r)&maskx
        return 2
    case i > maxRune, surrogateMin <= i && i <= surrogateMax:
        r = runeError
        fallthrough
    case i <= rune3Max:
        _ = p[2] // eliminate bounds checks
        p[0] = t3 | byte(r>>12)
        p[1] = tx | byte(r>>6)&maskx
        p[2] = tx | byte(r)&maskx
        return 3
    default:
        _ = p[3] // eliminate bounds checks
        p[0] = t4 | byte(r>>18)
        p[1] = tx | byte(r>>12)&maskx
        p[2] = tx | byte(r>>6)&maskx
        p[3] = tx | byte(r)&maskx
        return 4
    }
}

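Conversions from rune to both []byte and string are built on encoderune, which uses hard-coded constants and bit operations to turn a Unicode code point into its UTF-8 encoding ([]byte). For that reason rune is set aside from here on, and the focus is on the conversion logic between []byte and string.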
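The branches of encoderune can be observed from ordinary code through the string(r) conversion. The sketch below is only an illustration (the helper name encode is made up); it shows one value per branch, including a surrogate and a negative value, both of which are replaced by U+FFFD (bytes ef bf bd).

```go
package main

import "fmt"

// encode (a hypothetical helper) returns the UTF-8 bytes produced by the
// built-in conversion string(r).
func encode(r rune) []byte {
	return []byte(string(r))
}

func main() {
	// 1-byte, 3-byte and 4-byte code points, then a surrogate and a
	// negative value, which are not valid code points.
	for _, r := range []rune{'A', 0x5357, 0x1F600, 0xD800, -1} {
		fmt.Printf("%#x -> % x\n", r, encode(r))
	}
}
```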

[]byte to string

// slicebytetostring converts a byte slice to a string.
// It is inserted by the compiler into generated code.
// ptr is a pointer to the first element of the slice;
// n is the length of the slice.
// Buf is a fixed-size buffer for the result,
// it is not nil if the result does not escape.
func slicebytetostring(buf *tmpBuf, ptr *byte, n int) (str string) {
    /*
        Some cases (race, msan, n == 0 or 1, etc.) are omitted here.
    */

    var p unsafe.Pointer
    if buf != nil && n <= len(buf) {
        p = unsafe.Pointer(buf)
    } else {
        p = mallocgc(uintptr(n), nil, false)
    }
    stringStructOf(&str).str = p
    stringStructOf(&str).len = n
    memmove(p, unsafe.Pointer(ptr), uintptr(n))
    return
}

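tmpBuf is a fixed-size array of 32 bytes. When the data is longer than 32 bytes, tmpBuf cannot hold it, and a new block of memory has to be allocated to store the string.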

const tmpStringBufSize = 32

type tmpBuf [tmpStringBufSize]byte

stringStructOf reinterprets a *string as a pointer to the runtime's internal stringStruct, so that the data pointer and len can be set.

type stringStruct struct {
   str unsafe.Pointer
   len int
}

func stringStructOf(sp *string) *stringStruct {
    return (*stringStruct)(unsafe.Pointer(sp))
}

Whether tmpBuf is used or new memory is allocated on the heap, the underlying data is always copied with memmove.

string to []byte

func stringtoslicebyte(buf *tmpBuf, s string) []byte {
   var b []byte
   if buf != nil && len(s) <= len(buf) {
      *buf = tmpBuf{}
      b = buf[:len(s)]
   } else {
      b = rawbyteslice(len(s))
   }
   copy(b, s)
   return b
}

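This works the same way: depending on the length of the string, it either uses tmpBuf or allocates new memory, and then copies the underlying data with copy.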
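The copy in both directions is visible from user code: after a conversion, writing to the byte slice never changes the string. A minimal sketch, with a helper name (convertThenMutate) chosen only for this example:

```go
package main

import "fmt"

// convertThenMutate (a hypothetical helper) converts b to a string and then
// overwrites the first byte of b. Because the conversion copies the data,
// the returned string still holds the original contents.
func convertThenMutate(b []byte) string {
	s := string(b)
	if len(b) > 0 {
		b[0] = 'x'
	}
	return s
}

func main() {
	b := []byte("abc")
	s := convertThenMutate(b)
	fmt.Println(s, string(b)) // abc xbc

	b2 := []byte(s) // string -> []byte copies as well
	b2[1] = 'y'
	fmt.Println(s, string(b2)) // abc ayc
}
```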

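III. Practice

1. Optimizing conversion performance

Go's built-in conversions between []byte and string copy memory, so in code that converts back and forth frequently, the sheer volume of copying can hurt performance.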

type stringStruct struct {
   str unsafe.Pointer
   len int
}

type slice struct {
   array unsafe.Pointer
   len   int
   cap   int
}

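Both headers are built from pointer-sized words; the only difference is that []byte carries an extra cap field for the slice's capacity. A string can therefore be viewed as [2]uintptr and a []byte as [3]uintptr, and a conversion only needs to reinterpret one header as the other, without copying the underlying data.

The following is the solution fasthttp provides on this basis for high-performance conversion between string and []byte.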
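The two-word versus three-word layout can be checked with unsafe.Sizeof. This sketch assumes the standard gc toolchain, where the headers have exactly the layout shown above; headerWords is a hypothetical name.

```go
package main

import (
	"fmt"
	"unsafe"
)

// headerWords (a hypothetical helper) returns the size of the string header
// and the []byte header, measured in machine words.
func headerWords() (strWords, sliceWords uintptr) {
	word := unsafe.Sizeof(uintptr(0))
	var s string
	var b []byte
	return unsafe.Sizeof(s) / word, unsafe.Sizeof(b) / word
}

func main() {
	s, b := headerWords()
	fmt.Printf("string: %d words, []byte: %d words\n", s, b)
}
```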

// b2s converts byte slice to a string without memory allocation.
// See https://groups.google.com/forum/#!msg/Golang-Nuts/ENgbUzYvCuU/90yGx7GUAgAJ .
//
// Note it may break if string and/or slice header will change
// in the future go versions.
func b2s(b []byte) string {
    /* #nosec G103 */
    return *(*string)(unsafe.Pointer(&b))
}

// s2b converts string to a byte slice without memory allocation.
//
// Note it may break if string and/or slice header will change
// in the future go versions.
func s2b(s string) (b []byte) {
    /* #nosec G103 */
    bh := (*reflect.SliceHeader)(unsafe.Pointer(&b))
    /* #nosec G103 */
    sh := (*reflect.StringHeader)(unsafe.Pointer(&s))
    bh.Data = sh.Data
    bh.Cap = sh.Len
    bh.Len = sh.Len
    return b
}

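Converting []byte to string only needs to drop cap, so it can be done directly through unsafe.Pointer.

Converting string to []byte has to copy the data pointer and set Cap equal to Len. This is an important detail of the approach: a string has a fixed length, and there is no guarantee that the memory following its data is writable. If Cap were larger than Len, append would not grow the slice and would write into that memory instead, which can crash the program at runtime.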
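The trade-off of the zero-copy approach is that the string and the slice share memory. The sketch below reuses the two functions quoted from fasthttp (comments shortened) to show both effects: a string produced by b2s changes when the slice is written, and a slice produced by s2b has cap equal to len, so append moves the data into a fresh array. It relies on the header layout of the gc toolchain and is meant as a demonstration, not as a recommendation.

```go
package main

import (
	"fmt"
	"reflect"
	"unsafe"
)

// b2s reinterprets the slice header as a string header (no copy).
func b2s(b []byte) string {
	return *(*string)(unsafe.Pointer(&b))
}

// s2b builds a slice header that points at the string's data (no copy).
func s2b(s string) (b []byte) {
	bh := (*reflect.SliceHeader)(unsafe.Pointer(&b))
	sh := (*reflect.StringHeader)(unsafe.Pointer(&s))
	bh.Data = sh.Data
	bh.Cap = sh.Len
	bh.Len = sh.Len
	return b
}

func main() {
	buf := []byte("abc")
	s := b2s(buf)
	buf[0] = 'x'   // s shares buf's memory, so the string changes too
	fmt.Println(s) // xbc

	b := s2b("hello")
	fmt.Println(len(b), cap(b)) // 5 5
	b = append(b, '!')          // cap == len: append copies into a new array
	b[0] = 'H'                  // safe, b no longer points at the string data
	fmt.Println(string(b))      // Hello!
}
```

Writing to the slice returned by s2b before it has been reallocated is not safe, because string data may live in read-only memory. Go 1.20 and later also provide unsafe.String, unsafe.StringData, unsafe.Slice and unsafe.SliceData, which serve the same purpose without depending on reflect.SliceHeader.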

2. UCA inconsistency

In Go, the corresponding Unicode tables are defined in unicode/tables.go, and the top of the file declares the version in use.

// Version is the Unicode edition from which the tables are derived.
const Version = "13.0.0"

Tracing back through the history, tables.go in Go 1 already used version 6.0.0, although the file's location differed slightly from today's.
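A program can print the version its Go toolchain was built with, and case-insensitive comparisons go through those same tables. The sketch below is illustrative only: tablesVersion and sameIgnoringCase are hypothetical wrappers, and the Kelvin sign (U+212A), which folds to the letter k, is used as an example of a pair that exists only because the tables say so.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tablesVersion (a hypothetical helper) returns the Unicode version of the
// tables compiled into this binary.
func tablesVersion() string {
	return unicode.Version
}

// sameIgnoringCase (a hypothetical helper) compares two strings using the
// simple case folding defined by Go's Unicode tables.
func sameIgnoringCase(a, b string) bool {
	return strings.EqualFold(a, b)
}

func main() {
	fmt.Println("Unicode tables version:", tablesVersion())
	fmt.Println(sameIgnoringCase("k", "\u212A")) // letter k vs KELVIN SIGN
	fmt.Println(unicode.ToLower('\u212A') == 'k')
}
```

A system whose tables lack such a pairing would treat the same two strings as different, which is the kind of mismatch described in this section.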

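According to the UCA-related sections of the official MySQL documentation, MySQL uses different UCA versions for different encodings, so there is a good chance that the UCA version used by the underlying database differs from the one used by the application layer. In case-insensitive scenarios this can lead to characters being recognized differently. For example, the application layer may regard two characters as an upper/lower case pair, while MySQL, using an older UCA version, fails to match the uppercase data when a case-insensitive query is issued with the lowercase form.

Because the commonly used characters have hardly changed between versions, UCA inconsistencies rarely affect ordinary applications.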
