So you're a Go developer and you're building your latest and greatest web app. You decide to add some extra flair to your JSON API by adding a 😀 to the end of your success message. You define a success message and marshal it as JSON.
jsonMsg, _ := json.Marshal(map[string]interface{}{
"ok": true,
"message": "Success 😀",
})
fmt.Println(string(jsonMsg))
// -> {"message":"Success 😀","ok":true}
We got the JSON message we expected, including our 😀, so that worked. But how does the smiley face in our code editor end up in a JSON message without breaking something or requiring some kind of Base64-encoded image in the message? It turns out there are actually numerous standards and systems that must work together to make handling Unicode text, including Emoji, seamless.
The first reason it works is that 😀 is not just any image, but an image that's part of a font (called a glyph) that your code editor and most other software use to display text. The image data is already present on your computer and was likely included in your operating system or downloaded with your code editor.
The second reason it works is that your code editor, the Go compiler, the Go runtime, the JSON standard, and the Go JSON library all use Unicode text encoding, specifically UTF-8 encoding. Unicode text is encoded as a series of "code points": numbers that tell your software which glyphs to display. UTF-8 is a Unicode encoding that stores text as a series of 8-bit values and represents each code point as one to four 8-bit values. For a helpful summary of text encoding terminology, check out this Stack Overflow post.
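To make the "one to four 8-bit values" idea concrete, here's a small sketch that asks the standard "unicode/utf8" package how many bytes a few code points need. The helper name encodedLen and the sample characters are just illustrative choices, not anything from the original code.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen reports how many bytes UTF-8 needs for one code point.
// (Hypothetical helper name; it simply wraps utf8.RuneLen.)
func encodedLen(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	// 'A' (U+0041), 'é' (U+00E9), '☃' (U+2603), '😀' (U+1F600)
	for _, r := range []rune{'A', '\u00e9', '\u2603', '\U0001F600'} {
		fmt.Printf("%U needs %d byte(s)\n", r, encodedLen(r))
	}
	// -> U+0041 needs 1 byte(s)
	// -> U+00E9 needs 2 byte(s)
	// -> U+2603 needs 3 byte(s)
	// -> U+1F600 needs 4 byte(s)
}
```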
Why is any of this important? For me, this exploration started when I was attempting to improve handling of Unicode surrogate pair values in the MongoDB Go driver's Extended JSON unmarshaler. Don't worry if that sounds like word salad now; we'll explore those concepts more later.
Unicode in Go
Building on what we just learned about Unicode text encoding, let's quickly review how Go text types work. A Go string is a wrapper around a byte array that typically holds UTF-8-encoded text (although it can technically contain any arbitrary bytes). A Go rune represents a single Unicode code point and is an alias of int32. A string can be directly converted to and from a []byte or []rune. For more information about text representation in Go, read Rob Pike's 2013 blog post Strings, bytes, runes, and characters in Go.
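As a quick illustration of those conversions, the sketch below compares the byte length and code point count of a short string and round-trips it through []byte and []rune. The sizes helper is a made-up name for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sizes returns the number of bytes and the number of code points in s.
// (Hypothetical helper for illustration.)
func sizes(s string) (numBytes, numRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	s := "Go 😀"

	// len counts bytes, not characters: 3 ASCII bytes plus 4 bytes for 😀.
	nb, nr := sizes(s)
	fmt.Println(nb, nr)
	// -> 7 4

	b := []byte(s) // the UTF-8 bytes
	r := []rune(s) // the decoded code points

	// Both conversions round-trip for valid UTF-8 text.
	fmt.Println(string(b) == s, string(r) == s)
	// -> true true
}
```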
Let's see that encoding in action by printing the []byte representation of our JSON message. Note that json.Marshal returns the encoded JSON as a []byte, so we just need to print the returned variable instead of converting it to a string.
fmt.Println(jsonMsg)
// -> [123 34 109 101 115 115 97 103 101 34 58 34 83 117 99 99 101 115 115 32 240 159 152 128 34 44 34 111 107 34 58 116 114 117 101 125]
Well, we just got a bunch of numbers. Where's the 😀? Let's use the IndexRune function in the "bytes" package to find the index of the 😀 glyph in the byte slice.
idx := bytes.IndexRune(jsonMsg, '😀')
fmt.Println(idx)
// -> 20
Great, now we know where it is! Let's make sure reading the byte at index idx gives us the 😀 we expected.
fmt.Println(string(jsonMsg[idx]))
// -> ð
Something's not right: we got ð, not a smiley face. Converting a single byte to a string treats the byte's value (240) as a code point, and U+00F0 is ð. Remember earlier when we talked about how some Unicode code points need more than one byte when encoded as UTF-8? Let's try reading a few different-sized byte slices and see what we get.
fmt.Println(string(jsonMsg[idx : idx+1]))
// -> �
Hmm, that's not a smiley face. We got �, the Unicode replacement character, which means that single byte isn't a valid UTF-8 sequence on its own. Let's try reading more bytes.
fmt.Println(string(jsonMsg[idx : idx+2]))
// -> �
Nope.
fmt.Println(string(jsonMsg[idx : idx+3]))
// -> �
Not yet.
fmt.Println(string(jsonMsg[idx : idx+4]))
// -> 😀
There, we got a smiley face by decoding 4 bytes! What happens if we try to decode 5 bytes?
fmt.Println(string(jsonMsg[idx : idx+5]))
// -> 😀"
OK, now we're getting extra code points. That makes sense because a UTF-8-encoded code point is at most 4 bytes long, so the fifth byte belongs to the next character (the closing quote). Going back to the original question, let's look at the UTF-8 bytes that represent a 😀.
fmt.Println(jsonMsg[idx : idx+4])
// -> [240 159 152 128]
We can confirm that the value we found is indeed a 😀 by building the byte slice literal and decoding it as a UTF-8 string.
fmt.Println(string([]byte{240, 159, 152, 128}))
// -> 😀
We just successfully built a valid UTF-8 string from individual byte values!
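Instead of guessing slice widths, we can let the "unicode/utf8" package tell us how long an encoded code point is. The following sketch is one way to do that, not the only one: it uses utf8.Valid to confirm that only the full 4-byte slice is valid UTF-8 and utf8.DecodeRune to decode the code point along with its size. The runeAt helper is a hypothetical name, and the message literal is assumed to match the JSON we marshaled at the start.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// runeAt decodes the UTF-8 code point that starts at byte offset i in b
// and returns it along with the number of bytes it occupies.
// (Hypothetical helper; it wraps utf8.DecodeRune.)
func runeAt(b []byte, i int) (rune, int) {
	return utf8.DecodeRune(b[i:])
}

func main() {
	msg := []byte(`{"message":"Success 😀","ok":true}`)
	idx := bytes.IndexRune(msg, '😀')

	// Only the full 4-byte slice is valid UTF-8.
	for n := 1; n <= 4; n++ {
		fmt.Println(n, utf8.Valid(msg[idx:idx+n]))
	}
	// -> 1 false
	// -> 2 false
	// -> 3 false
	// -> 4 true

	r, size := runeAt(msg, idx)
	fmt.Printf("%c %U %d\n", r, r, size)
	// -> 😀 U+1F600 4
}
```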
In the Go runtime, strings are UTF-8 encoded byte arrays, but what about the actual ".go" file? Let's try reading the "main.go" file we're writing and see how the 😀 glyph is encoded.
main, _ := os.ReadFile("main.go") // os.ReadFile replaces the deprecated ioutil.ReadFile (Go 1.16+)
idx := bytes.IndexRune(main, '😀')
fmt.Println(string(main[idx : idx+4]))
// -> 😀
fmt.Println(main[idx : idx+4])
// -> [240 159 152 128]
It's the same as the Go string encoding! In fact, the Go compiler expects ".go" files to be UTF-8-encoded. The consistency of UTF-8 text encoding across different software definitely makes handling 😀 in Go easier.
Unicode in JSON
Great, we understand how to encode and decode our 😀 glyph in Go source code and Go strings! But weren't we talking about JSON messages? That's right, we marshaled a JSON message with a 😀 and were trying to figure out how it worked. The short answer is that the JSON Character Encoding specification says that JSON can contain any UTF-8-encoded text, including 😀. If that's true, we should be able to unmarshal a JSON message declared as a Go string literal and converted to a []byte. Let's try.
var msg map[string]interface{}
json.Unmarshal([]byte(`{"message":"Success 😀","ok":true}`), &msg)
fmt.Println(msg)
// -> map[message:Success 😀 ok:true]
Because a Go string is UTF-8-encoded text, converting one to a []byte and using the "encoding/json" package to unmarshal it works as expected.
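The snippets in this post ignore the error returned by json.Unmarshal to stay short. As a sketch of what a more defensive version might look like, the example below decodes the same message into a typed struct and checks the error. The response type and decode helper are hypothetical names introduced only for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// response mirrors the shape of the message used in this post.
// (Hypothetical type name.)
type response struct {
	Message string `json:"message"`
	OK      bool   `json:"ok"`
}

// decode unmarshals raw JSON text into a response and reports any error.
func decode(raw string) (response, error) {
	var r response
	err := json.Unmarshal([]byte(raw), &r)
	return r, err
}

func main() {
	r, err := decode(`{"message":"Success 😀","ok":true}`)
	if err != nil {
		fmt.Println("unmarshal failed:", err)
		return
	}
	fmt.Println(r.Message, r.OK)
	// -> Success 😀 true
}
```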
Weird Unicode in JSON
We've mostly answered our original question, but let's consider a world where some legacy software doesn't handle JSON messages with multi-byte UTF-8 code points correctly. What if we really, really need to limit our JSON messages to only ASCII values, but we still want to send the 😀?
JSON actually supports encoding Unicode code points as "escape sequences" in strings, like "\u2603", which is the Unicode escape sequence for the snowman emoji (☃). How do we get the Unicode escape sequence for 😀?
The Go "encoding/json" library doesn't support marshaling JSON with only ASCII values (there's an interesting discussion about that here), so let's try using the QuoteToASCII function from the "strconv" package to get a Unicode escape sequence.
fmt.Println(strconv.QuoteToASCII("😀"))
// -> "\U0001f600"
Unicode escape sequences in JSON must start with "\u" (lowercase), so let's lowercase that "\U" and try to unmarshal our message, replacing the literal 😀 with the Unicode escape sequence.
var msg map[string]interface{}
json.Unmarshal([]byte(`{"message":"Success \u0001f600","ok":true}`), &msg)
fmt.Println(msg)
// -> map[message:Success f600 ok:true]
Spoiler
Unicode escape sequences output by QuoteToASCII are always valid in Go string literals, but are not always valid in JSON strings. Go string literals support UTF-32 escape sequences that start with an uppercase "\U" (e.g. "\U0001f600"), but JSON does not.
Oops, that didn't work. What happened to our Unicode escape sequence and our 😀? It turns out that JSON Unicode escape sequences must start with "\u" followed by exactly 4 hexadecimal digits. The JSON unmarshaler reads the first 4 hex digits ("0001"), decodes them as the Unicode "start of heading" code point (a non-printing control character), then reads the following characters "f600" as literal text. In fact, we can't encode 😀 as a single JSON Unicode escape sequence because its code point value (U+1F600) requires 17 bits, which is more than the 16 bits we can write as 4 hex digits. The Unicode spec helpfully includes a quirky feature called surrogate pairs that lets us encode Unicode code points larger than 16 bits as a pair of 16-bit UTF-16 values. We can use the EncodeRune function from the "unicode/utf16" package to get those two UTF-16 runes, then use strconv.QuoteToASCII again to get the corresponding Unicode escape sequence.
high, low := utf16.EncodeRune('😀')
pairStr := string([]rune{high, low})
fmt.Println(strconv.QuoteToASCII(pairStr))
// -> "\ufffd\ufffd"
That seems like it worked, but if we take a closer look we see that both Unicode escape sequences are "\ufffd", which is the escape sequence for the Unicode replacement character �. We've run into one of the quirks of Unicode surrogate pairs, which is that each "surrogate half" is not individually a valid Unicode code point. As a result, when we convert our []rune into a string, Go treats each surrogate half as an invalid Unicode code point and replaces it with the Unicode replacement character. If we try to run just the []rune to string conversion code, we see the same result.
fmt.Println(string([]rune{high, low}))
// -> ��
Alright, no string this time. Let's use the Unicode format verb from the "fmt" package to get the raw Unicode code point for each surrogate half.
fmt.Printf("%U %U\n", high, low)
// -> U+D83D U+DE00
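If you're curious where those two values come from, the arithmetic is short: subtract 0x10000 from the code point, which leaves a 20-bit value, then add the top 10 bits to 0xD800 and the bottom 10 bits to 0xDC00. Here's a sketch of that calculation, cross-checked against the standard library. The surrogates helper is a hypothetical name, and it assumes its input is a code point above U+FFFF.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// surrogates splits a code point above U+FFFF into its UTF-16 surrogate
// pair by hand. (Hypothetical helper; utf16.EncodeRune does the same job
// and also handles out-of-range input, which this sketch does not.)
func surrogates(r rune) (high, low rune) {
	v := r - 0x10000           // leaves a 20-bit value
	high = 0xD800 + (v >> 10)  // top 10 bits
	low = 0xDC00 + (v & 0x3FF) // bottom 10 bits
	return high, low
}

func main() {
	high, low := surrogates('😀')
	fmt.Printf("%U %U\n", high, low)
	// -> U+D83D U+DE00

	// Cross-check against the standard library, in both directions.
	h, l := utf16.EncodeRune('😀')
	fmt.Println(h == high, l == low)
	// -> true true
	fmt.Printf("%U\n", utf16.DecodeRune(high, low))
	// -> U+1F600
}
```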
Great, let's write those as JSON Unicode escape sequences and try it again.
var msg map[string]interface{}
json.Unmarshal([]byte(`{"message":"Success \uD83D\uDE00","ok":true}`), &msg)
fmt.Println(msg)
// -> map[message:Success 😀 ok:true]
It worked! To be clear, Unicode surrogate pairs are confusing and not necessary in the vast majority of cases, but they can help us understand how different Unicode encodings work.
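Putting the pieces together, here's a rough sketch of how you might produce ASCII-only JSON yourself, since "encoding/json" doesn't offer that option. The escapeNonASCII helper is hypothetical and is meant to run over already-marshaled JSON, where non-ASCII characters can only appear inside string values. It writes one "\u" escape for code points up to U+FFFF and a surrogate pair for anything larger.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"unicode/utf16"
)

// escapeNonASCII rewrites every non-ASCII code point in s as one or two
// JSON \uXXXX escapes. (Hypothetical helper, intended for text that is
// already valid JSON.)
func escapeNonASCII(s string) string {
	var sb strings.Builder
	for _, r := range s {
		switch {
		case r < 0x80:
			sb.WriteRune(r) // ASCII passes through unchanged
		case r > 0xFFFF:
			// Too big for 4 hex digits, so write a surrogate pair.
			high, low := utf16.EncodeRune(r)
			fmt.Fprintf(&sb, `\u%04x\u%04x`, high, low)
		default:
			fmt.Fprintf(&sb, `\u%04x`, r)
		}
	}
	return sb.String()
}

func main() {
	raw, _ := json.Marshal(map[string]interface{}{
		"ok":      true,
		"message": "Success 😀",
	})
	ascii := escapeNonASCII(string(raw))
	fmt.Println(ascii)
	// -> {"message":"Success \ud83d\ude00","ok":true}

	// The ASCII-only message decodes back to the original text.
	var msg map[string]interface{}
	if err := json.Unmarshal([]byte(ascii), &msg); err != nil {
		fmt.Println("unmarshal failed:", err)
		return
	}
	fmt.Println(msg["message"])
	// -> Success 😀
}
```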
Wrapping Up
If you write Go code, you use Unicode and UTF-8 all the time. Go's strong UTF-8 support probably isn't random, as Rob Pike and Ken Thompson, two of the original authors of Go, wrote the original implementation of UTF-8 in the Plan 9 operating system in 1992. Today, most common software handles UTF-8-encoded text properly, but when something doesn't work, understanding the layers of text encoding can be incredibly helpful to pinpoint the problem.
Additional Reading
- https://www.christianfscott.com/rust-chars-vs-go-runes/
- https://dmitripavlutin.com/what-every-javascript-developer-should-know-about-unicode/
Cover photo background by Nick Fewings on Unsplash