DEV Community

Quack Quack


Implement base64 encoding using Rust - [Part 2] Handle unicode characters

Introduction

In my previous post, I implemented base64 encoding, but it only works for ASCII characters. In this post, we will enhance our base64_encode function so that it supports the whole planet =)).

What did we do wrong?

Let's have a look at RFC 4648 and find out what we've been missing the whole time.

The encoding process represents 24-bit groups of input bits as output
strings of 4 encoded characters. Proceeding from left to right, a
24-bit input group is formed by concatenating 3 8-bit input groups.
These 24 bits are then treated as 4 concatenated 6-bit groups, each
of which is translated into a single character in the base 64
alphabet.
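
To make the quote concrete, here is a minimal sketch of encoding one full 24-bit group. The helper name encode_group and the ALPHABET constant are my own for illustration and are not taken from the previous post; the alphabet itself is the standard one from RFC 4648.

```rust
/// The standard base64 alphabet from RFC 4648 (64 symbols).
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Hypothetical helper: encodes exactly one full 24-bit group (3 bytes).
fn encode_group(group: [u8; 3]) -> String {
    // Concatenate the three 8-bit input groups into one 24-bit value.
    let n = (u32::from(group[0]) << 16) | (u32::from(group[1]) << 8) | u32::from(group[2]);
    // Read the 24 bits back as four 6-bit groups, from left to right,
    // and translate each one into a character of the alphabet.
    [18u32, 12, 6, 0]
        .iter()
        .map(|&shift| ALPHABET[((n >> shift) & 0x3F) as usize] as char)
        .collect()
}
```

For the bytes of "Man" (4D 61 6E), the four 6-bit groups are 19, 22, 5 and 46, so encode_group(*b"Man") returns "TWFu".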

And here is the loop from our code in the previous post:

    for char in input.chars() {
        ...
    }

See the phrase "3 8-bit input groups" in the quote? That's where we went wrong. Our loop treats each char as one input group, but nothing guarantees that a character fits in 1 byte. A Rust str is UTF-8 encoded, so only ASCII characters take a single byte; other Unicode characters take 2 to 4 bytes. For example, 'pile of poo' 💩 is represented using 4 bytes (F0 9F 92 A9).
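
A quick way to see the difference is to count chars and bytes side by side. This is only a sketch; the helper names char_and_byte_counts and utf8_bytes are made up for this example.

```rust
/// Hypothetical helper: returns (number of chars, number of UTF-8 bytes).
fn char_and_byte_counts(input: &str) -> (usize, usize) {
    // `chars()` yields Unicode scalar values; `len()` is the length in bytes.
    (input.chars().count(), input.len())
}

/// Hypothetical helper: collects the UTF-8 bytes of a string.
fn utf8_bytes(input: &str) -> Vec<u8> {
    input.bytes().collect()
}
```

For "a" both counts are 1, but for "💩" we get 1 char and 4 bytes, and utf8_bytes("💩") gives exactly the F0 9F 92 A9 mentioned above.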

What is the fix?

The fix is easy: we just need to loop over each byte in the str instead of each char. Lucky for us, Rust has a built-in method, bytes, which creates an Iterator over the bytes of the str. The good thing about Iterators is that they are lazy and do nothing until they are consumed, so we can still keep our memory usage minimal.

    for byte in input.bytes() {

    }
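
For readers who do not have the code from the previous post at hand, here is a minimal sketch of a complete byte-based encoder. It is not the implementation from Part 1: the ALPHABET constant and the way the loop pulls up to 3 bytes per step from the iterator are my assumptions about one reasonable way to write it.

```rust
/// The standard base64 alphabet from RFC 4648 (64 symbols).
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Sketch of a base64 encoder that works on bytes rather than chars.
fn base64_encode(input: &str) -> String {
    // Every 3 input bytes (rounded up) produce 4 output characters.
    let mut output = String::with_capacity((input.len() + 2) / 3 * 4);
    let mut bytes = input.bytes();

    // Pull up to 3 bytes per iteration from the lazy iterator.
    while let Some(first) = bytes.next() {
        let second = bytes.next();
        let third = bytes.next();

        // Build the 24-bit group; missing bytes are treated as zero bits.
        let n = (u32::from(first) << 16)
            | (u32::from(second.unwrap_or(0)) << 8)
            | u32::from(third.unwrap_or(0));

        // The first two characters always carry data from the first byte.
        output.push(ALPHABET[((n >> 18) & 0x3F) as usize] as char);
        output.push(ALPHABET[((n >> 12) & 0x3F) as usize] as char);
        // The last two characters become '=' padding when bytes are missing.
        output.push(match second {
            Some(_) => ALPHABET[((n >> 6) & 0x3F) as usize] as char,
            None => '=',
        });
        output.push(match third {
            Some(_) => ALPHABET[(n & 0x3F) as usize] as char,
            None => '=',
        });
    }
    output
}
```

Because the loop only ever sees bytes, it does not care whether they came from ASCII text or from a 4-byte emoji.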

Add more tests

To check the fix, we add a test that encodes a pile of poo 💩 and compares the result with the expected output.

    assert_eq!("8J+SqQ==", base64_encode("💩"));
