This post is a quick overview of how I turned an initially poor solution to a problem into a better one, and the thought process behind it.
The problem
I'm working on pulling information out of the Certificate Transparency logs. The logs are made up of a JSON list of log entries. A single entry is a JSON object with two fields, leaf_input and extra_data. The leaf_input field contains base64-encoded binary data. I want to pull information out of this binary data.
To outline the structure of the binary data I'll use the following notation, indicating byte numbers:
- [n] - indicates byte n, so [0] is the first byte.
- [n..=m] - indicates a range of bytes, inclusive, so [2..=4] means bytes 2, 3 and 4.
- [n..] - indicates the remaining bytes in the binary data after and including index n.
[0]   [1]   [2..=9]   [10..=11]   [12..=14]   [15..]
 |     |       |          |           |         |
 |     |       |          |           |         - (rest) the rest of the bytes
 |     |       |          |           |
 |     |       |          |           - (length) the length of the certificate
 |     |       |          |
 |     |       |          - the log entry type, not relevant in this post
 |     |       |
 |     |       - timestamp, not relevant in this post
 |     |
 |     - signature type, not relevant in this post
 |
 - version, not relevant in this post
The relevant parts here are the length (bytes [12..=14]) and the rest (indicated by [15..] above). rest contains an X.509 certificate we want, and the three bytes of length encode an integer telling us how many bytes of rest that certificate takes up. Once we have read length bytes from rest, the remainder of rest is data we do not need.
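Put differently, once we know the length, extracting the certificate is just a slice. Here is a minimal sketch (get_cert_der is a made-up helper name; bytes is the decoded leaf_input and get_cert_length is the function defined in the next section):
// Sketch only: slice the DER-encoded certificate out of the decoded leaf_input.
fn get_cert_der(bytes: &[u8]) -> &[u8] {
    let length = get_cert_length(bytes) as usize;
    &bytes[15..15 + length]
}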
A solution
So we need to turn the three bytes which make up length into an integer. An integer whose size is 3 bytes would be a 24-bit int, and Rust does not support 24-bit ints. The closest is a u32, a 32-bit unsigned int. You can't just turn 3 bytes into a u32, however. The first solution I came up with was something like the following code.
// "bytes" is the full binary data inside the "leaf_input" field
fn get_cert_length(bytes: &[u8]) -> u32 {
let length_bytes_slice = &bytes[12..=14];
let mut length_bytes = b.iter().cloned().collect::<Vec<u8>>();
length_bytes.reverse();
length_bytes.push(0);
return u32::from_le_bytes(length_bytes.as_slice().try_into().unwrap());
}
This is not great! We allocate a new Vec and we have to pad it with a zero to make it 4 bytes in length. Because we cannot push a value onto the start of a Vec, we first reverse it. Then we convert it into an array of type [u8; 4] - this is what length_bytes.as_slice().try_into().unwrap() does. Finally, we read the bytes in little-endian format as a u32. Way too much happening here for what on the surface seems like a simple conversion!
I thought about it some more and came up with a better solution.
A better solution
First, here is an outline of my thought process. Let's say bytes 12 to 14 look as follows, in hex notation:
[00, 05, 4c]
Then the value for the length is 00054c in hex, which is 1356 in decimal. 00054c is 000000 + 000500 + 00004c. If we take the bytes 00, 05 and 4c individually, how do we get 00054c?
Well, each time we move a digit in the hex string one place to the left, we multiply its value by 16, so moving it n places multiplies it by 16 to the power of n. For example, moving 000005, which is 5 in decimal, to 000050, which is 80 in decimal, means multiplying 5 by 16. If we move it two places, we need to multiply it by 256, which is 16 squared, so 000005 to 000500 (1280 in decimal) is 5 * 256. Concretely, to convert 000500 hex to decimal:
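000500 = 5 * 256 = 1280
Applying the same idea to each byte (a byte is two hex digits, so each byte further to the left is worth another factor of 256):
00054c = (00 * 65536) + (05 * 256) + (4c * 1)
       = (0 * 65536) + (5 * 256) + (76 * 1)
       = 0 + 1280 + 76
       = 1356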
With this knowledge we can update get_cert_length as follows:
fn get_cert_length(bytes: &[u8]) -> u32 {
(bytes[12] as u32 * 65536) + (bytes[13] as u32 * 256) + bytes[14] as u32
}
Much cleaner than the original version! The values we multiply each byte by are always powers of 2, so we can also use bit-shifts instead of the multiplications:
fn get_cert_length(bytes: &[u8]) -> u32 {
((bytes[12] as u32) << 16) + ((bytes[13] as u32) << 8) + bytes[14] as u32
}
I ended up using the bit-shift version.
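As a quick sanity check with made-up input (12 placeholder bytes followed by the length bytes from the example above; not a real log entry), both versions return 1356:
fn main() {
    // 12 bytes we do not care about for the length, then the length bytes [00, 05, 4c].
    let mut bytes = vec![0u8; 12];
    bytes.extend_from_slice(&[0x00, 0x05, 0x4c]);
    assert_eq!(get_cert_length(&bytes), 1356);
}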
Top comments (7)
There was recently a post in URLO about this, Custom u24 type. Essentially, another option was to do the following:
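// Roughly what was suggested (a reconstruction, not the original snippet):
// pad to four bytes by hand and use u32::from_be_bytes.
fn get_cert_length(bytes: &[u8]) -> u32 {
    u32::from_be_bytes([0, bytes[12], bytes[13], bytes[14]])
}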
Or if you don't want to type three different indices:
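// Same idea with copy_from_slice instead of spelling out each index
// (again a reconstruction, not the commenter's exact code).
fn get_cert_length(bytes: &[u8]) -> u32 {
    let mut buf = [0u8; 4];
    buf[1..].copy_from_slice(&bytes[12..=14]);
    u32::from_be_bytes(buf)
}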
Wow! That's even better, didn't realize you could do that with copy_from_slice, thanks. Not sure why I didn't even think of the first one 😄
After searching around in URLO, I also found this neat post on The byte order fallacy which is mildly relevant. By using from_le_bytes or from_be_bytes, we can see the byte order in a concise way and it would be trivial to change too (as opposed to bit-shifts, which require a tiny bit more work and thought).
The compiler is able to optimize both forms of get_cert_length to the same code. Check this playground.
Yip that makes sense. The shift vs multiply versions were more about which one looks less magical 😄
Just FYI, the compiler will try to optimize operations like multiplications into bit shifts if it can (godbolt.org/z/9Pgztq), so type out what's clearest first! (In this case, it's probably the bit shifts anyway).
Agreed, that's why I went with the bit shifts. Shifting by 16 is a little less magical than multiplying by 65536 😄 Neither is great in this case though, and I went with what @lonami mentioned.
I need to drill deeper into the standard lib apis for sure!