This is a cross-post from my personal blog.
I built a web application with file upload functionality: some Vue.js in the front and a CouchDB in the back. Everything should have been pretty simple and straightforward.
But…
When I uploaded image files, they somehow got mangled. The uploaded file was bigger than the original, and the new "file format" was not readable by any means. I got intrigued. What was happening to the files? The changes seemed random but were reproducible, so I created a few test files to see what exactly changed and when.
My first file looked like this:
0123456789
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
To my surprise, the file stayed the same! My curiosity grew. In the meantime I found a very intriguing pattern in the upload's hexdump: C3 BF C3. It was everywhere. In another file, I found similar patterns with C2. So I wrote my next test file, this time a binary one:
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 |................|
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |.... !"#$%&'()01|
32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 |23456789@ABCDEFG|
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 |HIPQRSTUVWXY`abc|
64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 |defghipqrstuvwxy|
80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 |................|
96 97 98 99 a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 aa ab |................|
ac ad ae af b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb |................|
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
EDIT: As you have probably already noticed, I counted up as if in base 10, but the values are actually base 16, so I skipped A-F until reaching A0. This might look weird, but it didn't affect the test.
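A test file like this can also be generated instead of typed by hand, which avoids skipped values. The following is only a sketch, not what I actually used: it assumes Python 3, and the names `make_test_bytes` and `hexdump` are made up for illustration.

```python
def make_test_bytes() -> bytes:
    """Return every byte value from 0x00 to 0xFF exactly once, in order."""
    return bytes(range(256))


def hexdump(data: bytes, width: int = 16) -> str:
    """Format data as rows of hex bytes followed by a printable-ASCII column."""
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in chunk)
        # Pad the hex column so a short last row still lines up.
        rows.append(f"{hex_part:<{width * 3 - 1}} |{text_part}|")
    return "\n".join(rows)


if __name__ == "__main__":
    print(hexdump(make_test_bytes()))
```

Writing the result of `make_test_bytes()` to a file in binary mode gives an upload candidate that covers all 256 byte values.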
The result after uploading was:
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 |................|
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |.... !"#$%&'()01|
32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 |23456789@ABCDEFG|
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 |HIPQRSTUVWXY`abc|
64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 |defghipqrstuvwxy|
c2 80 c2 81 c2 82 c2 83 c2 84 c2 85 c2 86 c2 87 |................|
c2 88 c2 89 c2 90 c2 91 c2 92 c2 93 c2 94 c2 95 |................|
c2 96 c2 97 c2 98 c2 99 c2 a0 c2 a1 c2 a2 c2 a3 |................|
c2 a4 c2 a5 c2 a6 c2 a7 c2 a8 c2 a9 c2 aa c2 ab |................|
c2 ac c2 ad c2 ae c2 af c2 b0 c2 b1 c2 b2 c2 b3 |................|
c2 b4 c2 b5 c2 b6 c2 b7 c2 b8 c2 b9 c2 ba c2 bb |................|
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
There it was again: The magic 0xC2!
So all bytes with a value of 0x80 or higher got prefixed with a 0xC2. 0x7F is the last possible value in original 7-bit ASCII, and there the scales fell from my eyes: UTF-8 encoding!
In UTF-8, all characters after 0x7F are at least two bytes long. They get prefixed with 0xC2 up to 0xC2BF (which is the inverted question mark ¿), which is then followed by 0xC380. So what happened is that on the way to the server, the file got encoded to UTF-8 ¯\_(ツ)_/¯
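The effect can be reproduced in a few lines. This is a sketch under one assumption that matches the hexdumps but that I have not verified in the actual upload path: each byte is first read as a single character with the same code point (Latin-1), and the resulting text is then encoded as UTF-8. The function names `mangle` and `unmangle` are made up for illustration, and `bytes.hex(" ")` needs Python 3.8 or newer.

```python
def mangle(data: bytes) -> bytes:
    """Treat each byte as a Latin-1 character, then encode the text as UTF-8."""
    return data.decode("latin-1").encode("utf-8")


def unmangle(data: bytes) -> bytes:
    """Reverse mangle(); raises an error if data was not produced that way."""
    return data.decode("utf-8").encode("latin-1")


# Bytes below 0x80 are plain ASCII and pass through unchanged.
ascii_only = b"0123456789ABCxyz"
assert mangle(ascii_only) == ascii_only

# Bytes 0x80..0xBF keep their value and gain a 0xC2 prefix.
print(mangle(bytes([0x7F, 0x80, 0xBB, 0xBF])).hex(" "))  # 7f c2 80 c2 bb c2 bf

# Bytes 0xC0..0xFF become 0xC3 followed by the value minus 0x40.
print(mangle(bytes([0xC0, 0xFF])).hex(" "))  # c3 80 c3 bf

# Every byte from 0x80 upwards costs one extra byte, so the file grows.
every_byte = bytes(range(256))
assert len(mangle(every_byte)) == 256 + 128
assert unmangle(mangle(every_byte)) == every_byte
```

Under the same assumption, a 0xFF byte followed by any byte from 0xC0 upwards would come out as C3 BF C3, which would fit the pattern in the image uploads: the FF D8 at the start of a JPEG file, for example, turns into C3 BF C3 98.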
EDIT: Corrected some mistakes after some comments on Hacker News.