This article was originally posted at HeavyDots Blog
The legend of the white space
Every now and then, on the World Wide Web, someone meets the "white space" or, as we call it, "the beast".
When we meet the beast, it drives us crazy, we get scared, we can’t understand what our eyes see, we cannot explain what’s going on, we wonder if something affected our perception. A supernatural phenomenon is in front of our eyes and overwhelms us.
- What is this INSANE space character??? (google chrome)
- Weird space character in string, that's not a space?
- Weird white space characters - utf8 PHP
- I have a strange space character in a String
We have also run into it twice! Once on a website and once even inside an Excel sheet! It’s unbelievable where this creature turns up, and how it got there is an even bigger mystery!
The brave who went after it
Some give up, run and hide from it. Or just accept it without questioning its existence. But others, the brave ones, start on a journey through the darkness and don’t give up until they know the truth.
This is what someone who has been there and survived confessed (with shaking hands):
To the people in the future like myself that had to debug this from a high level all the way down to the character codes, I salute you.
We've also been there, and we returned with the truth, ready to share it with you.
What the creature looks like and where it comes from
The strange space is actually the non-breaking space, the &nbsp; entity (Wikipedia), pretty well known by HTML coders, but in this case it is not represented/encoded in HTML.
So basically most of us knew the creature, but we didn’t know it could exist in another shape and dimension.
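Wikipedia explains the different representations of the beast, such as the Unicode code point U+00A0, the HTML entities &nbsp; and &#160;, and the UTF-8 byte pair 0xC2 0xA0 (194 and 160 in decimal), and also how the creature is born and mutates.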
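To make those representations concrete, here is a small PHP sketch that builds the same character three different ways and checks that they all end up as the same two bytes. The variable names are ours and purely illustrative; the only assumptions are PHP 7 or later (for the "\u{...}" escape) and UTF-8 as the working encoding.

```php
<?php
// Three ways to write the same character, U+00A0 NO-BREAK SPACE.
$from_bytes  = chr(194) . chr(160); // raw UTF-8 bytes 0xC2 0xA0
$from_escape = "\u{00A0}";          // Unicode escape, stored as UTF-8 (PHP 7+)
$from_entity = html_entity_decode('&nbsp;', ENT_QUOTES | ENT_HTML5, 'UTF-8');

var_dump($from_bytes === $from_escape); // same bytes
var_dump($from_bytes === $from_entity); // same bytes
echo bin2hex($from_bytes), "\n";        // c2a0

// It looks like a space, but it is not equal to one.
var_dump($from_bytes === ' ');
```

The last comparison is the whole problem in one line: the beast renders like an ordinary space (0x20), yet any string comparison, split or parse treats it as a different character.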
The tools and techniques to find it and get rid of it
For the even braver who want to go after it and make it disappear, we’ve got some tools and techniques to help locate it and still come back alive from the journey.
Web tool that looks for it in the code of a webpage:
Here’s a tool with a web interface where you can enter a URL and it searches the contents of that page in order to find the beast:
http://tools.heavydots.com/nbsp-space-char-detect/
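For the curious, the general idea of such a scan can be sketched in a few lines of PHP. This is only an illustration of the approach, not the code behind the tool above: the function name lines_with_white_beast and the sample markup are made up for this example, and it only looks for the raw UTF-8 bytes, not for &nbsp; entities.

```php
<?php
// Hypothetical helper, not the code behind the tool above.
// Returns [line number => count] for every line of $html that
// contains at least one raw UTF-8 non-breaking space.
function lines_with_white_beast(string $html): array
{
    $white_beast = chr(194) . chr(160);
    $hits = [];
    foreach (preg_split('/\r\n|\r|\n/', $html) as $i => $line) {
        $count = substr_count($line, $white_beast);
        if ($count > 0) {
            $hits[$i + 1] = $count; // report 1-based line numbers
        }
    }
    return $hits;
}

// Made-up sample page: the second line hides one beast.
$page = "<p>clean line</p>\n<p>haunted" . chr(194) . chr(160) . "line</p>";
print_r(lines_with_white_beast($page));

// For a live page you could pass in file_get_contents($url) instead,
// assuming allow_url_fopen is enabled and the page is served as UTF-8.
```

Reporting line numbers rather than just a total makes it easier to jump to the haunted spot in the page source.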
Manual cleanup technique:
But if you find yourself in the dark code dimension and cannot use the web, and you have to get very close to the beast in order to kill it, here is a way to exorcise your demonized code:
Only for those who are skilled with PHP spells: search for chr(194).chr(160), the two UTF-8 bytes (0xC2 0xA0) of the non-breaking space, and replace it with an ordinary space. This will drive the demon out and restore the text's clean white space soul.
Take this scroll with you, it contains the spell you will need when you face the beast:
// Define the white beast
$white_beast=chr(194).chr(160);
// Count how many of them are living in your text
$count=substr_count($string, $white_beast);
print_r($count);
// Replace it with a normal space
$string=str_replace($white_beast, ' ', $string);
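If you also want to know where the beast is hiding, or it sits at the edges of a value where you would normally rely on trim(), the spell can be extended a little. The sketch below is one possible way to do it; the helper names find_white_beasts and exorcise and the sample date are invented for illustration. Note that PHP's trim() with its default character list does not remove these two bytes, so the replacement has to happen first.

```php
<?php
// Hypothetical helper: byte offsets of every UTF-8 non-breaking space.
function find_white_beasts(string $text): array
{
    $white_beast = chr(194) . chr(160);
    $offsets = [];
    $pos = 0;
    while (($pos = strpos($text, $white_beast, $pos)) !== false) {
        $offsets[] = $pos;
        $pos += strlen($white_beast); // continue after this match
    }
    return $offsets;
}

// Hypothetical helper: swap each beast for a plain space, then trim.
function exorcise(string $text): string
{
    return trim(str_replace(chr(194) . chr(160), ' ', $text));
}

// Made-up example: a date cell with a beast on both sides.
$cell = chr(194) . chr(160) . '2017-03-01' . chr(194) . chr(160);

print_r(find_white_beasts($cell)); // byte offsets 0 and 12
var_dump(trim($cell) === $cell);   // true: plain trim() leaves it untouched
var_dump(exorcise($cell));         // the bare date, 10 bytes long
```

The offsets are byte positions, not character positions, because each beast occupies two bytes in UTF-8.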
THE END of the story
So dear reader of this legend, if you never faced the beast, beware!
But if you have also had to deal with it, please share with us your story in the comments section!
I'll try to post new stuff here, and I also invite you to drop by the HeavyDots Blog.
Top comments (4)
Lol. Excel has gotten a coworker of mine with this too. He emailed the offending code to me and I couldn't find an issue... because the email program converted it to a normal space during copy/paste.
Yeah, that's the scary thing about it, that it can disappear! :D I found it in Excel in a date field that would not get converted when importing the Excel file into a custom PHP app. The date's format looked right.. but on both the left and right side it had this strange char.. who knows when it got there!
What problem is the non-breaking-space creating?
Surely the myriad of other Unicode spacing characters would also create similar issues?
There are dozens and dozens of Unicode characters that show up as blank space. You might think that \s in a regex would find all of them, since it matches characters with the "separator, space" unicode property. But not all blank characters have the "separator, space" property, including (with the characters between parentheses):
U+3164 HANGUL FILLER (ㅤ)
U+1D173 MUSICAL SYMBOL BEGIN BEAM () (there are 7 other similar musical symbols)
U+200D ZERO WIDTH JOINER ()
U+180E MONGOLIAN VOWEL SEPARATOR () (only shows up as blank in some fonts)
There's even one character (U+1680 OGHAM SPACE MARK) that has the "separator, space" property and doesn't display as whitespace. Hilariously enough, you can use this character as whitespace in JavaScript.
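A quick way to check claims like these in PHP is to test each suspect against the Unicode "separator, space" category (Zs) directly. This is only a sketch: the helper name is_space_separator is made up, and the answers depend on the Unicode tables shipped with your PCRE build, so only characters whose category has been stable are listed.

```php
<?php
// Hypothetical helper: does this single character carry the
// "separator, space" (Zs) general category?
function is_space_separator(string $char): bool
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    return preg_match('/^\p{Zs}$/u', $char) === 1;
}

$suspects = [
    'U+0020 SPACE'             => "\u{0020}",
    'U+00A0 NO-BREAK SPACE'    => "\u{00A0}",
    'U+1680 OGHAM SPACE MARK'  => "\u{1680}",
    'U+3164 HANGUL FILLER'     => "\u{3164}",
    'U+200D ZERO WIDTH JOINER' => "\u{200D}",
];

foreach ($suspects as $name => $char) {
    printf("%-26s %s\n", $name, is_space_separator($char) ? 'Zs' : 'not Zs');
}
```

The first three come back as Zs, while HANGUL FILLER and ZERO WIDTH JOINER do not, even though they render as blank. A cleanup that only targets "space separators" will therefore walk right past them.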