So I got nerd sniped by my buddy Snoopy the other daaaaay...
He's studying CS in Europe and is writing a program for an assignment where he has to input some characters from the command line on Windows and process them. The relevant part of the program is pretty simple. It's like this:
Scanner sc = new Scanner(System.in);
String input = sc.next();
// Print each char (UTF-16 code unit) of the token as hex, one per line.
for (int i = 0; i < input.length(); i++) {
    System.out.print(String.format("%02x", (int) input.charAt(i)));
    System.out.println();
}
So he runs it and enters a non-ASCII character: š (that's U+0161). The output he gave me is this:
>java PrintBytes
š
00
Now that's weird. I am pretty sure this is not a null character. I expected to see either its Unicode code point (U+0161) or its UTF-8 bytes (c5 a1). This was about the time I felt the uncontrollable urge to get involved.
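For reference, here is a small sketch of the two representations I would have expected for š. The class and method names (ExpectedBytes, codeUnitsHex, utf8Hex) are made up for illustration and are not part of Snoopy's assignment.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class ExpectedBytes {
    // Hex of each UTF-16 code unit in the string, space separated.
    static String codeUnitsHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    // Hex of each byte of the string's UTF-8 encoding, space separated.
    static String utf8Hex(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02x", bytes[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u0161"; // š, written as an escape so the source file's encoding doesn't matter
        System.out.println(codeUnitsHex(s)); // 0161
        System.out.println(utf8Hex(s));      // c5 a1
    }
}
```

Either of those would have made sense as output. A lone 00 matches neither.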
Default Codepage Issues
I downloaded the JDK and tried it on my machine.
>java PrintBytes
š
73
Well, that's weird too: 73 is a plain ASCII s. Oh, my system codepage is set to a Windows codepage, whereas his was set to UTF-8. I used chcp to change it to 65001, which is UTF-8, and got the same odd zero result.
Redirected input from a file
Next test: what if I read the same input from a file instead?
>java PrintBytes < input.txt
c5
a1
Hey, that's correct. That's the UTF-8 representation of it. So something is weird with how Java is reading from an interactive command line compared to file input, even when both come through stdin.
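One detail worth a sketch: the program prints one line per char, and the redirected run printed two lines. That would be consistent with Java decoding each of the two bytes as its own character, which is what a single-byte default charset would do. This is an assumption on my part, since I don't know which charset the JVM actually picked. The sketch below uses ISO-8859-1 as a stand-in (it maps c5 and a1 to the same code points as windows-1252), and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.util.Scanner;

// Hypothetical demo class: feeds the same two bytes to Scanner under different charsets.
class DecodeDemo {
    // Reads one token from the bytes using the named charset and returns
    // the hex of each resulting char, space separated (same %02x format as the post).
    static String hexOfChars(byte[] input, String charsetName) {
        Scanner sc = new Scanner(new ByteArrayInputStream(input), charsetName);
        String token = sc.next();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02x", (int) token.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xc5, (byte) 0xa1}; // the UTF-8 bytes of š
        // Single-byte charset: two chars, one per byte.
        System.out.println(hexOfChars(raw, "ISO-8859-1")); // c5 a1
        // UTF-8: the two bytes decode to the single char U+0161.
        System.out.println(hexOfChars(raw, "UTF-8"));      // 161
    }
}
```

If that assumption holds, passing an explicit charset to the Scanner constructor would give a single 161 for redirected input. It would not by itself explain the 00 from the interactive console.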
How does Rust do it?
Next test, let's see how it does in Rust.
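use std::io::Read;

fn main() {
    for b in std::io::stdin().bytes() {
        let val = b.unwrap();
        match val {
            // Carriage return: end the current line of output.
            0xd => println!(),
            // Line feed: ignore, the CR already handled the line break.
            0xa => (),
            // Width 4 covers the "0x" prefix plus two hex digits.
            _ => println!("{:#04x}", val),
        }
    }
}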
The output is good:
>target\debug\printbytes.exe
š
0xc5
0xa1
So Rust is doing it right interactively. The Rust code actually checks whether stdin is currently a console handle and, if so, calls ReadConsoleW; otherwise it calls ReadFile, which handles regular file I/O just fine.
Snoopy also tried writing the equivalent program in Python, and it also did it right. So Java seems to be doing something wrong under certain conditions... but what's the reason?
Finding the answer
A good starting point might be to check the Rust source. My first guess was that somewhere I'd see a call to ReadFile on the stdin handle, but instead I see the lowest level Windows call it makes is to a function I'm not familiar with, ReadConsoleW.
Reading the docs, it references something about ANSI compatibility:
ReadConsole reads keyboard input from a console's input buffer. It behaves like the ReadFile function, except that it can read in either Unicode (wide-character) or ANSI mode.
I found another link that gives a good comparison between ReadFile and ReadConsole. It confirms that ReadConsoleA (the ANSI version) only reads ANSI characters, but ReadConsoleW can read Unicode characters. Rust is reading Unicode characters (hopefully UTF-16 but I'm not really sure), then translating them internally into UTF-8, since its string type is natively UTF-8.
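To make that translation step concrete, here is a rough sketch of going from UTF-16 code units (as raw little-endian bytes, which is how wide characters sit in memory on Windows) to UTF-8 bytes. It is written in Java to match the original program and is not Rust's actual implementation. The names Utf16ToUtf8, transcode and hex are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: UTF-16LE bytes in, UTF-8 bytes out.
class Utf16ToUtf8 {
    // Decode the input as UTF-16LE, then re-encode the resulting string as UTF-8.
    static byte[] transcode(byte[] utf16le) {
        return new String(utf16le, StandardCharsets.UTF_16LE)
                .getBytes(StandardCharsets.UTF_8);
    }

    // Space-separated two-digit hex of each byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02x", bytes[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The code unit 0x0161 stored little-endian: low byte first.
        byte[] wide = {0x61, 0x01};
        System.out.println(hex(transcode(wide))); // c5 a1
    }
}
```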
Confirming with C++
The easiest way to confirm was to write a little C++ program, going straight to the source. Depending on its arguments, it tries either ReadFile or ReadConsoleW:
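uint16_t c = 0;      // zero-initialized, since ReadFile below fills only the low byte
DWORD numRead = 0;
HANDLE in = GetStdHandle(STD_INPUT_HANDLE);
if (argc == 1) {
    // No arguments: read one raw byte through the generic file API.
    // lpNumberOfBytesRead has to be non-null when lpOverlapped is null.
    ReadFile(in, &c, 1, &numRead, nullptr);
} else {
    // Any argument (e.g. -c): read one UTF-16 code unit from the console.
    ReadConsoleW(in, &c, 1, &numRead, nullptr);
}
printf("%04x\n", c);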
First, here's ReadFile mode:
>printbytes_c.exe
š
0000
And then ReadConsoleW mode:
>printbytes_c.exe -c
š
0161
0x0161 is the UTF-16 code unit for U+0161, so ReadConsoleW is clearly handing back Unicode. Interesting to note that ReadConsoleA showed the same behavior as ReadFile.
Conclusion
The behavior is a little unfortunate in Windows, but it seems to be fairly well documented. Most languages seem to be doing a proper job of handling this, but Java isn't. We can even see it in the debugger. I don't have proper symbols, but at least the top of the stack seems to resolve pretty clearly.
0:004> k
# Child-SP RetAddr Call Site
00 00000016`03ffce28 00007fff`7157c7f4 KERNEL32!ReadFile
01 00000016`03ffce30 00007fff`7157bd76 java!handleRead+0x20
02 00000016`03ffce70 00007fff`71572641 java!JNI_OnLoad+0x196
03 00000016`03ffef00 00000171`9146a02e java!Java_java_io_FileInputStream_readBytes+0x1d
So Java... do better. Have a way to properly handle Unicode interactive console input. Maybe it does...? A Java expert would probably know, but I can't find it on the Internet with any obvious searches. But also this problem is Windows-specific, so Windows... why you gotta be this way? In conclusion, computers are bad.