
Why does Java permit escaped unicode characters in the source code?

I recently learned that Unicode is permitted within Java source code not only as Unicode characters (e.g. double π = Math.PI; ) but also as escape sequences (e.g. double \u03C0 = Math.PI; ).

The first variant makes sense to me - it allows programmers to name variables and methods in an international language of their choice. However, I don't see any practical application of the second approach.

Here are a few pieces of code to illustrate usage, tested with Java SE 6 and NetBeans 6.9.1:

This code will print out 3.141592653589793

public static void main(String[] args) {
    double π = Math.PI;
    System.out.println(\u03C0);
}

Explanation: π and \u03C0 are the same Unicode character

This code will not print out anything

public static void main(String[] args) {
    double π = Math.PI; /\u002A
    System.out.println(π);

    /* a comment */
}

Explanation: The code above actually encodes:

public static void main(String[] args) {
    double π = Math.PI; /*
    System.out.println(π);

    /* a comment */
}

Which comments out the print statement.

Just from my examples, I notice a number of potential problems with this language feature.

First, a bad programmer could use it to secretly comment out bits of code, or create multiple ways of identifying the same variable. Perhaps there are other horrible things that can be done that I haven't thought of.

Second, there seems to be a lack of support among IDEs. Neither NetBeans nor Eclipse provided the correct code highlighting for the examples. In fact, NetBeans even marked a syntax error (though compilation was not a problem).

Finally, this feature is poorly documented and not commonly accepted. Why would a programmer use something in his code that other programmers will not be able to recognize and understand? In fact, I couldn't even find anything about this on the Hidden Java Features question.

My question is this:

Why does Java allow escaped Unicode sequences to be used within syntax? What are some "pros" of this feature that have allowed it to stay a part of Java, despite its many "cons"?

"First, a bad programmer could use it to..." a bad programmer will find another way to make the code worse, even if there is no unicode escaping.
Absolutely, a bad programmer will always find ways to make the code worse. What I'm trying to say is that the Java designers made deliberate decisions in an attempt to minimize abuse. For example, multiple inheritance, pointers, macros, and operator overloading are common practice in C++ but were specifically not included in Java.
For extra fun, move the /\u002A far to the right, outside the IDE's viewport.
@TiborBlenessy Because that tree is not in the BMP (Basic Multilingual Plane) of Unicode. Java allows the use of any characters in the BMP in Java source code.
@vurp0, That's completely wrong. Non-BMP is accepted too. The tree is rejected, though, because its Unicode category isn't LETTER_NUMBER. See docs.oracle.com/javase/7/docs/api/java/lang/… and stackoverflow.com/a/65490/632951
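(A quick way to check that claim yourself is Character.isJavaIdentifierStart. A minimal sketch; the class name is mine, and 🌲 is U+1F332:

public class IdentifierCheck {
    public static void main(String[] args) {
        // π (U+03C0) has Unicode category Ll (lowercase letter), so it may start an identifier.
        System.out.println(Character.isJavaIdentifierStart('π'));     // true
        // 🌲 (U+1F332) has category So (Symbol, Other), so the compiler rejects it.
        System.out.println(Character.isJavaIdentifierStart(0x1F332)); // false
    }
})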

Nayuki

Unicode escape sequences allow you to store and transmit your source code in pure ASCII and still use the entire range of Unicode characters. This has two advantages:

No risk of non-ASCII characters getting broken by tools that can't handle them. This was a real concern back in the early 1990s when Java was designed. Sending an email containing non-ASCII characters and having it arrive unmangled was the exception rather than the norm.

No need to tell the compiler and editor/IDE which encoding to use for interpreting the source code. This is still a very valid concern. Of course, a much better solution would have been to have the encoding as metadata in a file header (as in XML), but this hadn't yet emerged as a best practice back then.
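To make that concrete, here is a minimal sketch (the class name is mine): the first literal survives any ASCII-only toolchain byte-for-byte, while the second depends on the editor and compiler agreeing on the file's encoding.

public class PureAscii {
    public static void main(String[] args) {
        String escaped = "caf\u00E9"; // pure ASCII in the source file, always safe
        String literal = "café";      // correct only if editor and compiler agree on the encoding
        System.out.println(escaped.equals(literal)); // true: both are the same UTF-16 code units
    }
}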

The first variant makes sense to me - it allows programmers to name variables and methods in an international language of their choice. However, I don't see any practical application of the second approach.

Both will result in exactly the same byte code and have the same power as a language feature. The only difference is in the source code.

First, a bad programmer could use it to secretly comment out bits of code, or create multiple ways of identifying the same variable.

If you're concerned about a programmer deliberately sabotaging your code's readability, this language feature is the least of your problems.

Second, there seems to be a lack of support among IDEs.

That's hardly the fault of the feature or its designers. But then, I don't think it was ever intended to be used "manually". Ideally, the IDE would have an option to let you enter the characters normally and have them displayed normally, but automatically save them as Unicode escape sequences. There may even already be plugins or configuration options that make the IDEs behave that way.
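In fact, the JDK used to ship a native2ascii tool that performed exactly this conversion (mainly for properties files). As a rough sketch of the core transformation (class and method names are mine, not any IDE's API):

public class Escaper {
    // Replace every non-ASCII char with its \uXXXX escape; ASCII passes through unchanged.
    static String toAsciiEscapes(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 128) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAsciiEscapes("double π = Math.PI;"));
        // prints: double \u03C0 = Math.PI;
    }
}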

But in general, this feature seems to be very rarely used and therefore probably badly supported. But how could the people who designed Java around 1993 have known that?


No need to tell the compiler and editor/IDE which encoding to use for interpreting the source code: Are you sure about that? The string System.out.println(\u03C0); encodes in US-ASCII and UTF-8 as 27 bytes, but UTF-16, for example, will output 56 bytes. Most charsets will return the same 27 bytes for this string, but not all. So I guess encoding of source files is still an issue.
@Michael Konietzka: He obviously meant that it would allow one to use pure ASCII files, which will not confuse any decent IDE, compiler, or editor a bit...
Being 7-bit safe is also nice for emails
Don't forget that, back then, the version control system usually didn't support Unicode yet either, leaving the IDE to choose which character set was supposed to be used. With plain ASCII + escapes, any compatible choice is OK (but UTF-16, of course, still isn't).
Steven Schlansker

The nice thing about the \u03C0 encoding is that it is much less likely to be munged by a text editor with the wrong encoding settings. For example, a bug in my software was once caused by a misconfigured text editor accidentally re-encoding a UTF-8 é as MacRoman. By specifying the Unicode code point, it's completely unambiguous what you mean.
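The sketch below reproduces the same kind of mangling; it uses ISO-8859-1 rather than the MacRoman the answer mentions, since ISO-8859-1 is guaranteed to be available in every JRE (the class name is mine):

import java.nio.charset.StandardCharsets;

public class Mojibake {
    public static void main(String[] args) {
        String e = "\u00E9"; // é, spelled in pure ASCII
        byte[] utf8 = e.getBytes(StandardCharsets.UTF_8); // 0xC3 0xA9
        // Misreading those UTF-8 bytes as ISO-8859-1 mangles the character:
        String mangled = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(e + " -> " + mangled); // é -> Ã©
    }
}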


Thorbjørn Ravn Andersen

The \uXXXX syntax allows Unicode characters to be represented unambiguously in a file whose encoding is not capable of expressing them directly, or if you want a representation guaranteed to be usable even in the lowest common denominator: a 7-bit ASCII encoding.

You could represent all your characters with \uXXXX, even spaces and letters, but there is rarely a need to.
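Taken to the extreme, even keywords can be spelled in escapes, because the translation happens before tokenization. A sketch (the class name is mine; the file would need to be named EscapedKeyword.java):

\u0063\u006C\u0061\u0073\u0073 EscapedKeyword {
    public static void main(String[] args) {
        // The first token above is the keyword "class", written as
        // \u0063 \u006C \u0061 \u0073 \u0073 = c l a s s.
        System.out.println("compiles and runs");
    }
}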


AlexR

First, thank you for the question; I think it is very interesting. Second, the reason is that a Java source file is text that can itself be stored in any of various charsets. For example, the default charset in Eclipse may be Cp1255, an encoding that does not support characters like π. I think the designers had in mind programmers who have to work on systems that do not support Unicode, and wanted to allow those programmers to create Unicode-enabled software. That was the reason for supporting the \u notation.


The default charset in Eclipse is the default charset of the platform. On your computer it might be CP1255, on mine it's UTF-8.
Andy Turner

The language spec says why this is permitted. There might be other unstated reasons, and unintended benefits and consequences; but this provides a direct answer to the question (emphasis mine):

A raw Unicode character stream is translated into a sequence of tokens, using the following three lexical translation steps, which are applied in turn:

1. A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx. This translation step allows any program to be expressed using only ASCII characters.

...
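Because this translation happens on the raw character stream, before comments are even recognized, an escape like \u000A counts as a real line terminator inside a // comment. A minimal sketch (the class name is mine):

public class EscapeInComment {
    public static void main(String[] args) {
        // this comment ends at the escape \u000A System.out.println("this statement runs!");
    }
}

After translation step 1, the \u000A becomes a line feed, the // comment ends there, and the println is live code, so the program prints "this statement runs!". This is the same mechanism as the /\u002A example in the question.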