
JSON character encoding - is UTF-8 well-supported by browsers or should I use numeric escape sequences?

I am writing a webservice that uses json to represent its resources, and I am a bit stuck thinking about the best way to encode the json. Reading the json rfc (http://www.ietf.org/rfc/rfc4627.txt) it is clear that the preferred encoding is utf-8. But the rfc also describes a string escaping mechanism for specifying characters. I assume this would generally be used to escape non-ascii characters, thereby making the resulting utf-8 valid ascii.

So let's say I have a json string that contains unicode characters (code-points) that are non-ascii. Should my webservice just utf-8 encode that and return it, or should it escape all those non-ascii characters and return pure ascii?

I'd like browsers to be able to execute the results using jsonp or eval. Does that affect the decision? My knowledge of various browsers' javascript support for utf-8 is lacking.

EDIT: I wanted to clarify that my main concern about how to encode the results is really about browser handling of the results. What I've read indicates that browsers may be sensitive to the encoding when using JSONP in particular. I haven't found any really good info on the subject, so I'll have to start doing some testing to see what happens. Ideally I'd like to only escape those few characters that are required and just utf-8 encode the results.


thomasrutter

The JSON spec requires UTF-8 support by decoders. As a result, all JSON decoders can handle UTF-8 just as well as they can handle the numeric escape sequences. This is also the case for Javascript interpreters, which means JSONP will handle the UTF-8 encoded JSON as well.
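
As a minimal sketch of that equivalence (Python's standard `json` module stands in here for any conforming decoder, and the sample payload is made up), the same document written once as raw UTF-8 and once with numeric escapes decodes to the same value:

```python
import json

# The same JSON document in two spellings (sample data, purely illustrative).
raw_utf8 = '{"name": "café"}'.encode("utf-8")   # é sent as the bytes C3 A9
escaped = b'{"name": "caf\\u00e9"}'             # é sent as a numeric escape, pure ASCII

decoded_raw = json.loads(raw_utf8)
decoded_escaped = json.loads(escaped)

# Both spellings carry the same string once decoded.
same_value = decoded_raw == decoded_escaped == {"name": "café"}
```

The only difference is on the wire: the escaped form is larger but contains nothing outside ASCII.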

The ability for JSON encoders to use the numeric escape sequences instead just offers you more choice. One reason you may choose the numeric escape sequences would be if a transport mechanism in between your encoder and the intended decoder is not binary-safe.

Another reason you may want to use numeric escape sequences is to prevent certain characters appearing in the stream, such as <, & and ", which may be interpreted as HTML sequences if the JSON code is placed without escaping into HTML or a browser wrongly interprets it as HTML. This can be a defence against HTML injection or cross-site scripting (note: some characters MUST be escaped in JSON, including " and \).
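
A rough sketch of that defensive escaping, assuming Python's standard `json` module; the helper name `dumps_html_safe` and the sample payload are hypothetical, and escaping `>` as well is an addition of mine rather than something the answer asks for:

```python
import json

def dumps_html_safe(value):
    """Serialize to JSON, then escape <, > and & as \\uXXXX sequences."""
    text = json.dumps(value, ensure_ascii=False)
    # In valid JSON these three characters can only occur inside string
    # literals, so a plain textual replacement cannot change the structure.
    return (
        text.replace("<", "\\u003c")
            .replace(">", "\\u003e")
            .replace("&", "\\u0026")
    )

payload = {"comment": "</script><b>Tom & Jerry</b> é"}
safe = dumps_html_safe(payload)
```

Any JSON decoder turns the escapes back into the original characters, so the consumer sees the same data while the raw stream no longer contains anything that looks like markup. Other non-ASCII characters such as é are left as UTF-8.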

Some frameworks, including PHP's json_encode() (by default), always emit numeric escape sequences on the encoder side for any character outside of ASCII. This is a mostly unnecessary extra step intended for maximum compatibility with limited transport mechanisms and the like. However, this should not be interpreted as an indication that any JSON decoders have a problem with UTF-8.

So, I guess you could just decide which to use like this:

Just use UTF-8, unless any software you are using for storage or transport between encoder and decoder isn't binary-safe.

If it isn't binary-safe, use the numeric escape sequences.
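
One way to express that rule of thumb in code, as a sketch only: it uses Python's `json.dumps` and its `ensure_ascii` flag, and the function name `encode_json` and the `transport_is_binary_safe` parameter are made up for illustration.

```python
import json

def encode_json(value, transport_is_binary_safe=True):
    """Return the bytes to send, following the rule of thumb above."""
    if transport_is_binary_safe:
        # Leave non-ASCII characters as-is and encode the text as UTF-8.
        return json.dumps(value, ensure_ascii=False).encode("utf-8")
    # Escape every non-ASCII character so the output is 7-bit clean.
    return json.dumps(value, ensure_ascii=True).encode("ascii")

data = {"city": "Zürich"}
utf8_bytes = encode_json(data)
ascii_bytes = encode_json(data, transport_is_binary_safe=False)
```

Both outputs decode to the same value; the choice only affects what travels over the wire.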


"all JSON decoders can handle UTF-8" While this is true of browsers, just because the standard requires it doesn't mean all software decoding JSON supports UTF-8.
"All JSON decoders can handle UTF-8" is literally true. If something can't accept UTF-8, it's not a JSON decoder. It may be similar to a JSON decoder, but it definitely isn't one.
The official proposed schema for JSON specifies a JSON string as “A string of Unicode code points”. This means a string of 32-bit values. In fact, UTF-8 isn't even mentioned in json-schema.org/draft/2019-09/json-schema-core.html .
@DavidSpector wrong document - you're looking at the proposal for the media type application/schema+json, that's not where JSON is defined. When referring to encoding it says the encoding for the schema is identical to that of JSON, and references the JSON spec at: tools.ietf.org/html/rfc8259 where it is defined that JSON MUST use UTF-8 any time it's used outside of a closed ecosystem.
Thank you for the correction! I panicked when I saw "a string of Unicode code points" because this is going backwards to fixed-length characters.
Tim Tisdall

I had a problem there. When I JSON encode a string with a character like "é", every browser will return the same "é", except IE which will return "\u00e9".

Then with PHP json_decode(), it will fail if it finds "é", so for Firefox, Opera, Safari and Chrome, I have to call utf8_encode() before json_decode().

Note: in my tests, IE and Firefox are using their native JSON object, other browsers are using json2.js.


Probably you meant utf8_encode(), php.net/manual/en/function.utf8-encode.php
If IE is failing to decode that, it's a bug in whatever JSON decoder you're using. All JSON decoders must successfully decode the encoded form, or they're not a JSON decoder. As for your issue with json_decode() with the é unescaped, it's possible that the text you're feeding it isn't UTF-8. JSON decoders always assume UTF-8, even the PHP implementation, even though PHP doesn't normally assume UTF-8 in many other functions. There are other character encodings which can include an é unescaped and look identical on screen, but which aren't UTF-8. Encoding in \uXXXX form is a workaround to this.
Just saying: JSON can legally come in any Unicode encoding (UTF-8, UTF-16 BE/LE, UTF32 BE/LE, with or without byte order marker). And since ASCII is a subset of UTF-8, it can also come in ASCII. Whether parsers accept UTF-32 for example, I don't know.
That is correct, and parsers aren't required to support anything other than UTF-8. From the spec: "JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8, and JSON texts that are encoded in UTF-8 are interoperable in the sense that they will be read successfully by the maximum number of implementations; there are many implementations that cannot successfully read texts in other encodings (such as UTF-16 and UTF-32). Implementations MUST NOT add a byte order mark to the beginning of a JSON text."
@thomasrutter The spec you quoted is old. The current spec says: "JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8. Previous specifications of JSON have not required the use of UTF-8 when transmitting JSON text. However, the vast majority of JSON-based software implementations have chosen to use the UTF-8 encoding, to the extent that it is the only encoding that achieves interoperability. Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text."
chaos

ASCII isn't in it any more. Using UTF-8 encoding means that you aren't using ASCII encoding. What you should use the escaping mechanism for is what the RFC says:

All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F)
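
To see which characters an encoder actually has to escape, here is a small sketch using Python's standard `json` module (the sample string is made up):

```python
import json

# One string containing each kind of character the RFC passage mentions,
# plus an ordinary non-ASCII character (é) for comparison.
sample = 'quote:" backslash:\\ newline:\n control:\x01 accent:é'

# ensure_ascii=False asks the encoder to escape only what it has to.
encoded = json.dumps(sample, ensure_ascii=False)
```

In the result, the quotation mark, the backslash and the two control characters come out escaped, while é passes through unchanged.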


If you read that quote you provided you'll see that you are not required to escape all unicode characters, only a few special characters. But you are required to encode the results (preferably with utf-8). So the question is: "Why bother escaping normal unicode characters if you're utf-8 encoding".
Also, an ascii encoded string is a pure subset of utf-8. If I use json's escaping for all non-ascii characters, the result is ascii -- and therefore utf-8. Various json libraries (like python simplejson) have modes to force ascii results. I presume for a reason, like perhaps execution in browsers.
When you bother escaping normal unicode characters is in contexts where they're metacharacters, like strings. (The RFC chunk I quoted is about strings; sorry, wasn't clear about that.) You don't need to do ASCII output all the time; I'd think that's more for debugging with broken browsers.
Tobi Nary

I was facing the same problem. It works for me. Please check this.

json_encode($array,JSON_UNESCAPED_UNICODE);

It should be noted that the above is PHP, since the question is in no way PHP-specific and only talks about web service which also may not use PHP (as the older ones of our readers may still remember…)
Community

Reading the json rfc (http://www.ietf.org/rfc/rfc4627.txt) it is clear that the preferred encoding is utf-8.

FYI, RFC 4627 is no longer the official JSON spec. It was obsoleted in 2014 by RFC 7159, which was then obsoleted in 2017 by RFC 8259, which is the current spec.

RFC 8259 states:

8.1. Character Encoding

JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629].

Previous specifications of JSON have not required the use of UTF-8 when transmitting JSON text. However, the vast majority of JSON-based software implementations have chosen to use the UTF-8 encoding, to the extent that it is the only encoding that achieves interoperability.

Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.
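
A minimal sketch of the byte order mark rules quoted above, assuming Python's standard `json` and `codecs` modules; the helper names `parse_json_bytes` and `emit_json_bytes` are hypothetical:

```python
import codecs
import json

def parse_json_bytes(raw):
    """Decode UTF-8 JSON bytes, tolerating an optional leading BOM."""
    # "utf-8-sig" drops one leading U+FEFF if present and otherwise
    # behaves like plain "utf-8".
    return json.loads(raw.decode("utf-8-sig"))

def emit_json_bytes(value):
    """Encode as UTF-8; Python's "utf-8" codec never prepends a BOM."""
    return json.dumps(value, ensure_ascii=False).encode("utf-8")

with_bom = codecs.BOM_UTF8 + b'{"ok": true}'
without_bom = b'{"ok": true}'
```

The parser side takes the lenient option the RFC allows (MAY ignore), while the emitter side never writes a BOM.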


Paul Smith

I had a similar problem with the é char. I think the comment "it's possible that the text you're feeding it isn't UTF-8" is probably close to the mark here. I have a feeling the default collation in my instance was something else until I realized and changed it to utf8. The problem is that the data was already there, so I'm not sure whether it was converted when I changed the collation; it displays fine in MySQL Workbench. The end result is that PHP will not JSON encode the data, json_encode() just returns false. It doesn't matter what browser you use, as it's the server causing my issue: PHP will not parse the data as utf8 if this char is present. Like I say, I'm not sure if it is due to converting the schema to utf8 after the data was present or just a PHP bug. In this case use json_encode(utf8_encode($string));
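
The "not actually UTF-8" failure can be reproduced outside PHP. The sketch below uses Python; the helper name `loads_utf8_or_latin1` is made up, and falling back to Latin-1 is an assumption that only holds if the stored bytes really are ISO-8859-1:

```python
import json

# "é" as a single Latin-1 byte (E9) instead of the two UTF-8 bytes (C3 A9).
latin1_bytes = '{"name": "café"}'.encode("latin-1")

def loads_utf8_or_latin1(raw):
    """Try strict UTF-8 first; fall back to Latin-1 (an assumption!)."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Roughly what PHP's utf8_encode() does: reinterpret the bytes
        # as ISO-8859-1. Only correct if that really is the source encoding.
        text = raw.decode("latin-1")
    return json.loads(text)

result = loads_utf8_or_latin1(latin1_bytes)
```

The more robust fix is usually to make sure the data source hands over UTF-8 in the first place, since a fallback like this silently guesses the encoding.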