
How many characters can be mapped with Unicode?

I am asking for the count of all the possible valid combinations in Unicode, with an explanation. I know a character can be encoded as 1, 2, 3, or 4 bytes. I also don't understand why continuation bytes have restrictions even though the starting byte of a character already makes clear how long the sequence should be.


Boris Verkhovskiy

I am asking for the count of all the possible valid combinations in Unicode, with an explanation.

1,111,998: 17 planes × 65,536 characters per plane − 2,048 surrogates − 66 noncharacters

Note that UTF-8 and UTF-32 could theoretically encode much more than 17 planes, but the range is restricted based on the limitations of the UTF-16 encoding.

137,929 code points are actually assigned in Unicode 12.1.
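As a quick sanity check of that arithmetic, here is a minimal sketch in Python (the constants come from the Unicode standard; the script itself is just an illustration):

    planes = 17
    code_points = planes * 65_536        # 1,114,112 code points in total
    surrogates = 0xE000 - 0xD800         # 2,048 surrogates (U+D800..U+DFFF)
    noncharacters = 2 * planes + 32      # U+nFFFE/U+nFFFF per plane, plus U+FDD0..U+FDEF
    print(code_points - surrogates - noncharacters)   # 1111998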

I also don't understand why continuation bytes have restrictions even though the starting byte of a character already makes clear how long the sequence should be.

The purpose of this restriction in UTF-8 is to make the encoding self-synchronizing.

For a counterexample, consider the Chinese GB 18030 encoding. There, the letter ß is represented as the byte sequence 81 30 89 38, which contains the encoding of the digits 0 and 8. So if you have a string-searching function not designed for this encoding-specific quirk, then a search for the digit 8 will find a false positive within the letter ß.

In UTF-8, this cannot happen, because the non-overlap between lead bytes and trail bytes guarantees that the encoding of a shorter character can never occur within the encoding of a longer character.
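Here is a minimal demonstration of that point in Python, assuming the standard library's gb18030 codec (the byte values match the ones quoted above):

    gb = "ß".encode("gb18030")     # b'\x810\x898': contains the ASCII bytes for '0' and '8'
    print(b"8" in gb)              # True: a naive byte search finds a digit inside ß

    utf8 = "ß".encode("utf-8")     # b'\xc3\x9f': every byte is >= 0x80
    print(b"8" in utf8)            # False: no ASCII byte can appear inside a multi-byte sequence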


The "self-synchronizing" article you linked doesn't explain what's self-synchronizing at all
Just as an interesting note: UTF-8 only needs 4 bytes to map all Unicode characters, but UTF-8 could support up to about 68 billion characters (2³⁶) if it were ever required, taking up to 7 bytes per character.
Simon Nickerson

Unicode allows for 17 planes, each of 65,536 possible characters (or 'code points'). This gives a total of 1,114,112 possible characters. At present, only about 12% of this space has been allocated.

The precise details of how these code points are encoded differ with the encoding, but your question makes it sound like you are thinking of UTF-8. The reason for the restrictions on the continuation bytes is presumably to make it easy to find the beginning of the next character (continuation bytes are always of the form 10xxxxxx, but a starting byte can never be of this form).
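As a sketch of what that property gives you (the helper names are mine, not from any standard API), the sequence length can be read off the lead byte alone, and a continuation byte is recognizable in isolation:

    def is_continuation(b: int) -> bool:
        return b & 0xC0 == 0x80           # 10xxxxxx

    def sequence_length(lead: int) -> int:
        """Number of bytes in a UTF-8 sequence, derived from the lead byte."""
        if lead & 0x80 == 0x00: return 1  # 0xxxxxxx: ASCII
        if lead & 0xE0 == 0xC0: return 2  # 110xxxxx
        if lead & 0xF0 == 0xE0: return 3  # 1110xxxx
        if lead & 0xF8 == 0xF0: return 4  # 11110xxx
        raise ValueError("continuation or invalid byte")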


According to these "planes", even the last three bytes of a 4-byte character could each express 64 values. Am I wrong?
Yes, that is for synchronization, see cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
That's outdated, I think. It doesn't use 6 bytes anymore.
@Andy: That makes sense: the original spec for UTF-8 worked for bigger numbers. The 21-bit limit was a sop to the folks who had locked themselves into 16-bit characters, and thus did UCS-2 beget the abomination known as UTF-16.
@Simon: There are 34 noncharacter code points: anything that, when bitwise-ANDed with 0xFFFE, yields 0xFFFE, so two such code points per plane. Also, there are 32 noncharacter code points in the range 0x00_FDD0 .. 0x00_FDEF, for 66 in total. Plus you should subtract from that the surrogates, which are not legal for open interchange due to the UTF-16 flaw, but must be supported inside your program.
Ray Toal

Unicode supports 1,114,112 code points. There are 2,048 surrogate code points, giving 1,112,064 scalar values. Of these, there are 66 noncharacters, leading to 1,111,998 possible encoded characters (unless I made a calculation error).


Can you look at my answer? Why are there 1,112,114 code points?
This number comes from the number of planes that is addressable using the UTF-16 surrogate system. You have 1024 low surrogates and 1024 high surrogates, giving 1024² non-BMP code points. This plus the 65,536 BMP code points gives exactly 1,114,112.
@Philipp, but you give '1_112_114' in your answer, but you explain '1_114_112' in your comment. Perhaps you mixed up the 2 and 4.
This answer has been sitting around with the calculation errors for years now, so I took the liberty to clean it up. Yes, the value 1112114 in the answer was a typo. The correct value is 1114112, which is the decimal value of 0x110000.
Andy Finkenstadt

To give a metaphorically accurate answer, all of them.

Continuation bytes in the UTF-8 encoding allow for resynchronization of the encoded octet stream in the face of "line noise". The decoder merely needs to scan forward for a byte that does not have a value between 0x80 and 0xBF to know that it has found the start of a new character.
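A sketch of that forward scan in Python (the function name is hypothetical, not a standard API):

    def resync(buf: bytes, i: int) -> int:
        """Skip continuation bytes (0x80..0xBF) until the next character start."""
        while i < len(buf) and 0x80 <= buf[i] <= 0xBF:
            i += 1
        return i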

In theory, the encodings used today allow for the expression of characters whose Unicode character number is up to 31 bits in length. In practice, this encoding is actually implemented on services like Twitter, where a maximal-length tweet can encode up to 4,340 bits' worth of data (140 characters, valid and invalid, times 31 bits each).


Actually, in theory it is not limited to 31 bits; you can go bigger on a 64-bit machine. perl -le 'print ord "\x{1FFF_FFFF_FFFF}"' prints out 35184372088831 on a 64-bit machine, but gives integer overflow on a 32-bit machine. You can use bigger chars like that inside your perl program, but if you try to print them out as utf8, you get a mandatory warning unless you disable such: perl -le 'print "\x{1FFF_FFFF}"' Code point 0x1FFFFFFF is not Unicode, may not be portable at -e line 1. ######. There is a difference between "loose utf8" and "strict UTF-8": the former is not restricted.
The encodings used today don't allow for 31-bit scalar values. UTF-32 would allow for 32-bit values, UTF-8 for even more, but UTF-16 (used internally by Windows, OS X, Java, .NET, Python, and therefore the most popular encoding scheme) allows for just over one million (which should still be enough).
"All of them" isn't quite accurate; there are characters in legacy encodings that aren't in Unicode. For example, the Apple logo in MacRoman, and a couple of the graphics characters in ATASCII. OTOH, there's a Private Use Area, so these characters can be mapped with Unicode; they're just not part of the standard.
@tchrist: Python 3 does use UTF-16; for example, on my system I can say len(chr(0x10000)), giving 2 (code units). OS X's kernel uses UTF-8, correct—but the high-level APIs (Cocoa etc.) use UTF-16.
@Philip: I only use Python 2, whose Unicode support leaves a lot to be desired. I'm a systems guy, so I don't do end-user chrome-plating: all the syscalls I use on OS X take UTF-8, which the kernel converts into NFC for you. My UTF-16 experiences in Java have been bad: try a regex bracketed charclass match with some literal non-BMP code points in there, like [𝒜-𝒵], and you'll see why I find exposing UTF-16 to be a botch. It's a mistake to make programmers think in encoding forms instead of in logical characters.
Dmitry Pleshkov

Unicode has room for 0x110000 code points, which is 1,114,112 in decimal.
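In Python, for instance:

    >>> 0x110000
    1114112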


Boris Verkhovskiy

According to Wikipedia, Unicode 12.1 (released in May 2019) contains 137,994 distinct characters.


@Ufuk: Unicode doesn't have characters. It has code points. Sometimes it takes multiple code points to make up one character. For example, the character "5̃" is two code points, whereas the character "ñ" may be one or two code points (or more!); see the sketch after these comments. There are 1,114,112 possible code points, but some of those are reserved as noncharacters or partial characters (the surrogates).
Unicode is a character encoding standard. First answer from unicode.org/faq/basic_q.html: “Unicode is the universal character encoding,” so saying that “Unicode is not an encoding” is wrong. (I once made that mistake myself.)
@tchrist: The Unicode standard defines multiple terms, among them “abstract character” and “encoded character.” So saying that Unicode doesn’t have characters is also not true.
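To illustrate the code point vs. character distinction discussed in the comments above, here is a small Python sketch using the standard unicodedata module:

    import unicodedata

    one = unicodedata.normalize("NFC", "n\u0303")    # ñ composed: one code point
    two = unicodedata.normalize("NFD", "\u00F1")     # ñ decomposed: n + combining tilde
    print(len(one), len(two))                        # 1 2
    print(one == two)                                # False: different code point sequences
    print(unicodedata.normalize("NFC", two) == one)  # True once normalized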