
How does UTF-8 "variable-width encoding" work?

The Unicode standard has enough code points that a fixed-width encoding needs 4 bytes to store each one. That's what the UTF-32 encoding does. Yet the UTF-8 encoding somehow squeezes these into much smaller spaces by using something called "variable-width encoding".

In fact, it manages to represent all 128 characters of US-ASCII in just one byte each, which looks exactly like real ASCII, so you can interpret lots of ASCII text as if it were UTF-8 without doing anything to it. Neat trick. So how does it work?

I'm going to ask and answer my own question here because I just did a bit of reading to figure it out and I thought it might save somebody else some time. Plus maybe somebody can correct me if I've got some of it wrong.

Straight Unicode does not require 32 bits to encode all its code points. They once did lay claim to that many possible code points, but after UTF-8 took off, they intentionally limited themselves to 21 bits, so that UTF-8 will never exceed 4 bytes per character. Unicode currently requires only 17 bits to hold all possible code points. Without this limitation, UTF-8 could have gone to 6 bytes per character.
@Warren: mostly accurate, but Unicode is a 21-bit code (U+0000 to U+10FFFF).
@Warren: 4-byte-limited UTF-8 could have supported up to U+1FFFFF. The restriction to U+10FFFF was made for the sake of UTF-16.
@dan04 Do we have any easy explanation of how it is restricted to U+10FFFF by UTF-16? It would be nice to know more about this.
@A-letubby: Because the UTF-16 “surrogate” codes are allocated such that there are 1024 lead surrogates and 1024 trail surrogates (and they can only be used in pairs), to make 2^20 (about a million) additional characters available beyond the BMP. Added to the 2^16 characters available in the BMP, this makes 0x110000 possible characters.
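The arithmetic in that last comment can be checked with a few lines of Python. This is only a sketch: the constant names and the `to_surrogate_pair` helper are made up for illustration, while the surrogate ranges (U+D800..U+DBFF for lead, U+DC00..U+DFFF for trail) are the ones UTF-16 uses.

```python
# Arithmetic behind the U+10FFFF limit (names here are illustrative only).
LEAD_SURROGATES = 0xDBFF - 0xD800 + 1    # 1024 lead (high) surrogates
TRAIL_SURROGATES = 0xDFFF - 0xDC00 + 1   # 1024 trail (low) surrogates
BMP_SIZE = 2 ** 16                       # code points U+0000..U+FFFF

supplementary = LEAD_SURROGATES * TRAIL_SURROGATES   # 2**20 possible pairs
total = BMP_SIZE + supplementary                     # 0x110000 code points


def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a UTF-16 (lead, trail) pair."""
    offset = code_point - 0x10000          # 20-bit value
    lead = 0xD800 + (offset >> 10)         # top 10 bits
    trail = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return lead, trail


print(hex(total))                                      # 0x110000
print([hex(u) for u in to_surrogate_pair(0x10FFFF)])   # ['0xdbff', '0xdfff']
```

The highest code point, U+10FFFF, maps to the last lead surrogate paired with the last trail surrogate, which is why UTF-16 cannot address anything beyond it.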

André Chalella

Each byte starts with a few bits that tell you whether it's a single byte code-point, a multi-byte code point, or a continuation of a multi-byte code point. Like this:

0xxx xxxx    A single-byte US-ASCII code (one of the first 128 characters)

The multi-byte code-points each start with a few bits that essentially say "hey, you need to also read the next byte (or two, or three) to figure out what I am." They are:

110x xxxx    One more byte follows
1110 xxxx    Two more bytes follow
1111 0xxx    Three more bytes follow

Finally, the bytes that follow those start codes all look like this:

10xx xxxx    A continuation of one of the multi-byte characters

Since you can tell what kind of byte you're looking at from the first few bits, then even if something gets mangled somewhere, you don't lose the whole sequence.
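To make the bit patterns above concrete, here is a minimal encoder sketch in Python. The function name `utf8_encode` is my own, and in real code you would just call `str.encode("utf-8")`; this only shows where the `x` bits come from. It assumes the modern 4-byte limit (U+10FFFF) and refuses surrogates.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 bytes (illustrative sketch)."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if code_point < 0x80:          # 0xxx xxxx (7 payload bits)
        return bytes([code_point])
    if code_point < 0x800:         # 110x xxxx  10xx xxxx (11 payload bits)
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:       # 1110 xxxx  10xx xxxx  10xx xxxx (16 bits)
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 1111 0xxx  10xx xxxx  10xx xxxx  10xx xxxx (21 payload bits)
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare against Python's built-in encoder and show the bits of each byte.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", " ".join(f"{b:08b}" for b in encoded))
```

Each branch picks the shortest pattern that has enough `x` bits for the code point, then fills those bits from the most significant end.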


There is more to the story than that - because the encoding must be the shortest possible encoding for the character, which ends up meaning that bytes 0xC0 and 0xC1 cannot appear in UTF-8, for example; and, in fact, neither can 0xF5..0xFF. See the UTF-8 FAQ at unicode.org/faq/utf_bom.html, or unicode.org/versions/Unicode5.2.0/ch03.pdf
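A quick way to see the shortest-form rule in action is to hand Python's strict UTF-8 decoder some sequences that follow the bit patterns but are not allowed. This is just a sketch; `is_valid_utf8` is a helper name I made up.

```python
def is_valid_utf8(data):
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "/" is U+002F, so its only legal encoding is the single byte 0x2F.
# Stuffing the same 7 bits into a two-byte pattern gives C0 AF ("overlong").
overlong_slash = bytes([0xC0 | (0x2F >> 6), 0x80 | (0x2F & 0x3F)])

print(is_valid_utf8(b"/"))                  # True
print(is_valid_utf8(overlong_slash))        # False: lead byte 0xC0 never occurs
print(is_valid_utf8(b"\xf4\x8f\xbf\xbf"))   # True: U+10FFFF, the last code point
print(is_valid_utf8(b"\xf4\x90\x80\x80"))   # False: would be U+110000
print(is_valid_utf8(b"\xf5\x80\x80\x80"))   # False: lead byte 0xF5 never occurs
```

A two-byte sequence led by 0xC0 or 0xC1 can carry at most 7 payload bits, which always fit in one byte, so those lead bytes can only ever produce overlong forms. Lead bytes 0xF5 and up would start values above U+10FFFF.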
Why couldn't it use just one bit to say that the next byte is a continuation? A 3-byte character would then look like 1xxxxxxx 1xxxxxxx 0xxxxxxx, so less space would be wasted.
@Soaku it makes UTF-8 a so-called "self-synchronizing" code. This means if due to errors parts of the sequence are missing, it is possible to detect that and discard whatever got garbled. If you read a byte that starts with 10xx, and there's no preceding "start" byte, you can discard it as it's meaningless. If you had a system like you described, and one of the first bytes is lost, you might end up with a different, valid character with no indication of any kind of error. It will also make it easy to locate the next valid character, as well as correct for missing "continuation" bytes.
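Here is a small sketch of that resynchronization idea in Python. The names `is_continuation` and `resync` are hypothetical helpers, not part of any library; the only real rule used is that continuation bytes always match 10xx xxxx.

```python
def is_continuation(byte):
    """True for bytes of the form 10xx xxxx."""
    return (byte & 0xC0) == 0x80


def resync(data, pos):
    """Return the index of the first byte at or after pos that can start a character."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


text = "a\u20acb".encode("utf-8")   # bytes 61 E2 82 AC 62
damaged = text[2:]                   # lead byte E2 was lost: 82 AC 62

start = resync(damaged, 0)           # skip the orphaned continuation bytes
print(start)                                  # 2
print(damaged[start:].decode("utf-8"))        # b
```

Because a continuation byte can never be mistaken for the start of a character, at most one character is lost and decoding picks up cleanly at the next lead byte.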
Community

RFC3629 - UTF-8, a transformation format of ISO 10646 is the final authority here and has all the explanations.

In short, several bits in each byte of the UTF-8-encoded 1-to-4-byte sequence representing a single character are used to indicate whether it's a trailing byte, a leading byte, and if so, how many bytes follow. The remaining bits contain the payload.
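As a rough illustration of that summary, the following Python sketch splits each byte into its marker bits and payload bits and then reassembles one character. The names `describe_byte` and `decode_one` are mine, not from the RFC, and the sketch deliberately skips the overlong, surrogate, and range checks that a conforming decoder must perform.

```python
def describe_byte(byte):
    """Return (kind, bytes_that_follow, payload) for a single UTF-8 byte."""
    if (byte & 0x80) == 0x00:          # 0xxx xxxx
        return "single", 0, byte & 0x7F
    if (byte & 0xC0) == 0x80:          # 10xx xxxx
        return "trailing", 0, byte & 0x3F
    if (byte & 0xE0) == 0xC0:          # 110x xxxx
        return "leading", 1, byte & 0x1F
    if (byte & 0xF0) == 0xE0:          # 1110 xxxx
        return "leading", 2, byte & 0x0F
    if (byte & 0xF8) == 0xF0:          # 1111 0xxx
        return "leading", 3, byte & 0x07
    raise ValueError(f"0x{byte:02X} matches no UTF-8 byte pattern")


def decode_one(seq):
    """Decode exactly one character's worth of bytes into a code point."""
    kind, following, value = describe_byte(seq[0])
    if kind == "trailing":
        raise ValueError("sequence starts with a trailing byte")
    if len(seq) != following + 1:
        raise ValueError("wrong number of trailing bytes")
    for byte in seq[1:]:
        kind, _, payload = describe_byte(byte)
        if kind != "trailing":
            raise ValueError("expected a trailing byte")
        value = (value << 6) | payload   # append 6 payload bits
    return value


print(describe_byte(0xE2))                   # ('leading', 2, 2)
print(hex(decode_one(b"\xe2\x82\xac")))      # 0x20ac
```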


Ummmm, silly me, I thought the Unicode Standard was the final authority on UTF-8
The Unicode Standard defines Unicode itself. It doesn't define the various methods, present and future, that can be used to encode Unicode text for a variety of purposes (such as storage and transport). UTF-8 is one of those methods, and the above reference is to the document that defines it.
RFC3629, page 3, section 3 says "UTF-8 is defined by the Unicode Standard".
Chasing links on unicode.org took me to section 3.9 of the Unicode Standard and specifically definition D92 (and also tangentially D86). I have no idea to what extent this link will be useful when new versions are released but I would imagine that they want to keep the section and definition identifiers stable across versions.
Andrew

UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

Excerpt from The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)


That's a good article, but it seems Joel is wrong regarding the maximum length of the sequence; the Wikipedia page shows 1..4 bytes per character, only.
As I said above, when UTF-8 was first created, Unicode lay claim to up to 32 bits for code points, not because they really needed it, only because 32 bits is a convenient value and they'd already blown past the previous limit of 16-bit characters. After UTF-8 proved popular, they chose to forever limit the maximum number of code points to 2^21, that being the largest value you can encode with 4 bytes of the UTF-8 scheme. There are still fewer than 2^17 characters in Unicode, so we can more than quadruple the number of characters in Unicode with this new scheme.
OK, but this is not the explanation the OP asked for.
This is not answering the question.