ChatGPT解决这个技术问题 Extra ChatGPT

What are Unicode, UTF-8, and UTF-16?

What's the basis for Unicode and why the need for UTF-8 or UTF-16? I have researched this on Google and searched here as well, but it's not clear to me.

In VSS, when doing a file comparison, sometimes there is a message saying the two files have differing UTF's. Why would this be the case?

Please explain in simple terms.

Sounds like you need to read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets! It's a very good explanation of what's going on.
This FAQ from the official Unicode web site has some answers for you.
@John: it's a very nice introduction, but it's not the ultimate source: It skips quite a few of the details (which is fine for an overview/introduction!)
The article is great, but it has several mistakes and represents UTF-8 in somewhat conservative light. I suggest reading utf8everywhere.org as a supplement.
Take a look at this website: utf8everywhere.org

P
Peter Mortensen

Why do we need Unicode?

In the (not too) early days, all that existed was ASCII. This was okay, as all that would ever be needed were a few control characters, punctuation, numbers and letters like the ones in this sentence. Unfortunately, today's strange world of global intercommunication and social media was not foreseen, and it is not too unusual to see English, العربية, 汉语, עִבְרִית, ελληνικά, and ភាសាខ្មែរ in the same document (I hope I didn't break any old browsers).

But for argument's sake, let’s say Joe Average is a software developer. He insists that he will only ever need English, and as such only wants to use ASCII. This might be fine for Joe the user, but this is not fine for Joe the software developer. Approximately half the world uses non-Latin characters and using ASCII is arguably inconsiderate to these people, and on top of that, he is closing off his software to a large and growing economy.

Therefore, an encompassing character set including all languages is needed. Thus came Unicode. It assigns every character a unique number called a code point. One advantage of Unicode over other possible sets is that the first 256 code points are identical to ISO-8859-1, and hence also ASCII. In addition, the vast majority of commonly used characters are representable by only two bytes, in a region called the Basic Multilingual Plane (BMP). Now a character encoding is needed to access this character set, and as the question asks, I will concentrate on UTF-8 and UTF-16.

Memory considerations

So how many bytes give access to what characters in these encodings?

UTF-8:

1 byte: Standard ASCII

2 bytes: Arabic, Hebrew, most European scripts (most notably excluding Georgian)

3 bytes: BMP

4 bytes: All Unicode characters

UTF-16:

2 bytes: BMP

4 bytes: All Unicode characters

It's worth mentioning now that characters not in the BMP include ancient scripts, mathematical symbols, musical symbols, and rarer Chinese, Japanese, and Korean (CJK) characters.

If you'll be working mostly with ASCII characters, then UTF-8 is certainly more memory efficient. However, if you're working mostly with non-European scripts, using UTF-8 could be up to 1.5 times less memory efficient than UTF-16. When dealing with large amounts of text, such as large web-pages or lengthy word documents, this could impact performance.

Encoding basics

Note: If you know how UTF-8 and UTF-16 are encoded, skip to the next section for practical applications.

UTF-8: For the standard ASCII (0-127) characters, the UTF-8 codes are identical. This makes UTF-8 ideal if backwards compatibility is required with existing ASCII text. Other characters require anywhere from 2-4 bytes. This is done by reserving some bits in each of these bytes to indicate that it is part of a multi-byte character. In particular, the first bit of each byte is 1 to avoid clashing with the ASCII characters.

UTF-16: For valid BMP characters, the UTF-16 representation is simply its code point. However, for non-BMP characters UTF-16 introduces surrogate pairs. In this case a combination of two two-byte portions map to a non-BMP character. These two-byte portions come from the BMP numeric range, but are guaranteed by the Unicode standard to be invalid as BMP characters. In addition, since UTF-16 has two bytes as its basic unit, it is affected by endianness. To compensate, a reserved byte order mark can be placed at the beginning of a data stream which indicates endianness. Thus, if you are reading UTF-16 input, and no endianness is specified, you must check for this.

As can be seen, UTF-8 and UTF-16 are nowhere near compatible with each other. So if you're doing I/O, make sure you know which encoding you are using! For further details on these encodings, please see the UTF FAQ.

Practical programming considerations

Character and string data types: How are they encoded in the programming language? If they are raw bytes, the minute you try to output non-ASCII characters, you may run into a few problems. Also, even if the character type is based on a UTF, that doesn't mean the strings are proper UTF. They may allow byte sequences that are illegal. Generally, you'll have to use a library that supports UTF, such as ICU for C, C++ and Java. In any case, if you want to input/output something other than the default encoding, you will have to convert it first.

Recommended, default, and dominant encodings: When given a choice of which UTF to use, it is usually best to follow recommended standards for the environment you are working in. For example, UTF-8 is dominant on the web, and since HTML5, it has been the recommended encoding. Conversely, both .NET and Java environments are founded on a UTF-16 character type. Confusingly (and incorrectly), references are often made to the "Unicode encoding", which usually refers to the dominant UTF encoding in a given environment.

Library support: The libraries you are using support some kind of encoding. Which one? Do they support the corner cases? Since necessity is the mother of invention, UTF-8 libraries will generally support 4-byte characters properly, since 1, 2, and even 3 byte characters can occur frequently. However, not all purported UTF-16 libraries support surrogate pairs properly since they occur very rarely.

Counting characters: There exist combining characters in Unicode. For example, the code point U+006E (n), and U+0303 (a combining tilde) forms ñ, but the code point U+00F1 forms ñ. They should look identical, but a simple counting algorithm will return 2 for the first example, and 1 for the latter. This isn't necessarily wrong, but it may not be the desired outcome either.

Comparing for equality: A, А, and Α look the same, but they're Latin, Cyrillic, and Greek respectively. You also have cases like C and Ⅽ. One is a letter, and the other is a Roman numeral. In addition, we have the combining characters to consider as well. For more information, see Duplicate characters in Unicode.

Surrogate pairs: These come up often enough on Stack Overflow, so I'll just provide some example links:

Getting string length

Removing surrogate pairs

Palindrome checking


Excellent answer, great chances for the bounty ;-) Personally I'd add that some argue for UTF-8 as the universal character encoding, but I know that that's a opinion that's not necessarily shared by everyone.
Still too technical for me at this stage. How is the word hello stored in a computer in UTF-8 and UTF-16 ?
Could you expand more on why, for example, the BMP takes 3 bytes in UTF-8? I would have thought that since its maximum value is 0xFFFF (16 bits) then it would only take 2 bytes to access.
@mark Some bits are reserved for encoding purposes. For a code point that takes 2 bytes in UTF-8, there are 5 reserved bits, leaving only 11 bits to select a code point. U+07FF ends up being the highest code point representable in 2 bytes.
BTW - ASCII only defines 128 code points, using only 7 bits for representation. It is ISO-8859-1/ISO-8859-15 which define 256 code points and use 8 bits for representation. The first 128 code points in all these 3 are the same.
w
wengeezhang

Unicode is a set of characters used around the world

is a set of characters used around the world

UTF-8 a character encoding capable of encoding all possible characters (called code points) in Unicode. code unit is 8-bits use one to four code units to encode Unicode 00100100 for "$" (one 8-bits);11000010 10100010 for "¢" (two 8-bits);11100010 10000010 10101100 for "€" (three 8-bits)

a character encoding capable of encoding all possible characters (called code points) in Unicode.

code unit is 8-bits

use one to four code units to encode Unicode

00100100 for "$" (one 8-bits);11000010 10100010 for "¢" (two 8-bits);11100010 10000010 10101100 for "€" (three 8-bits)

UTF-16 another character encoding code unit is 16-bits use one to two code units to encode Unicode 00000000 00100100 for "$" (one 16-bits);11011000 01010010 11011111 01100010 for "𤭢" (two 16-bits)

another character encoding

code unit is 16-bits

use one to two code units to encode Unicode

00000000 00100100 for "$" (one 16-bits);11011000 01010010 11011111 01100010 for "𤭢" (two 16-bits)


The character before "two 16-bits" does not render (Firefox version 97.0 on Ubuntu MATE 20.04 (Focal Fossa)).
P
Peter Mortensen

Unicode is a fairly complex standard. Don’t be too afraid, but be prepared for some work! [2]

Because a credible resource is always needed, but the official report is massive, I suggest reading the following:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) An introduction by Joel Spolsky, Stack Exchange CEO. To the BMP and beyond! A tutorial by Eric Muller, Technical Director then, Vice President later, at The Unicode Consortium (the first 20 slides and you are done)

A brief explanation:

Computers read bytes and people read characters, so we use encoding standards to map characters to bytes. ASCII was the first widely used standard, but covers only Latin (seven bits/character can represent 128 different characters). Unicode is a standard with the goal to cover all possible characters in the world (can hold up to 1,114,112 characters, meaning 21 bits/character maximum. Current Unicode 8.0 specifies 120,737 characters in total, and that's all).

The main difference is that an ASCII character can fit to a byte (eight bits), but most Unicode characters cannot. So encoding forms/schemes (like UTF-8 and UTF-16) are used, and the character model goes like this:

Every character holds an enumerated position from 0 to 1,114,111 (hex: 0-10FFFF) called a code point.
An encoding form maps a code point to a code unit sequence. A code unit is the way you want characters to be organized in memory, 8-bit units, 16-bit units and so on. UTF-8 uses one to four units of eight bits, and UTF-16 uses one or two units of 16 bits, to cover the entire Unicode of 21 bits maximum. Units use prefixes so that character boundaries can be spotted, and more units mean more prefixes that occupy bits. So, although UTF-8 uses one byte for the Latin script, it needs three bytes for later scripts inside a Basic Multilingual Plane, while UTF-16 uses two bytes for all these. And that's their main difference.
Lastly, an encoding scheme (like UTF-16BE or UTF-16LE) maps (serializes) a code unit sequence to a byte sequence.

character: π code point: U+03C0 encoding forms (code units): UTF-8: CF 80 UTF-16: 03C0 encoding schemes (bytes): UTF-8: CF 80 UTF-16BE: 03 C0 UTF-16LE: C0 03

Tip: a hexadecimal digit represents four bits, so a two-digit hex number represents a byte.
Also take a look at plane maps on Wikipedia to get a feeling of the character set layout.


Joel Spolsky is no longer the CEO.
P
Peter Mortensen

The article What every programmer absolutely, positively needs to know about encodings and character sets to work with text explains all the details.

Writing to buffer

if you write to a 4 byte buffer, symbol with UTF8 encoding, your binary will look like this:

00000000 11100011 10000001 10000010

if you write to a 4 byte buffer, symbol with UTF16 encoding, your binary will look like this:

00000000 00000000 00110000 01000010

As you can see, depending on what language you would use in your content this will effect your memory accordingly.

Example: For this particular symbol: UTF16 encoding is more efficient since we have 2 spare bytes to use for the next symbol. But it doesn't mean that you must use UTF16 for Japan alphabet.

Reading from buffer

Now if you want to read the above bytes, you have to know in what encoding it was written to and decode it back correctly.

e.g. If you decode this : 00000000 11100011 10000001 10000010 into UTF16 encoding, you will end up with not

Note: Encoding and Unicode are two different things. Unicode is the big (table) with each symbol mapped to a unique code point. e.g. symbol (letter) has a (code point): 30 42 (hex). Encoding on the other hand, is an algorithm that converts symbols to more appropriate way, when storing to hardware.

30 42 (hex) - > UTF8 encoding - > E3 81 82 (hex), which is above result in binary.

30 42 (hex) - > UTF16 encoding - > 30 42 (hex), which is above result in binary.

https://i.stack.imgur.com/6C0C6.png


Great answer, which I upvoted. Would you be so kind to check if this part of your answer is how you thought it should be (because it does not make sense): "converts symbols to more appropriate way".
The title of the reference, "What every programmer absolutely, positively needs to know about encodings and character sets to work with text", is close to be plagiarism of the Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".
P
Peter Mortensen

Originally, Unicode was intended to have a fixed-width 16-bit encoding (UCS-2). Early adopters of Unicode, like Java and Windows NT, built their libraries around 16-bit strings.

Later, the scope of Unicode was expanded to include historical characters, which would require more than the 65,536 code points a 16-bit encoding would support. To allow the additional characters to be represented on platforms that had used UCS-2, the UTF-16 encoding was introduced. It uses "surrogate pairs" to represent characters in the supplementary planes.

Meanwhile, a lot of older software and network protocols were using 8-bit strings. UTF-8 was made so these systems could support Unicode without having to use wide characters. It's backwards-compatible with 7-bit ASCII.


It's worth noting that Microsoft still refers to UTF-16 as Unicode, adding to the confusion. The two are not the same.
P
Peter Mortensen

Unicode is a standard which maps the characters in all languages to a particular numeric value called a code point. The reason it does this is that it allows different encodings to be possible using the same set of code points.

UTF-8 and UTF-16 are two such encodings. They take code points as input and encodes them using some well-defined formula to produce the encoded string.

Choosing a particular encoding depends upon your requirements. Different encodings have different memory requirements and depending upon the characters that you will be dealing with, you should choose the encoding which uses the least sequences of bytes to encode those characters.

For more in-depth details about Unicode, UTF-8 and UTF-16, you can check out this article,

What every programmer should know about Unicode


P
Peter Mortensen

Why Unicode? Because ASCII has just 127 characters. Those from 128 to 255 differ in different countries, and that's why there are code pages. So they said: let’s have up to 1114111 characters.

So how do you store the highest code point? You'll need to store it using 21 bits, so you'll use a DWORD having 32 bits with 11 bits wasted. So if you use a DWORD to store a Unicode character, it is the easiest way, because the value in your DWORD matches exactly the code point.

But DWORD arrays are of course larger than WORD arrays and of course even larger than BYTE arrays. That's why there is not only UTF-32, but also UTF-16. But UTF-16 means a WORD stream, and a WORD has 16 bits, so how can the highest code point 1114111 fit into a WORD? It cannot!

So they put everything higher than 65535 into a DWORD which they call a surrogate-pair. Such a surrogate-pair are two WORDS and can get detected by looking at the first 6 bits.

So what about UTF-8? It is a byte array or byte stream, but how can the highest code point 1114111 fit into a byte? It cannot! Okay, so they put in also a DWORD right? Or possibly a WORD, right? Almost right!

They invented utf-8 sequences which means that every code point higher than 127 must get encoded into a 2-byte, 3-byte or 4-byte sequence. Wow! But how can we detect such sequences? Well, everything up to 127 is ASCII and is a single byte. What starts with 110 is a two-byte sequence, what starts with 1110 is a three-byte sequence and what starts with 11110 is a four-byte sequence. The remaining bits of these so called "startbytes" belong to the code point.

Now depending on the sequence, following bytes must follow. A following byte starts with 10, and the remaining bits are 6 bits of payload bits and belong to the code point. Concatenate the payload bits of the startbyte and the following byte/s and you'll have the code point. That's all the magic of UTF-8.


utf-8 example of € (Euro) sign decoded in utf-8 3-byte sequence: E2=11100010 82=10000010 AC=10101100 As you can see, E2 starts with 1110 so this is a three-byte sequence As you can see, 82 as well as AC starts with 10 so these are following bytes Now we concatenate the "payload bits": 0010 + 000010 + 101100 = 10000010101100 which is decimal 8364 So 8364 must be the codepoint for the € (Euro) sign.
P
Peter Mortensen

ASCII - Software allocates only 8 bit byte in memory for a given character. It works well for English and adopted (loanwords like façade) characters as their corresponding decimal values falls below 128 in the decimal value. Example C program.

UTF-8 - Software allocates one to four variable 8-bit bytes for a given character. What is meant by a variable here? Let us say you are sending the character 'A' through your HTML pages in the browser (HTML is UTF-8), the corresponding decimal value of A is 65, when you convert it into decimal it becomes 01000010. This requires only one byte, and one byte memory is allocated even for special adopted English characters like 'ç' in the word façade. However, when you want to store European characters, it requires two bytes, so you need UTF-8. However, when you go for Asian characters, you require minimum of two bytes and maximum of four bytes. Similarly, emojis require three to four bytes. UTF-8 will solve all your needs.

UTF-16 will allocate minimum 2 bytes and maximum of 4 bytes per character, it will not allocate 1 or 3 bytes. Each character is either represented in 16 bit or 32 bit.

Then why does UTF-16 exist? Originally, Unicode was 16 bit not 8 bit. Java adopted the original version of UTF-16.

In a nutshell, you don't need UTF-16 anywhere unless it has been already been adopted by the language or platform you are working on.

Java program invoked by web browsers uses UTF-16, but the web browser sends characters using UTF-8.


"You don't need UTF-16 anywhere unless it has been already been adopted by the language or platform": This is a good point but here is a non-inclusive list: JavaScript, Java, .NET, SQL NCHAR, SQL NVARCHAR, VB4, VB5, VB6, VBA, VBScript, NTFS, Windows API….
Re "when you want to store European characters, it requires two bytes, so you need UTF-8": Unless code pages are used, e.g. CP-1252.
Re "the web browser sends characters using UTF-8": Unless something like ISO 8859-1 is specified on a web page(?). E.g. <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
P
Peter Mortensen

UTF stands for stands for Unicode Transformation Format. Basically, in today's world there are scripts written in hundreds of other languages, formats not covered by the basic ASCII used earlier. Hence, UTF came into existence.

UTF-8 has character encoding capabilities and its code unit is eight bits while that for UTF-16 it is 16 bits.