What's the difference between Unicode and UTF-8? [duplicate]

This question already has answers here: What is the difference between UTF-8 and Unicode? (18 answers) Closed 5 years ago.

Consider:

https://i.stack.imgur.com/3ayWh.jpg

Is it true that Unicode = UTF-16?

Many people say Unicode is a standard, not an encoding, yet most editors actually offer "Unicode" as an encoding in their Save As dialogs.

No, I can't, because most text editors do it this way.
@olly: are you using Windows? Get Notepad++.
I'm using EditPlus; it has always been good, and I don't want to switch.

DSlatkin

As Rasmus states in his article "The difference between UTF-8 and Unicode?":

If asked the question, "What is the difference between UTF-8 and Unicode?", would you confidently reply with a short and precise answer? In these days of internationalization all developers should be able to do that. I suspect many of us do not understand these concepts as well as we should. If you feel you belong to this group, you should read this ultra-short introduction to character sets and encodings.

Actually, comparing UTF-8 and Unicode is like comparing apples and oranges:

UTF-8 is an encoding - Unicode is a character set

A character set is a list of characters with unique numbers (these numbers are sometimes referred to as "code points"). For example, in the Unicode character set, the number for A is 41 (hexadecimal; 65 in decimal).

An encoding, on the other hand, is an algorithm that translates a list of numbers to binary so it can be stored on disk. For example, UTF-8 would translate the number sequence 1, 2, 3, 4 like this:

00000001 00000010 00000011 00000100

Our data is now translated into binary and can be saved to disk.

All together now

Say an application reads the following from the disk:

1101000 1100101 1101100 1101100 1101111

The app knows this data represents a Unicode string encoded with UTF-8 and must show it as text to the user. The first step is to convert the binary data to numbers. The app uses the UTF-8 algorithm to decode the data. In this case, the decoder returns this:

104 101 108 108 111

Since the app knows this is a Unicode string, it can assume each number represents a character. We use the Unicode character set to translate each number to a corresponding character. The resulting string is "hello".

Conclusion

So when somebody asks you "What is the difference between UTF-8 and Unicode?", you can now confidently answer short and precise:

UTF-8 (Unicode Transformation Format) and Unicode cannot be compared. UTF-8 is an encoding used to translate numbers into binary data. Unicode is a character set used to translate characters into numbers.
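
To make the two steps concrete, here is a minimal Python 3 sketch (my illustration, not from Rasmus's article) separating the character-set lookup from the encoding step:

    s = "hello"
    # Character set: each character has a unique number (code point).
    code_points = [ord(c) for c in s]
    print(code_points)            # [104, 101, 108, 108, 111]
    # Encoding: UTF-8 turns those numbers into bytes for the disk.
    data = s.encode("utf-8")
    print(list(data))             # [104, 101, 108, 108, 111] - same here,
                                  # because these are all ASCII characters
    # Decoding reverses the trip: bytes -> numbers -> characters.
    print(data.decode("utf-8"))   # hello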


This is totally correct, and answers the question posed in the title. It does not however answer the actual question, which is based on a misrepresentation of Microsoft using Unicode to refer to UTF-16.
I'm sorry, but I'm not following at all. "It uses a UTF-8 algorithm to decode the binary" - what? Binary is binary; it just needs math to be converted back to decimal. If you tell me that A is 41 in Unicode, then I don't need anything else to store it in binary and get it back.
UTF-8 encoding dynamically allocates bytes depending on each character, whereas UTF-32 uses 32 bits for every character. This answer's examples use only 7-bit ASCII characters, which is why they are easy to understand and satisfy most readers, but UTF-8 is not that simple. It would be nice if you also added some multi-byte examples.
UTF-8 is not only an encoding, it's also a character set. Or more precisely, UTF-8 uses Unicode as its character set. What I mean is you can't use it as an encoding for another character set.
@sliders_alpha "Binary is binary; it just needs math to be converted back to decimal" - wrong, very wrong. We are not talking about numerical bases here; we are talking about encoding schemes. UTF-8 doesn't simply convert a decimal number into binary; it's more complex than that. The example given in this answer is poor because it uses the numbers 1, 2, 3, and 4, which just happen to encode as their own binary representations, but this is not true in general, especially for UTF-8, which uses non-trivial bit offsets in the encoding. I suggest you read the Wikipedia article on the UTF-8 encoding algorithm.
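
To back up this comment, a short Python 3 sketch (an illustration, not part of the original comment) with a character whose UTF-8 bytes are not simply its code point written out in binary:

    ch = "€"                          # EURO SIGN, code point U+20AC
    print(hex(ord(ch)))               # 0x20ac
    print(ch.encode("utf-8").hex())   # e282ac - three bytes, not 0x20AC
    # UTF-8 spreads the 16 bits of U+20AC across a 3-byte pattern:
    # 1110xxxx 10xxxxxx 10xxxxxx  ->  11100010 10000010 10101100
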
hannson

most editors support save as ‘Unicode’ encoding actually.

This is an unfortunate misnaming perpetrated by Windows.

Because Windows uses UTF-16LE encoding internally as the memory storage format for Unicode strings, it considers this to be the natural encoding of Unicode text. In the Windows world, there are ANSI strings (the system codepage on the current machine, subject to total unportability) and there are Unicode strings (stored internally as UTF-16LE).

This was all devised in the early days of Unicode, before we realised that UCS-2 wasn't enough, and before UTF-8 was invented. This is why Windows's support for UTF-8 is all-round poor.

This misguided naming scheme became part of the user interface. A text editor that uses Windows's encoding support to provide a range of encodings will automatically and inappropriately describe UTF-16LE as “Unicode”, and UTF-16BE, if provided, as “Unicode big-endian”.

(Other editors that do encodings themselves, like Notepad++, don't have this problem.)

If it makes you feel any better about it, ‘ANSI’ strings aren't based on any ANSI standard, either.
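
As a quick demonstration (a Python 3 sketch, not part of the original answer), here is what an editor saving in Windows-style "Unicode" actually writes, compared with UTF-8:

    s = "hi"
    print(s.encode("utf-16-le").hex())   # 68006900 - one little-endian
                                         # 16-bit unit per character
    print(s.encode("utf-8").hex())       # 6869     - one byte per ASCII char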


Preet Sangha

It's not that simple.

UTF-16 is a variable-width encoding built on 16-bit code units. Simply calling something "Unicode" is ambiguous, since "Unicode" refers to an entire set of standards for character encoding. Unicode is not an encoding!

http://en.wikipedia.org/wiki/Unicode#Unicode_Transformation_Format_and_Universal_Character_Set

and of course, the obligatory Joel On Software - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) link.
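
To illustrate the variable width (a Python 3 sketch, my addition): characters outside the Basic Multilingual Plane need two 16-bit units, a so-called surrogate pair:

    bmp = "A"       # U+0041, inside the BMP
    astral = "𝄞"    # U+1D11E, MUSICAL SYMBOL G CLEF, outside the BMP
    print(len(bmp.encode("utf-16-be")))      # 2 bytes: one 16-bit unit
    print(len(astral.encode("utf-16-be")))   # 4 bytes: a surrogate pair
    print(astral.encode("utf-16-be").hex())  # d834dd1e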


Jerry Coffin

There's a lot of misunderstanding being displayed here. Unicode isn't an encoding, but the Unicode standard is devoted primarily to encoding anyway.

ISO 10646 is the international character set you (probably) care about. It defines a mapping between a set of named characters (e.g., "Latin Capital Letter A" or "Greek small letter alpha") and a set of code points (a number assigned to each -- for example, 41 hexadecimal and 3B1 hexadecimal for those two respectively; for Unicode code points, the standard notation would be U+0041 and U+03B1).

At one time, Unicode defined its own character set, more or less as a competitor to ISO 10646. That was a 16-bit character set, but it was not UTF-16; it was known as UCS-2. It included a rather controversial technique to try to keep the number of necessary characters to a minimum (Han Unification -- basically treating Chinese, Japanese and Korean characters that were quite a bit alike as being the same character).

Since then, the Unicode consortium has tacitly admitted that that wasn't going to work, and now concentrates primarily on ways to encode the ISO 10646 character set. The primary methods are UTF-8, UTF-16 and UCS-4 (aka UTF-32). Those (except for UTF-8) also have LE (little-endian) and BE (big-endian) variants.

By itself, "Unicode" could refer to almost any of the above (though we can probably eliminate the others that it shows explicitly, such as UTF-8). Unqualified use of "Unicode" probably happens the most often on Windows, where it will almost certainly refer to UTF-16. Early versions of Windows NT adopted Unicode when UCS-2 was current. After UCS-2 was declared obsolete (around Win2k, if memory serves), they switched to UTF-16, which is the most similar to UCS-2 (in fact, it's identical for characters in the "basic multilingual plane", which covers a lot, including all the characters for most Western European languages).
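
The named-character-to-code-point mapping is easy to inspect; here is a minimal Python 3 sketch (my illustration) using the two examples from this answer:

    import unicodedata

    for ch in ("A", "α"):
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
    # U+0041 LATIN CAPITAL LETTER A
    # U+03B1 GREEK SMALL LETTER ALPHA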


OK, but why did MS perpetuate this into .NET? Wasn't .NET a post-Win2k invention?
Mark Ransom

UTF-16 and UTF-8 are both encodings of Unicode. They are both Unicode; one is not more Unicode than the other.

Don't let an unfortunate historical artifact from Microsoft confuse you.
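
A tiny Python 3 sketch (my addition) of the point: the bytes differ, but both decode to the same Unicode text:

    s = "naïve"
    utf8 = s.encode("utf-8")
    utf16 = s.encode("utf-16-le")
    print(utf8.hex())     # 6e61c3af7665
    print(utf16.hex())    # 6e006100ef0076006500
    print(utf8.decode("utf-8") == utf16.decode("utf-16-le"))   # True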


Trufa

The development of Unicode was aimed at creating a new standard for mapping the characters of the great majority of languages in use today, along with other characters that are not essential but may be needed in text. UTF-8 is only one of many ways to encode a file, because there are many ways to encode the Unicode characters in a file into bytes.

Source:

http://www.differencebetween.net/technology/difference-between-unicode-and-utf-8/


Peter Mortensen

In addition to Trufa's comment, Unicode explicitly isn't UTF-16. When Unicode was first designed, it was speculated that a 16-bit integer might be enough to store any code point, but in practice that turned out not to be the case. However, UTF-16 is another valid encoding of Unicode - alongside the 8-bit and 32-bit variants - and I believe it is the encoding that Microsoft uses in memory at runtime on the NT-derived operating systems.


So for Visual Studio, Unicode = UTF-16 holds, right?
@ollydbg, it is true that UTF-16 is the natural representation of Unicode in Windows, but that does not make them identical.
Peter Mortensen

Let's start by keeping in mind that data is stored as bytes. Unicode is a character set that maps characters to code points (unique integers), and we need something to translate that code-point data into bytes. That's where UTF-8, a so-called encoding, comes in - simple!
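
In Python 3 terms (a minimal sketch, my addition), the two steps look like this:

    cp = ord("ñ")                  # character set: 'ñ' -> code point
    data = "ñ".encode("utf-8")     # encoding: code point -> bytes
    print(hex(cp), data.hex())     # 0xf1 c3b1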


Peter Mortensen

It's weird. Unicode is a standard, not an encoding. As it is possible to specify the endianness, I guess what the menu offers is effectively UTF-16, or maybe UTF-32.
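
A small Python 3 sketch (my illustration) of how the endianness shows up in the bytes, including the byte order mark (BOM) a file may start with:

    s = "A"
    print(s.encode("utf-16-be").hex())   # 0041 - big-endian
    print(s.encode("utf-16-le").hex())   # 4100 - little-endian
    print(s.encode("utf-16").hex())      # fffe4100 on a little-endian
                                         # machine: BOM FF FE, then LE data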

Where does this menu come from?


From a text editor called EditPlus.