ChatGPT解决这个技术问题 Extra ChatGPT

What does character set and collation mean exactly?

I can read the MySQL documentation and it's pretty clear. But, how does one decide which character set to use? On what data does collation have an effect?

I'm asking for an explanation of the two and how to choose them.


r
revo

From MySQL docs:

A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Let's make the distinction clear with an example of an imaginary character set. Suppose that we have an alphabet with four letters: 'A', 'B', 'a', 'b'. We give each letter a number: 'A' = 0, 'B' = 1, 'a' = 2, 'b' = 3. The letter 'A' is a symbol, the number 0 is the encoding for 'A', and the combination of all four letters and their encodings is a character set. Now, suppose that we want to compare two string values, 'A' and 'B'. The simplest way to do this is to look at the encodings: 0 for 'A' and 1 for 'B'. Because 0 is less than 1, we say 'A' is less than 'B'. Now, what we've just done is apply a collation to our character set. The collation is a set of rules (only one rule in this case): "compare the encodings." We call this simplest of all possible collations a binary collation. But what if we want to say that the lowercase and uppercase letters are equivalent? Then we would have at least two rules: (1) treat the lowercase letters 'a' and 'b' as equivalent to 'A' and 'B'; (2) then compare the encodings. We call this a case-insensitive collation. It's a little more complex than a binary collation. In real life, most character sets have many characters: not just 'A' and 'B' but whole alphabets, sometimes multiple alphabets or eastern writing systems with thousands of characters, along with many special symbols and punctuation marks. Also in real life, most collations have many rules: not just case insensitivity but also accent insensitivity (an "accent" is a mark attached to a character as in German 'ö') and multiple-character mappings (such as the rule that 'ö' = 'OE' in one of the two German collations).


Thanks, but you didn't answer how should we choose the best CHARSET and COLLATION for our database, table, etc? For CHARSET I think the answer is easy, we should choose that CHARSET that contains the characters that we are going to use in our database, table, etc. But what about the best COLLATION?
i
informatik01

A character encoding is a way to encode characters so that they fit in memory. That is, if the charset is ISO-8859-15, the euro symbol, €, will be encoded as 0xa4, and in UTF-8, it will be 0xe282ac.

The collation is how to compare characters, in latin9, there are letters as e é è ê f, if sorted by their binary representation, it will go e f é ê è but if the collation is set to, for example, French, you'll have them in the order you thought they would be, which is all of e é è ê are equal, and then f.


Important to note that there could be many different collations for a single charset. The one that is "right" depends on the semantics of the text which is normally determined by the language it's written in.
e
erickson

A character set is a subset of all written glyphs. A character encoding specifies how those characters are mapped to numeric values. Some character encodings, like UTF-8 and UTF-16, can encode any character in the Universal Character Set. Others, like US-ASCII or ISO-8859-1 can only encode a small subset, since they use 7 and 8 bits per character, respectively. Because many standards specify both a character set and a character encoding, the term "character set" is often substituted freely for "character encoding".

A collation comprises rules that specify how characters can be compared for sorting. Collations rules can be locale-specific: the proper order of two characters varies from language to language.

Choosing a character set and collation comes down to whether your application is internationalized or not. If not, what locale are you targeting?

In order to choose what character set you want to support, you have to consider your application. If you are storing user-supplied input, it might be hard to foresee all the locales in which your software will eventually be used. To support them all, it might be best to support the UCS (Unicode) from the start. However, there is a cost to this; many western European characters will now require two bytes of storage per character instead of one.

Choosing the right collation can help performance if your database uses the collation to create an index, and later uses that index to provide sorted results. However, since collation rules are often locale-specific, that index will be worthless if you need to sort results according to the rules of another locale.


sorry dude I am just beginner and just need your clarification. So, can I understand collation like this, it is what ensure that each character(being either latin or chinese) is properly recognized and given respective encodings. Is that correct? I hope for your reply
@Mirich No, collation is information about how to sort characters. Different regions of the world prefer to sort characters in different ways.
s
simhumileco

I suggest to use utf8mb4_unicode_ci, which is based on the Unicode standard for sorting and comparison, which sorts accurately in a very wide range of languages.


from the OP: "I'm asking for an explanation of the two and how to choose them"
@simhumileco, sorry dude I am just beginner and just need your clarification. So, can I understand collation like this, it is what ensure that each character(being either latin or chinese) is properly recognized and given respective encodings. Is that correct? I hope for your reply
@Mirich It all depends on what encoding you use in other places, if you use UTF-8 encoding in a system outside of the database, then everything in the database should also be properly written if you use utf8mb4 in MySQL. When it comes to the correct operation of sorting, comparing and transforming text for specific characters in MySQL, it is difficult to find the perfect solution, but *_unicode_ci is certainly better than *_general, but it also has its drawbacks. Please read: dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html