ChatGPT解决这个技术问题 Extra ChatGPT

MySQL VARCHAR Lengths and UTF-8

In MySQL, if I create a new VARCHAR(32) field in a UTF-8 table does it means I can store 32 bytes of data in that field or 32 chars (multi-byte)?

@naXa: I didn't. You think I should?
I don't know.) It's your question, and it's up to you. I just wanted to say "another answer looks more complete".
@robsch The previous accepted answer was simple and correct. But do to popular demand I've accepted the one you want.

i
informatik01

This answer showed up at the top of my google search results but wasn't correct.

The confusion is probably due to different versions of MySQL being tested.

Version 4 counts bytes

Version 5 counts characters

Here is the quote from the official MySQL 5 documentation:

MySQL interprets length specifications in character column definitions in character units. (Before MySQL 4.1, column lengths were interpreted in bytes.) This applies to CHAR, VARCHAR, and the TEXT types.

Interestingly (I hadn't thought about it) the max length of a varchar column is affected by utf8 as follows:

The effective maximum length of a VARCHAR in MySQL 5.0.3 and later is subject to the maximum row size (65,535 bytes, which is shared among all columns) and the character set used. For example, utf8 characters can require up to three bytes per character, so a VARCHAR column that uses the utf8 character set can be declared to be a maximum of 21,844 characters.


M Brown, thanks for mentioning this. A VARCHAR(10) field (using utf8mb4) can store "💩💩💩💩💩💩💩💩💩💩" (10 piles of poo), that's 10 characters but 40 bytes.
This. This is the only right answer. Far too many people believe the version 4 behavior as gospel.
The accepted answer is also correct for MySQL 5 -- the numbers inserted were actually part of the full-width char set and are multi-byte unicode characters, as also mentioned by the poster that he inserted "32 multibytes data". It's a shame so many people misunderstood.
Citing the following source, I believe a utf8 character currently requires up to 6 bytes so anywhere between 1 and 6 bytes. This causes the worst case for a character maximum to be 10922. I think. joelonsoftware.com/articles/Unicode.html
@usumoio Currently, it looks like MySQL uses the 3-byte variant of UTF-8, with migration to the (standard) 4-byte variant being planned: dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8.html.
j
jspcal

it would let you store 32 multi-byte chars

To save space with UTF-8, use VARCHAR instead of CHAR. Otherwise, MySQL must reserve three bytes for each character in a CHAR CHARACTER SET utf8 column because that is the maximum possible length. For example, MySQL must reserve 30 bytes for a CHAR(10) CHARACTER SET utf8 column.

http://dev.mysql.com/doc/refman/5.0/en/charset-unicode.html


I almost never use CHAR and when I do it's not intended to store multi-byte chars, so I'm safe. What about VARCHAR, are you sure the limit is defined in multi-byte chars and not on single-byte chars?
@jspcal: UTF-8 uses a maximum of 4 bytes per character, not 3. Or does MySQL not support all 4 bytes?
@RemyLebeau You are right about utf8, but not for MySQL. The various utf8_xxx character sets are 3-byte maximum. The utf8mb4_xxx take 4 byte characters. dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html
With the progression of time, it looks like MySQL will finally use the standard 4-byte version (but not yet, at the time of writing): dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8.html.
A
Alix Axel

32 multibytes data for varchar(32) with collation utf8_unicode_ci, I just tested with XAMPP.

1234567890123456789012345678901234567890

Get truncated to:

12345678901234567890123456789012

Keep in mind that these are not regular ASCII chars.


In UTF-8 standard ASCII chars will only be stored in a single byte - to really test this you need to actually use some multibyte (ie. non-ascii) chars in your test sting.
This is wrong, at least for MySQL 5+. When specifying column size for varchar or char, it is specified in terms of characters. I believe the actual size of a VARCHAR(32) column would be 32x3+1=97 bytes.
@rjmackay '12345' are not standard ASCII chars. en.wikipedia.org/wiki/…
I inserted 40 unicode characters into DB, and got truncated at 32 character. But looks like people think I used ascii bytes and get truncated at 32 bytes. No wonder, I got downvotes, lol.
@ButtleButkus " I believe the actual size of a VARCHAR(32) column would be 32x3+1=97 bytes" It would if you use utf8, but then you get broken Unicode support in MySQL. You should use utf8mb4 encoding instead, because there are max. 4 bytes in a utf-8 char, not 3 as in MySQL's variant of utf8...
N
Nudge

It is better to use "char" for high-frequent update tables because the total data length of the row will be fixed and fast. Varchar columns make row data sizes dynamic. That's not good for MyISAM, but I don't know about InnoDB and others. For example, if you have a very narrow "type" column, it may be better to use char(2) with latin1 charset to only claim minimal space.


I've read that if ANY column in a table is varchar, then you lose all benefit of having char columns. Basically, it seems like you have to go with all varchar or all char in a table for maximum benefit. I don't know if it's true, though.
For MyISAM there is some argument for CHAR. For InnoDB, so many other things are going on that the "dynamic/fixed row size" debate is essentially irrelevant.
IMHO the important point here is that for very small lengths, it may be beneficial to use CHAR.
L
Laurent Lyaudet

If you connect to the database using latin1 encoding (for example with PHP) to save an PHP UTF8 string in an MySQL UTF8 column, you will have a double UTF8 encoding.

If the UTF8 string $s is 32 characters long but 64 bytes long and the column is VARCHAR(32) UTF8, the double encoding will convert the string $s to a 64 characters long UTF8 string that will be truncated in the database to its 32 first characters corresponding to the 32 first bytes of $s. You may end up thinking that MySQL 5 behaves like MySQL 4 but it is in fact a second cause for the same effect.