
Is there a REAL performance difference between INT and VARCHAR primary keys?

Is there a measurable performance difference between using INT vs. VARCHAR as a primary key in MySQL? I'd like to use VARCHAR as the primary key for reference lists (think US States, Country Codes) and a coworker won't budge on the INT AUTO_INCREMENT as a primary key for all tables.

My argument, as detailed here, is that the performance difference between INT and VARCHAR is negligible: every INT foreign key reference requires a JOIN to make sense of, whereas a VARCHAR key presents the information directly.
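For instance (a hypothetical orders/states pair; the table and column names are mine, not from any actual schema), the two styles compare like this:

-- With an INT surrogate key, displaying the state requires a JOIN:
SELECT o.order_id, s.state_code
FROM orders o
JOIN states s ON s.state_id = o.state_id;

-- With a VARCHAR natural key, the foreign key column already carries the information:
SELECT order_id, state_code
FROM orders;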

So, does anyone have experience with this particular use-case and the performance concerns associated with it?

I made a post with the answer "no" with some detail of tests I've run... but that was SQL Server, not MySQL. So I deleted my answer.
@Timothy - you shouldn't have deleted it. I was in the process of voting it up. Most SQL database servers have similar query planners and similar performance bottlenecks.
@Timothy please repost your results.
So many comments and answers assume that keys are there to be used for joins. They are not. Keys are there to be used for data consistency - to avoid duplicate rows (more than one row representing the same entity). Any column (or set of columns) can be used in a join, and to guarantee that the join is one-to-(zero-or-many), the column[s] simply need to be unique. Any unique index guarantees that, and it need not be meaningful.

Tamlyn

You make a good point that you can avoid a number of joins by using what's called a natural key instead of a surrogate key. Only you can assess whether the benefit of this is significant in your application.

That is, you can measure the queries in your application that are the most important to be speedy, because they work with large volumes of data or they are executed very frequently. If these queries benefit from eliminating a join, and do not suffer by using a varchar primary key, then do it.

Don't use either strategy for all tables in your database. It's likely that in some cases, a natural key is better, but in other cases a surrogate key is better.

Other folks make a good point that it's rare in practice for a natural key to never change or have duplicates, so surrogate keys are usually worthwhile.
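A minimal sketch of the two strategies discussed here (hypothetical table names, mine; assuming MySQL/InnoDB):

-- Natural key: the code itself is the primary key, so referencing tables need no join to display it
CREATE TABLE states_natural (
    state_code CHAR(2) PRIMARY KEY,
    state_name VARCHAR(30) NOT NULL
) ENGINE=InnoDB;

-- Surrogate key: referencing tables store a small int but must join back to get the code
CREATE TABLE states_surrogate (
    state_id   TINYINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    state_code CHAR(2) NOT NULL,
    state_name VARCHAR(30) NOT NULL
) ENGINE=InnoDB;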


And sometimes (imho, often) using both is better: the surrogate to use for FK references in other tables and for joins, and the natural key to ensure data consistency.
@CharlesBretana That's interesting. Is using a natural key for data consistency alongside the FK a common practice? My first thought was that the extra storage that would be required on large tables might not make it worthwhile. Any information is appreciated. FYI - I have a decent programming background but my SQL experience is limited mostly to SELECT queries.
@CharlesBretana When I read "store them both", I think "redundancy" and "not normalized", which equals "This stuff could get screwed up" and "I have to make sure both are changed if one is ever changed". If you have redundancy, there should be a very good reason (like completely unacceptable performance) because redundancy is always potential for your data to become inconsistent.
@jpmc26, There are absolutely NO issues of redundancy or normalization involved. A surrogate key has no meaningful connection to the values in a natural key, so it should never need to be changed. As to normalization, what normalization issues are you talking about? Normalization applies to meaningful attributes of a relation; the numeric value of a surrogate key (indeed, the very concept of a surrogate key itself) lies completely outside the context of any normalization.
And to answer your other question, specifically about a table of states: if you had a surrogate key on this table, with values, say, from 1 to 50, but you did NOT put another unique index or key on the state postal code (and, in my opinion, on the state name as well), then what's to stop someone from entering two rows with different surrogate key values but with the same postal code and/or state name? How would the client app handle it if there were two rows with 'NJ', 'New Jersey'? Natural keys ensure data consistency!
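A minimal sketch of the both-keys approach Charles describes (hypothetical names, mine; assuming MySQL/InnoDB):

CREATE TABLE states (
    state_id    TINYINT UNSIGNED AUTO_INCREMENT PRIMARY KEY, -- surrogate, used by FKs and joins
    postal_code CHAR(2) NOT NULL,
    state_name  VARCHAR(30) NOT NULL,
    UNIQUE KEY uq_states_postal_code (postal_code),          -- natural key: blocks a second 'NJ'
    UNIQUE KEY uq_states_name (state_name)                   -- blocks a second 'New Jersey'
) ENGINE=InnoDB;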
Jan Żankowski

I was a bit annoyed by the lack of benchmarks for this online, so I ran a test myself.

Note, though, that I don't do this on a regular basis, so please check my setup and steps for any factors that could have influenced the results unintentionally, and post your concerns in comments.

The setup was as follows:

Intel® Core™ i7-7500U CPU @ 2.70GHz × 4

15.6 GiB RAM, of which I ensured around 8 GB was free during the test.

148.6 GB SSD drive, with plenty of free space.

Ubuntu 16.04 64-bit

MySQL Ver 14.14 Distrib 5.7.20, for Linux (x86_64)

The tables:

create table jan_int (data1 varchar(255), data2 int(10), myindex tinyint(4)) ENGINE=InnoDB;
create table jan_int_index (data1 varchar(255), data2 int(10), myindex tinyint(4), INDEX (myindex)) ENGINE=InnoDB;
create table jan_char (data1 varchar(255), data2 int(10), myindex char(6)) ENGINE=InnoDB;
create table jan_char_index (data1 varchar(255), data2 int(10), myindex char(6), INDEX (myindex)) ENGINE=InnoDB;
create table jan_varchar (data1 varchar(255), data2 int(10), myindex varchar(63)) ENGINE=InnoDB;
create table jan_varchar_index (data1 varchar(255), data2 int(10), myindex varchar(63), INDEX (myindex)) ENGINE=InnoDB;

Then I filled each table with 10 million rows using a PHP script whose essence is this:

$pdo = get_pdo();

$keys = [ 'alabam', 'massac', 'newyor', 'newham', 'delawa', 'califo', 'nevada', 'texas_', 'florid', 'ohio__' ];

// 10 outer iterations x 1000 batches x 1000 rows per INSERT = 10 million rows
for ($k = 0; $k < 10; $k++) {
    for ($j = 0; $j < 1000; $j++) {
        $val = '';
        for ($i = 0; $i < 1000; $i++) {
            // each row: a random 10-char string, a random int, and one of the 10 keys
            $val .= '("' . generate_random_string() . '", ' . rand(0, 10000) . ', "' . ($keys[rand(0, 9)]) . '"),';
        }
        $val = rtrim($val, ',');
        $pdo->query('INSERT INTO jan_char VALUES ' . $val);
    }
    echo "\n" . ($k + 1) . ' million(s) rows inserted.';
}

For int tables, the bit ($keys[rand(0, 9)]) was replaced with just rand(0, 9), and for varchar tables, I used full US state names, without cutting or extending them to 6 characters. generate_random_string() generates a 10-character random string.

Then I ran in MySQL:

SET SESSION query_cache_type=0;

For the jan_int table:

SELECT count(*) FROM jan_int WHERE myindex = 5;

SELECT BENCHMARK(1000000000, (SELECT count(*) FROM jan_int WHERE myindex = 5));

For other tables, same as above, with myindex = 'califo' for char tables and myindex = 'california' for varchar tables.
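Spelled out for the char table (the varchar variant is analogous), those queries were:

SELECT count(*) FROM jan_char WHERE myindex = 'califo';

SELECT BENCHMARK(1000000000, (SELECT count(*) FROM jan_char WHERE myindex = 'califo'));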

Times of the BENCHMARK query on each table:

jan_int: 21.30 sec

jan_int_index: 18.79 sec

jan_char: 21.70 sec

jan_char_index: 18.85 sec

jan_varchar: 21.76 sec

jan_varchar_index: 18.86 sec

Regarding table & index sizes, here's the output of show table status from janperformancetest; (with a few columns not shown):

|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Name              | Engine | Version | Row_format | Rows    | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Collation              |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| jan_int           | InnoDB |      10 | Dynamic    | 9739094 |             43 |   422510592 |               0 |            0 |   4194304 |           NULL | utf8mb4_unicode_520_ci |  
| jan_int_index     | InnoDB |      10 | Dynamic    | 9740329 |             43 |   420413440 |               0 |    132857856 |   7340032 |           NULL | utf8mb4_unicode_520_ci |   
| jan_char          | InnoDB |      10 | Dynamic    | 9726613 |             51 |   500170752 |               0 |            0 |   5242880 |           NULL | utf8mb4_unicode_520_ci |  
| jan_char_index    | InnoDB |      10 | Dynamic    | 9719059 |             52 |   513802240 |               0 |    202342400 |   5242880 |           NULL | utf8mb4_unicode_520_ci |  
| jan_varchar       | InnoDB |      10 | Dynamic    | 9722049 |             53 |   521142272 |               0 |            0 |   7340032 |           NULL | utf8mb4_unicode_520_ci |   
| jan_varchar_index | InnoDB |      10 | Dynamic    | 9738381 |             49 |   486539264 |               0 |    202375168 |   7340032 |           NULL | utf8mb4_unicode_520_ci | 
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|

My conclusion is that there's no performance difference for this particular use case.


Amazing that in a 10-year-old question, this is one of only two answers that isn't just speculation and relies on actual benchmarks.
But your tables do not have a primary key, which in InnoDB actually is a sorted data structure. The speed of sorting integers versus strings should be different.
@Melkor Fair point that I use INDEX instead of PRIMARY KEY. I don't remember my reasoning - I probably assumed a PRIMARY KEY is just an INDEX with a uniqueness constraint. However, reading the section on how things are stored in InnoDB in federico-razzoli.com/primary-key-in-innodb, I think my results still apply to primary keys, and answer the question about the value-lookup performance difference. Also, your comment suggests looking at the performance of sorting algorithms, which do not apply to the use case I investigate, which is looking up values in a set.
A lookup operation also requires comparisons on the primary key field (like a binary search), where int should be a bit faster than varchar. But as your experiments suggest, it's not that obvious (or maybe it's because you didn't have a primary key, so the queries were all slower). I think it's the same thing for inserting and lookup.
@Melkor I believe there is no difference because the string is hashed into an integer. That makes lookups and storing/sorting in a B-tree/index much faster. I can't find any evidence of this in MySQL's documentation, though.
Steve McLeod

It's not about performance. It's about what makes a good primary key: unique and unchanging over time. You may think an entity such as a country code never changes over time and would be a good candidate for a primary key. But bitter experience shows that is seldom so.

INT AUTO_INCREMENT meets the "unique and unchanging over time" condition. Hence the preference.


True. One of my biggest databases has entries for Yugoslavia and the Soviet Union. I'm glad they're not primary keys.
@Steve, then why does ANSI SQL support syntax for ON UPDATE CASCADE?
Immutability is not a requirement of a key. In any case, surrogate keys sometimes change too. Nothing wrong with changing keys if you need to.
Paul, so you changed Soviet Union to Russia in your database? And pretend that the SU never existed? And all references to the SU now point to Russia?
@alga I was born in the SU, so I know what it is.
Charles Bretana

Depends on the length... If the varchar will be 20 characters and the int is 4, then if you use an int, your index will have FIVE times as many nodes per page of index space on disk... That means that traversing the index will require one fifth as many physical and/or logical reads.
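A rough back-of-envelope version of that claim (my numbers; assuming the default 16 KB InnoDB page and ignoring record and page overhead):

16384 bytes/page ÷ 4 bytes/key ≈ 4096 int keys per page
16384 bytes/page ÷ 20 bytes/key ≈ 819 varchar keys per page, i.e. roughly the 5x difference in fan-out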

So, if performance is an issue, given the opportunity, always use an integral non-meaningful key (called a surrogate) for your tables, and for Foreign Keys that reference the rows in these tables...

At the same time, to guarantee data consistency, every table where it matters should also have a meaningful non-numeric alternate key (or unique index) to ensure that duplicate rows cannot be inserted (duplicate based on meaningful table attributes).

For the specific use you are talking about (like state lookups), it really doesn't matter, because the size of the table is so small. In general there is no impact on performance from indices on tables with fewer than a few thousand rows...


Sure? Aren't most data formats row-based? There is other data besides the keys. Isn't a factor of 5 utopic?
@manuelSchneid3r, What? Utopic? No, the factor of 5 is not "utopic". It is just 20 divided by 4. And what does "data format row based" mean? Indices are not "row based", they are balanced tree structures.
@CharlesBretana indexes reference single rows in a table. I don't understand your second point about an additional unique key. A primary key is - by definition - a unique key used to identify a single row in a table.
@Sebi, I hope you don't think that the word unique implies that it is the ONLY key. It just means that only one row in the table can have that value. You can have multiple unique keys... for example, one unique key for a transaction might be the date, time, product, RegisterId, and dollar amount, possibly 100 bytes or more. Another might simply be the register receipt or invoice, say 4-8 bytes. Which will produce better performance when used in a query, join, or filter?
Timothy Khouri

Absolutely not.

I have done several... several... performance checks between INT, VARCHAR, and CHAR.

A 10 million record table with a PRIMARY KEY (unique and clustered) had the exact same speed and performance (and subtree cost) no matter which of the three I used.

That being said... use whatever is best for your application. Don't worry about the performance.


Meaningless without knowing how long the varchars were... If they were 100 bytes wide, then guaranteed you're not getting the same performance as a 4-byte int.
It would also help to know what database you are using and what version of the database. Performance tuning is almost always worked on and improved from version to version.
VARCHAR definitely matters for index size. And index size determines how much can fit in memory. And indexes in memory are far, far faster than those that aren't. It could be that for your 10m rows, you had 250MB of memory available for that index, and it was fine. But if you have 100m rows, you will be less fine in that memory.
Joel Coehoorn

For short codes, there's probably no difference. This is especially true as the table holding these codes is likely to be very small (a couple thousand rows at most) and doesn't change often (when was the last time we added a new US state?).

For larger tables with wider variation among the keys, this can be dangerous. Think about using e-mail address/user name from a User table, for example. What happens when you have a few million users and some of those users have long names or e-mail addresses? Now any time you need to join this table using that key it becomes much more expensive, as sketched below.
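A hypothetical sketch of that situation (table and column names are mine): the wide key is copied into every referencing row and into every index built on it.

CREATE TABLE users (
    email VARCHAR(254) PRIMARY KEY,       -- wide natural key
    name  VARCHAR(100) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE posts (
    post_id    INT AUTO_INCREMENT PRIMARY KEY,
    user_email VARCHAR(254) NOT NULL,     -- up to hundreds of bytes repeated per row, vs 4 bytes for an INT FK
    body       TEXT,
    INDEX (user_email),                   -- the wide key is duplicated again in this index
    FOREIGN KEY (user_email) REFERENCES users (email)
) ENGINE=InnoDB;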


Do you know for sure that this would be expensive? Or are you just guessing?
Of course it depends on the RDBMS implementation, but from what I understand most servers will keep a hash of the actual value for indexing purposes. Even so, and even if it's a relatively short hash (say, 10 bytes), it's still more work to compare two 10-byte hashes than two 4-byte ints.
NEVER use a long (wide) key for joins... But if it is the best representation of what is unique for the rows in the table, then there had better be a unique key (or index - which is the same thing) on the table using those natural values. Keys are not there for joins; you can join on anything your heart desires. Keys are there to ensure data consistency.
LeppyR64

As for the primary key: whatever physically makes a row unique should be chosen as the primary key.

For a reference as a foreign key, using an auto-incrementing integer as a surrogate is a nice idea for two main reasons. First, there's usually less overhead incurred in the join. Second, if you need to update the table that contains the unique varchar, the update has to cascade down to all the child tables and update all of them as well as the indexes, whereas with the int surrogate it only has to update the master table and its indexes (see the sketch below).
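A minimal sketch of that cascade cost (hypothetical tables, mine; assuming MySQL/InnoDB):

CREATE TABLE category (
    name VARCHAR(40) PRIMARY KEY
) ENGINE=InnoDB;

CREATE TABLE product (
    product_id    INT AUTO_INCREMENT PRIMARY KEY,
    category_name VARCHAR(40) NOT NULL,
    FOREIGN KEY (category_name) REFERENCES category (name)
        ON UPDATE CASCADE -- renaming a category rewrites every matching product row and its index entries
) ENGINE=InnoDB;

-- With an INT surrogate as category's PK instead, a rename would be a one-row update in category only.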

The drawback to using the surrogate is that you could possibly allow the meaning of the surrogate to change:

ex.
id value
1 A
2 B
3 C

Update 3 to D
id value
1 A
2 B
3 D

Update 2 to C
id value
1 A
2 C
3 D

Update 3 to B
id value
1 A
2 C
3 B

It all depends on what you really need to worry about in your structure and what means most.


Rick James

Common cases where a surrogate AUTO_INCREMENT hurts:

A common schema pattern is a many-to-many mapping:

CREATE TABLE map (
    id ... AUTO_INCREMENT,
    foo_id ...,
    bar_id ...,
    PRIMARY KEY(id),
    UNIQUE(foo_id, bar_id),
    INDEX(bar_id) );

Performance of this pattern is much better, especially when using InnoDB:

CREATE TABLE map (
    # No surrogate
    foo_id ...,
    bar_id ...,
    PRIMARY KEY(foo_id, bar_id),
    INDEX      (bar_id, foo_id) );

Why?

InnoDB secondary keys need an extra lookup; by moving the pair into the PK, that is avoided for one direction.

The secondary index is "covering", so it does not need the extra lookup.

This table is smaller because it gets rid of id and one index.
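To make the two directions concrete (example queries, mine):

SELECT bar_id FROM map WHERE foo_id = 123; -- served by the clustered PRIMARY KEY (foo_id, bar_id)
SELECT foo_id FROM map WHERE bar_id = 456; -- served by the covering INDEX (bar_id, foo_id), no extra lookup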

Another case (country):

country_id INT ...
-- versus
country_code CHAR(2) CHARACTER SET ascii

All too often the novice normalizes country_code into a 4-byte INT instead of using a 'natural', nearly-unchanging 2-byte string. Faster, smaller, fewer JOINs, more readable.


Herman J. Radtke III

At HauteLook, we changed many of our tables to use natural keys. We did experience a real-world increase in performance. As you mention, many of our queries now use fewer joins, which makes them more performant. We will even use a composite primary key if it makes sense. That being said, some tables are just easier to work with if they have a surrogate key.

Also, if you are letting people write interfaces to your database, a surrogate key can be helpful. The 3rd party can rely on the fact that the surrogate key will change only in very rare circumstances.


rodrigo-silveira

I faced the same dilemma. I made a DW (constellation schema) with 3 fact tables, Road Accidents, Vehicles in Accidents, and Casualties in Accidents, plus 60 dimension tables. The data includes all accidents recorded in the UK from 1979 to 2012. Altogether, about 20 million records.

Fact tables relationships:

+----------+          +---------+
| Accident |>--------<| Vehicle |
+-----v----+ 1      * +----v----+
     1|                    |1
      |    +----------+    |
      +---<| Casualty |>---+
         * +----------+ *

RDBMS: MySQL 5.6

Natively, the accident index is a varchar (numbers and letters), 15 characters long. I tried not to use surrogate keys, since the accident indexes would never change. On an i7 (8-core) computer, the DW became too slow to query after 12 million records were loaded, depending on the dimensions. After a lot of rework and adding bigint surrogate keys, I got an average 20% speed boost. Still a low performance gain, but a valid try. I'm working on MySQL tuning and clustering.


Sounds like you need to look into partitioning.
Volksman

The question is about MySQL, so I say there is a significant difference. If it were about Oracle (which stores numbers as strings - yes, I couldn't believe it at first) then not much difference.

Storage in the table is not the issue but updating and referring to the index is. Queries involving looking up a record based on its primary key are frequent - you want them to occur as fast as possible because they happen so often.

The thing is, a CPU deals with 4-byte and 8-byte integers natively, in silicon. It's REALLY fast for it to compare two integers - it happens in one or two clock cycles.

Now look at a string - it's made up of lots of characters (more than one byte per character these days). Comparing two strings for precedence can't be done in one or two cycles. Instead, the strings' characters must be iterated over until a difference is found. I'm sure there are tricks to make it faster in some databases, but that's irrelevant here, because an int comparison is done naturally and lightning fast in silicon by the CPU.

My general rule: every primary key should be an auto-incrementing INT, especially in OO apps using an ORM (Hibernate, DataNucleus, whatever) where there are lots of relationships between objects - they'll usually be implemented as a simple FK, and the ability of the DB to resolve those fast is important to your app's responsiveness.


Shadi Alnamrouti

Allow me to say yes, there is definitely a difference, taking into consideration the scope of performance (beyond the out-of-the-box definition):

1- Using a surrogate int is faster in the application because you do not need to use ToUpper(), ToLower(), ToUpperInvariant(), or ToLowerInvariant() in your code or in your query, and these 4 functions have different performance benchmarks. See Microsoft's performance rules on this. (performance of the application)

2- Using a surrogate int guarantees the key will not change over time. Even country codes may change; see Wikipedia for how ISO codes have changed over time. That would take lots of time to change the primary key across subtrees. (performance of data maintenance)

3- There seem to be issues with ORM solutions, such as NHibernate, when the PK/FK is not an int. (developer performance)


Dexygen

Not sure about the performance implications, but it seems a possible compromise, at least during development, would be to include both the auto-incremented integer "surrogate" key and your intended, unique, "natural" key. This would give you the opportunity to evaluate performance, as well as other possible issues, including the changeability of natural keys.


Vinod

As usual, there are no blanket answers. 'It depends!' - and I am not being facetious. My understanding of the original question was about keys on small tables - like Country (integer id or char/varchar code) - being foreign keys to a potentially huge table like an address/contact table.

There are two scenarios here when you want data back from the DB. The first is a list/search kind of query, where you want to list all the contacts with state and country codes or names (ids will not help, and hence will need a lookup). The other is a get scenario on the primary key, which shows a single contact record where the names of the state and country need to be shown.

For the latter (the get), it probably does not matter what the FK is based on, since we are bringing together tables for a single record, or a few records, via key reads. The former (search or list) scenario may be impacted by our choice. Since country must be shown (at least a recognizable code, and perhaps the search itself includes a country code), not having to join another table through a surrogate key can potentially improve performance (I am just being cautious here because I have not actually tested this, but it seems highly probable); it certainly helps with the search itself.

As codes are small in size - not more than 3 chars usually for country and state - it may be okay to use the natural keys as foreign keys in this scenario, as in the sketch below.
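A hypothetical version of that list query (my own table and column names): with natural-key FKs it needs no joins at all.

-- Natural-key FKs: codes are stored directly on the contact rows
SELECT contact_name, state_code, country_code
FROM contact
WHERE country_code = 'US';

-- Surrogate FKs: the same listing needs two joins
SELECT c.contact_name, s.state_code, co.country_code
FROM contact c
JOIN state   s  ON s.state_id    = c.state_id
JOIN country co ON co.country_id = c.country_id
WHERE co.country_code = 'US';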

In the other scenario, where keys depend on longer varchar values and perhaps on larger tables, the surrogate key probably has the advantage.