ChatGPT解决这个技术问题 Extra ChatGPT

Advantages and disadvantages of GUID / UUID database keys

I've worked on a number of database systems in the past where moving entries between databases would have been made a lot easier if all the database keys had been GUID / UUID values. I've considered going down this path a few times, but there's always a bit of uncertainty, especially around performance and un-read-out-over-the-phone-able URLs.

Has anyone worked extensively with GUIDs in a database? What advantages would I get by going that way, and what are the likely pitfalls?

Jeff has a post about it "Primary Keys: IDs versus GUIDs".
can also use Hi-Lo for remote clients: stackoverflow.com/questions/282099/whats-the-hi-lo-algorithm
Updated location for Jeff Atwood's post about "Primary Keys: IDs versus GUIDs." Thanks to @jfs for the reference.

C
Cory Klein

Advantages:

Can generate them offline.

Makes replication trivial (as opposed to int's, which makes it REALLY hard)

ORM's usually like them

Unique across applications. So We can use the PK's from our CMS (guid) in our app (also guid) and know we are NEVER going to get a clash.

Disadvantages:

Larger space use, but space is cheap(er)

Can't order by ID to get the insert order.

Can look ugly in a URL, but really, WTF are you doing putting a REAL DB key in a URL!? (This point disputed in comments below)

Harder to do manual debugging, but not that hard.

Personally, I use them for most PK's in any system of a decent size, but I got "trained" on a system which was replicated all over the place, so we HAD to have them. YMMV.

I think the duplicate data thing is rubbish - you can get duplicate data however you do it. Surrogate keys are usually frowned upon where ever I've been working. We DO use the WordPress-like system though:

unique ID for the row (GUID/whatever). Never visible to the user.

public ID is generated ONCE from some field (e.g. the title - make it the-title-of-the-article)

UPDATE: So this one gets +1'ed a lot, and I thought I should point out a big downside of GUID PK's: Clustered Indexes.

If you have a lot of records, and a clustered index on a GUID, your insert performance will SUCK, as you get inserts in random places in the list of items (thats the point), not at the end (which is quick)

So if you need insert performance, maybe use a auto-inc INT, and generate a GUID if you want to share it with someone else (ie, show it to a user in a URL)


[WTF are you doing putting a REAL DB key in a URL!?] Not sure why that bothers you. What else would you use? Look at Stack Overflow... It has IDENTITY values in the URL all over the place, and it works just fine. Using DB keys in URLs doesn't prevent you from enforcing security.
No, it doesn't, but things like SEO are usually better if there isn't a key in it - especially something as long as a GUID. Of course, it can be worked around easily, so I gues that was a bit of an over sweeping statement
Good answer, it'd be nice if you also add information about performance disadvantages of using GUIDs; e.g. joining, sorting, and indexing by them will all be slower than using integers. Guids are fantastic, but they come at a cost which can be a pain when performance is critical.
Keep one thing in mind, people often change page, question, forum titles. For SEO it's GOOD to have something like a small ID in the URL so that if the title changes you still know where to forward people coming from an OLD URL. example.com/35/old-and-busted just became example.com/35/new-hotness and you're app can just check the title and forward the user on with a 301.
Indexing a GUID is expensive and slow, which makes them really poor candidates for primary keys.
A
Andrei Rînea

Why doesn't anyone mention performance? When you have multiple joins, all based on these nasty GUIDs the performance will go through the floor, been there :(


Can you elaborate on this as am in the situation where i need to introduce UUID (or similar) , but am concerned about using them as Primary Key.
UUIDs are only 4 times the size of integers... (if your database has a UUID type)
T
Troels Arvin

@Matt Sheppard:

Say you have a table of customers. Surely you don't want a customer to exist in the table more than once, or lots of confusion will happen throughout your sales and logistics departments (especially if the multiple rows about the customer contain different information).

So you have a customer identifier which uniquely identifies the customer and you make sure that the identifier is known by the customer (in invoices), so that the customer and the customer service people have a common reference in case they need to communicate. To guarantee no duplicated customer records, you add a uniqueness-constraint to the table, either through a primary key on the customer identifier or via a NOT NULL + UNIQUE constraint on the customer identifier column.

Next, for some reason (which I can't think of), you are asked to add a GUID column to the customer table and make that the primary key. If the customer identifier column is now left without a uniqueness-guarantee, you are asking for future trouble throughout the organization because the GUIDs will always be unique.

Some "architect" might tell you that "oh, but we handle the real customer uniqueness constraint in our app tier!". Right. Fashion regarding that general purpose programming languages and (especially) middle tier frameworks changes all the time, and will generally never out-live your database. And there is a very good chance that you will at some point need to access the database without going through the present application. == Trouble. (But fortunately, you and the "architect" are long gone, so you will not be there to clean up the mess.) In other words: Do maintain obvious constraints in the database (and in other tiers, as well, if you have the time).

In other words: There may be good reasons to add GUID columns to tables, but please don't fall for the temptation to make that lower your ambitions for consistency within the real (==non-GUID) information.


Hear hear! Love your SQL comparison page btw. Extremely useful. The only thing I miss is a changelog.
I think this answer needs some clarification: this assumes that UUIDs are never used as primary keys. I don't know where this assumption comes from, but I have yet to see a system that doesn't allow you to use them as such. I know it's an old answer, I suppose the advantages of using UUIDs in distributed systems weren't as widely understood back then (?).
M
Mendelt

The main advantages are that you can create unique id's without connecting to the database. And id's are globally unique so you can easilly combine data from different databases. These seem like small advantages but have saved me a lot of work in the past.

The main disadvantages are a bit more storage needed (not a problem on modern systems) and the id's are not really human readable. This can be a problem when debugging.

There are some performance problems like index fragmentation. But those are easilly solvable (comb guids by jimmy nillson: http://www.informit.com/articles/article.aspx?p=25862 )

Edit merged my two answers to this question

@Matt Sheppard I think he means that you can duplicate rows with different GUIDs as primary keys. This is an issue with any kind of surrogate key, not just GUIDs. And like he said it is easilly solved by adding meaningfull unique constraints to non-key columns. The alternative is to use a natural key and those have real problems..


T
Troels Arvin

GUIDs may cause you a lot of trouble in the future if they are used as "uniqifiers", letting duplicated data get into your tables. If you want to use GUIDs, please consider still maintaining UNIQUE-constraints on other column(s).


This is the heart of the problem: Introducing a GUID makes any row unique. But the non-artificial parts of the rows may suddenly contain duplicates (several versions of the truth).
+1 to compensate. I see what you mean, but it's badly expressed.
W
WIDBA

One other small issue to consider with using GUIDS as primary keys if you are also using that column as a clustered index (a relatively common practice). You are going to take a hit on insert because of the nature of a guid not begin sequential in anyway, thus their will be page splits, etc when you insert. Just something to consider if the system is going to have high IO...


c
commadelimited

primary-keys-ids-versus-guids

The Cost of GUIDs as Primary Keys (SQL Server 2000)

Myths, GUID vs. Autoincrement (MySQL 5)

This is realy what you want.

UUID Pros

Unique across every table, every database, every server

Allows easy merging of records from different databases

Allows easy distribution of databases across multiple servers

You can generate IDs anywhere, instead of having to roundtrip to the database

Most replication scenarios require GUID columns anyway

GUID Cons

It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful

Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')

The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes


A
Antti Haapala -- Слава Україні

There is one thing that is not really addressed, namely using random (UUIDv4) IDs as primary keys will harm the performance of the primary key index. It will happen whether or not your table is clustered around the key.

RDBMs usually ensure the uniqueness of the primary keys, and ensure the lookups by a key, in a structure called BTree, which is a search tree with a large branching factor (a binary search tree has branching factor of 2). Now, a sequential integer ID would cause the inserts to occur just one side of the tree, leaving most of the leaf nodes untouched. Adding random UUIDs will cause the insertions to split leaf nodes all over the index.

Likewise if the data stored is mostly temporal, it is often the case that the most recent data needs to be accessed and joined against the most. With random UUIDs the patterns will not benefit from this, and will hit more index rows, thereby needing more of the index pages in memory. With sequential IDs if the most-recent data is needed the most, the hot index pages would require less RAM.


This appears to be a non-issue with Postgres. postgresql.org/message-id/…
@mackstann thanks for reminding about this, I remember a reading a page recently that actually had metrics to prove otherwise
@mackstann I can't find the proper link, but here's another recent one from depesz with benchmarks: depesz.com/2020/02/19/why-im-not-fan-of-uuid-datatype
M
Md. Samim hossain

Advantages:

UUID values are unique between tables and databases. Thats why it can be merge rows between two databases or distributed databases.

UUID is more safer to pass through url than integer type data. If one pass UUID through url, attackers can't guess the next id.But if we pass Integer type such as 10, then attackers can guess the next id is 11 then 12 etc.

UUID can generate offline.


J
J Scott

One thing not mentioned so far: UUIDs make it much harder to profile data

For web apps at least, it's common to access a resource with the id in the url, like stackoverflow.com/questions/45399. If the id is an integer, this both

provides information about the number of questions (ie September 5th, 2008, the 45,399th question was asked)

provides a leverage point to iterate through questions (what happens when I increment that by 1? I open the next asked question)

From the first point, I can combine the timestamp from the question and the number to profile how frequently questions are asked and how that changes over time. this matters less on a site like Stack Overflow, with publicly available information, but, depending on context, this may expose sensitive information.

For example, I am a company that offers customers a permissions gated portal. the address is portal.com/profile/{customerId}. If the id is an integer, you could profile the number of customers regardless of being able to see their information by querying for lastKnownCustomerCount + 1 regularly, and checking if the result is 404 - NotFound (customer does not exist) or 403 - Forbidden (customer does exist, but you do not have access to view).

UUIDs non-sequential nature mitigate these issues. This isn't a garunted to prevent profiling, but it's a start.