
When are you truly forced to use UUID as part of the design?

I don't really see the point of UUID. I know the probability of a collision is effectively nil, but effectively nil is not even close to impossible.

Can somebody give an example where you have no choice but to use UUID? From all the uses I've seen, I can see an alternative design without UUID. Sure the design might be slightly more complicated, but at least it doesn't have a non-zero probability of failure.

UUID smells like global variables to me. There are many ways global variables make for simpler design, but it's just lazy design.

Everything has a non-zero chance of failure. I would concentrate on far more likely to occur problems (i.e. almost anything you can think of) than the collision of UUIDs
Actually, "effectively nil" is very close to impossible.
Nope, it's actually infinitely far from impossible
@Pyrolistical when you start throwing around words like "infinity", you've left the world of software development. Computer science theory is an entirely different discussion than writing real software.
I'll close, mostly because Git's SHA-1 has convinced me of the goodness of a hash

Bob Aman

I wrote the UUID generator/parser for Ruby, so I consider myself to be reasonably well-informed on the subject. There are four major UUID versions:

Version 4 UUIDs are essentially just 16 bytes of randomness pulled from a cryptographically secure random number generator, with some bit-twiddling to identify the UUID version and variant. These are extremely unlikely to collide, but it could happen if a PRNG is used or if you just happen to have really, really, really, really, really bad luck.

Version 5 and Version 3 UUIDs use the SHA1 and MD5 hash functions respectively, to combine a namespace with a piece of already unique data to generate a UUID. This will, for example, allow you to produce a UUID from a URL. Collisions here are only possible if the underlying hash function also has a collision.

Version 1 UUIDs are the most common. They use the network card's MAC address (which, unless spoofed, should be unique), plus a timestamp, plus the usual bit-twiddling to generate the UUID. On a machine that doesn't have a MAC address, the 6 node bytes are generated with a cryptographically secure random number generator. If two UUIDs are generated in sequence fast enough that the timestamp matches the previous UUID, the timestamp is incremented by 1. Collisions should not occur unless one of the following happens:

- The MAC address is spoofed.
- One machine running two different UUID-generating applications produces UUIDs at the exact same moment.
- Two machines without a network card, or without user-level access to the MAC address, are given the same random node sequence and generate UUIDs at the exact same moment.
- We run out of bytes to represent the timestamp and roll over back to zero.
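All four versions described above are available in, for example, Python's standard uuid module; a quick sketch:

```python
import uuid

# Version 4: 122 random bits from a CSPRNG plus version/variant bits.
u4 = uuid.uuid4()

# Versions 5 and 3: SHA-1 / MD5 of a namespace plus a name; the same
# inputs always produce the same UUID.
u5 = uuid.uuid5(uuid.NAMESPACE_URL, "http://example.com/")
u3 = uuid.uuid3(uuid.NAMESPACE_URL, "http://example.com/")

# Version 1: MAC address plus a 60-bit timestamp of 100 ns intervals
# (random node bytes are substituted when no MAC address is available).
u1 = uuid.uuid1()

print(u4.version, u5.version, u3.version, u1.version)  # -> 4 5 3 1
```

Note that the version 5/3 call is deterministic: generating the UUID for the same URL twice yields the same value, which is exactly the "already unique data" property the answer describes.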

Realistically, none of these events occur by accident within a single application's ID space. Unless you're accepting IDs on, say, an Internet-wide scale, or with an untrusted environment where malicious individuals might be able to do something bad in the case of an ID collision, it's just not something you should worry about. It's critical to understand that if you happen to generate the same version 4 UUID as I do, in most cases, it doesn't matter. I've generated the ID in a completely different ID space from yours. My application will never know about the collision so the collision doesn't matter. Frankly, in a single application space without malicious actors, the extinction of all life on earth will occur long before you have a collision, even on a version 4 UUID, even if you're generating quite a few UUIDs per second.

Also, 2^64 * 16 bytes is 256 exabytes. As in, you would need to store 256 exabytes' worth of IDs before you had a 50% chance of an ID collision in a single application space.
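A quick sanity check of those figures, treating all 128 bits as random (as the 2^64 figure implicitly does) and using the standard birthday approximation P ≈ 1 - e^(-n²/2N). The approximation gives about 39% at exactly 2^64 draws; the true 50% point is a small constant factor higher (about 1.18 × 2^64), so the round figure above is right to within that factor:

```python
import math

def collision_probability(n, bits=128):
    """Birthday approximation: P(collision) ~ 1 - exp(-n^2 / 2N), N = 2**bits."""
    return 1.0 - math.exp(-(n * n) / (2.0 * 2.0 ** bits))

n = 2 ** 64
print(round(collision_probability(n), 2))  # -> 0.39, roughly a coin flip
print((n * 16) // 2 ** 60)                 # -> 256 (exabytes of 16-byte IDs)
```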


@Chamnap I wrote UUIDTools. UUIDs can be converted to an integer or their raw byte form, and would be substantially smaller as a binary.
@eric.frederich Do let me know if that happens.
@BobAman in 1990 I had 12 UUID collisions on an Aegis system; it turned out to be a faulty FPU, but I thought I would let you know it can happen (it hasn't happened other than that in the last 30+ years of programming, though). Nice explanation too, btw; this is now my de facto UUID reference post to give people :)
@GMasucci Good point. All bets are off if the hardware is bad or if someone decides /dev/random should just return 4.
@kqr You're absolutely right that it's the birthday problem, however for an n-bit code, the birthday paradox problem reduces down to 2^(n/2), which in this case is 2^64, as stated in my answer.
Michael Burr

What UUIDs buy you, and what is very difficult to get any other way, is a unique identifier obtained without having to consult or coordinate with a central authority. The general problem of getting such a thing without some sort of managed infrastructure is the problem that UUIDs solve.

I've read that according to the birthday paradox the chance of a UUID collision occurring is 50% once 2^64 UUIDs have been generated. Now 2^64 is a pretty big number, but a 50% chance of collision seems far too risky (for example, how many UUIDs need to exist before there's even a 5% chance of collision? Even that seems like too large a probability).

The problem with that analysis is twofold:

1. UUIDs are not entirely random - there are major components of the UUID that are time- and/or location-based. So to have any real chance at a collision, the colliding UUIDs need to be generated at the exact same time from different UUID generators. I'd say that while there is a reasonable chance that several UUIDs might be generated at the same time, there's enough other gunk (including location info or random bits) to make the likelihood of a collision between this very small set of UUIDs nearly impossible.

2. Strictly speaking, UUIDs only need to be unique among the set of other UUIDs that they might be compared against. If you're generating a UUID to use as a database key, it doesn't matter if somewhere else, in an evil alternate universe, the same UUID is being used to identify a COM interface. Just like it'll cause no confusion if there's someone (or something) else named "Michael Burr" on Alpha Centauri.


Concrete example? COM/DCE UUIDs - there's no authority for assigning them, and no one wanted to take the responsibility and/or no one wanted there to be an authority. Another: distributed databases with unreliable links and no master.
More concrete example - a banking application. It is installed in multiple data centres, one for each country, with each data centre having its own DB. The multiple installations exist to comply with different regulations. There can be only one customer record in the entire set for every customer.
(Continuation of previous comment) You need to have a central server to generate the customer ID for overall reporting and tracking purposes (across all installations), or have the individual installations generate UUIDs to serve as customer IDs (obviously the UUIDs cannot be used as-is in reports).
By the time you've got a 50% chance of duplication, you're already drowning. Somebody point out the volume required to get to a 0.0000001% chance. Multiple auto-increment databases, starting at offsets 1 through n and each incrementing by n, solve the same problem effectively.
The odds of getting a duplicate are FAR, FAR lower than the odds of the central authority failing in some mission-critical way
DanSingerman

Everything has a non-zero chance of failure. I would concentrate on far more likely to occur problems (i.e. almost anything you can think of) than the collision of UUIDs


Added as an answer at Pyrolistical's request
Rex M

An emphasis on "reasonably" - or, as you put it, "effectively": good enough is how the real world works. The amount of computational work involved in covering the gap between "practically unique" and "truly unique" is enormous. Uniqueness is a curve with diminishing returns: at some point on that curve there is a line between where "unique enough" is still affordable and where the curve climbs VERY steeply. The cost of adding more uniqueness becomes quite large; infinite uniqueness would have infinite cost.

UUID/GUID is, relatively speaking, a computationally quick and easy way to generate an ID which can be reasonably assumed to be universally unique. This is very important in many systems which need to integrate data from previously unconnected systems. For example: if you have a Content Management System which runs on two different platforms, but at some point need to import the content from one system into the other. You don't want IDs to change, so your references between data from system A remain intact, but you don't want any collisions with data created in system B. A UUID solves this.
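A hypothetical sketch of that CMS scenario: two previously unconnected stores keyed by UUID can be merged with no ID rewriting, so existing references stay intact (the store names and content strings here are made up for illustration):

```python
import uuid

# Two content stores that were never connected, each keyed by UUID.
system_a = {uuid.uuid4(): "article originally created on system A"}
system_b = {uuid.uuid4(): "article originally created on system B"}

# Import B's content into A: A's existing IDs, and every reference to
# them, remain valid, and B's keys (almost certainly) don't clash.
system_a.update(system_b)
assert len(system_a) == 2  # nothing was overwritten

# With auto-increment integer keys, both stores would almost certainly
# contain an id 1, forcing one side's references to be rewritten on import.
```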


Solution. Don't be lazy and update the references. Do it right.
This has nothing to do with laziness - if the policy is that an ID for an item is considered permanent and immutable, then the ID doesn't change. So you want the IDs to be unique from the start, and you want to do that without requiring all the systems to be connected in some way from the start.
You need context then. If you have two groups of unique ids that may conflict, you need a high level of context to separate them
Or, you could just build the system to use UUIDs and ship it, sell it, make a million dollars and never hear a single complaint that two IDs collided because it won't happen.
Rob W

It is never absolutely necessary to create a UUID. It is, however, convenient to have a standard where offline users can each generate a key to something with a very low probability of collision.

This can aid in database replication resolution etc...

It would be easy for online users to generate unique keys for something without the overhead or possibility of collision, but that is not what UUIDs are for.

Anyways, a word on the probability of collision, taken from Wikipedia:

To put these numbers into perspective, one's annual risk of being hit by a meteorite is estimated to be one chance in 17 billion, equivalent to the odds of creating a few tens of trillions of UUIDs in a year and having one duplicate. In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%.
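Plugging Wikipedia's quoted numbers into the standard birthday approximation (a sketch, using the 122 random bits of a version 4 UUID) reproduces the ballpark: the result comes out slightly above 50%, because "100 years" rounds up the roughly 85 years needed for exactly 50%:

```python
import math

RANDOM_BITS = 122                  # random bits in a version 4 UUID
N = 2.0 ** RANDOM_BITS

rate = 10 ** 9                     # one billion UUIDs per second
seconds = 100 * 365 * 24 * 3600    # one hundred years (ignoring leap days)
n = rate * seconds

# Birthday approximation: P(at least one duplicate) ~ 1 - exp(-n^2 / 2N)
p = 1.0 - math.exp(-(n * n) / (2.0 * N))
print(round(p, 2))                 # -> 0.61, the same ballpark as "about 50%"
```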


Simple, don't let offline users generate keys. Have the temporary keys assigned until the system goes online so the real keys can be generated.
This is a very helpful answer in my opinion... I was going to offer some sort of analogy to the probability myself, as it seemed the OP didn't quite grasp its meaning, but you seem to have done that.
I quite understand the probability is effectively nil. To me the use of UUID is lazy design, and I just wanted to see if you could always avoid it
That's fair enough, as long as you see that the low probability needs to be considered only in the most extreme circumstances, as I'll now presume you do.
user21714

There is also a non-zero probability that every particle in your body will simultaneously tunnel through the chair you're sitting on and you will suddenly find yourself sitting on the floor.

Do you worry about that?


Of course not, that's not something I can control, but designs I can.
@Pyrolistical Is that really, I mean REALLY the reason you don't worry about that? Then you're pretty strange. And moreover, you're not right. You can control it. If you gain a few pounds, you significantly diminish the probability of such an event. Do you consider you should gain weight, then? :-)
Johnno Nolan

A classic example is when you are replicating between two databases.

DB(A) inserts a record with int ID 10 and at the same time DB(B) creates a record with int ID 10. This is a collision.

With UUIDs this will not happen as they will not match. (almost certainly)
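A toy sketch of that replication scenario, with in-memory dicts standing in for the two databases:

```python
import uuid

# Two replicas accept inserts independently while disconnected.
# With auto-increment style integer keys, both hand out id 10, and the
# rows collide when the databases are later reconciled.
replica_a = {10: "record created on A"}
replica_b = {10: "record created on B"}
assert set(replica_a) & set(replica_b)       # collision: key 10 is in both

# With UUID keys, the same workflow (almost certainly) never collides.
replica_a = {uuid.uuid4(): "record created on A"}
replica_b = {uuid.uuid4(): "record created on B"}
assert not set(replica_a) & set(replica_b)   # safe to merge
```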


Ok, then make DB A use even ID and DB B use odd IDs. Done, no UUID.
With three DB's, use 3 multiples LOL
If you use the 2/3/whatever multiples, what happens when you add a new server into the mix later? You have to coordinate a switch so that you're using n+1 multiples on the new server, and move all the old servers over to the new algorithm, and you have to shut everything down while you're doing this to avoid collisions during the algorithm switch. Or... you could just use UUIDs like EVERYONE ELSE.
It's even worse than that, because how would you differentiate between multiples of 2 and multiples of 4? Or multiples of 3 vs. multiples of 6? In fact, you'd have to stick with multiples of prime numbers. Blech! Just use UUID, it works. Microsoft, Apple, and countless others rely on them and trust them.
@sidewinderguy, in GUID we trust! :)
Donal Fellows

I have a scheme for avoiding UUIDs. Set up a server somewhere and have it so that every time some piece of software wants a universally unique identifier, they contact that server and it hands one out. Simple!

Except that there are some real practical problems with this, even if we ignore outright malice. In particular, that server can fail or become unreachable from part of the internet. Dealing with server failure requires replication, and that's very difficult to get right (see the literature on the Paxos algorithm for why consensus building is awkward) and is pretty slow too. Moreover, if all the servers are unreachable from a particular part of the 'net, none of the clients connected to that subnet will be able to do anything because they'll all be waiting for new IDs.

So... use a simple probabilistic algorithm to generate them that is unlikely to fail during the lifetime of the Earth, or (fund and) build a major infrastructure that is going to be a deployment PITA and have frequent failures. I know which one I'd go for.


Actually, the entire point of the invention of UUIDs was to avoid your approach. If you research the history of UUIDs, you'll see it derives from the earliest experiments in creating sophisticated and meaningful networks of computers. They knew networks are inherently unreliable and complicated. UUIDs were an answer to the question of how to coordinate data between computers when you knew they could not be in constant communication.
@BasilBourque I was using sarcasm in that first paragraph, in case it wasn't obvious.
Community

I don't get all the talk about the likelihood of collision. I don't care about collisions; I care about performance, though.

https://dba.stackexchange.com/a/119129/33649

UUIDs are a performance disaster for very large tables. (200K rows is not "very large".) Your #3 is really bad when the CHARACTER SET is utf8 -- CHAR(36) occupies 108 bytes!

UUIDs (GUIDs) are very "random". Using them as either a UNIQUE or a PRIMARY key on large tables is very inefficient, because of having to jump around the table/index each time you INSERT a new UUID or SELECT by UUID. When the table/index is too large to fit in cache (see innodb_buffer_pool_size, which must be smaller than RAM, typically 70%), the 'next' UUID may not be cached, hence a slow disk hit. When the table/index is 20 times as big as the cache, only 1/20th (5%) of hits are cached -- you are I/O-bound.

So, don't use UUIDs unless either you have "small" tables, or you really need them because of generating unique IDs from different places (and have not figured out another way to do it). More on UUIDs: http://mysql.rjweb.org/doc.php/uuid (it includes functions for converting between standard 36-char UUIDs and BINARY(16)).

Having both a UNIQUE AUTO_INCREMENT and a UNIQUE UUID in the same table is a waste: when an INSERT occurs, all unique/primary keys must be checked for duplicates, and either unique key is sufficient for InnoDB's requirement of having a PRIMARY KEY.

BINARY(16) (16 bytes) is somewhat bulky (an argument against making it the PK), but not that bad. The bulkiness matters when you have secondary keys: InnoDB silently tacks the PK onto the end of each secondary key. The main lesson here is to minimize the number of secondary keys, especially for very large tables. For comparison: INT UNSIGNED is 4 bytes with a range of 0..4 billion; BIGINT is 8 bytes.
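The text-versus-binary size difference described above is easy to see in Python; str(uuid) is the 36-character form that a CHAR(36) column stores, while uuid.bytes is the 16-byte form suited to BINARY(16):

```python
import uuid

u = uuid.uuid4()

text_form = str(u)     # 36 characters, e.g. for a CHAR(36) column
binary_form = u.bytes  # 16 raw bytes, e.g. for a BINARY(16) column

assert len(text_form) == 36
assert len(binary_form) == 16
assert uuid.UUID(bytes=binary_form) == u  # round-trips losslessly
```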


Mirko Klemm

If you just look at the alternatives, e.g. for a simple database application having to query the database every time before you create a new object, you will soon find that using UUIDs can effectively reduce the complexity of your system. Granted - if you use int keys, they are 32 bits, which store in a quarter of the space of a 128-bit UUID. Granted - UUID generation algorithms take more computational power than simply incrementing a number. But - who cares? The overhead of managing an "authority" to assign otherwise-unique numbers easily outweighs that by orders of magnitude, depending on your intended uniqueness ID space.


Johnno Nolan

On UUID==lazy design

I disagree; it's about picking your fights. If a duplicate UUID is statistically impossible and the maths is proven, then why worry? Spending time designing around your small-N UUID-generating system is impractical; there are always a dozen other ways you can improve your system.


Paul Tomblin

At my last job, we were getting objects from third parties that were uniquely identified with UUID. I put in a UUID->long integer lookup table and used long integer as my primary keys because it was way faster that way.
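A minimal sketch of that lookup-table approach (the helper name internal_key is made up for illustration): external UUIDs map to compact sequential integers that serve as the internal primary key.

```python
import itertools
import uuid

_next_id = itertools.count(1)
_uuid_to_pk = {}

def internal_key(external_uuid):
    """Return the stable integer key for a third-party UUID,
    assigning the next sequential id on first sight."""
    if external_uuid not in _uuid_to_pk:
        _uuid_to_pk[external_uuid] = next(_next_id)
    return _uuid_to_pk[external_uuid]

u = uuid.uuid4()
assert internal_key(u) == 1
assert internal_key(u) == 1              # same UUID, same integer key
assert internal_key(uuid.uuid4()) == 2   # new UUID, next integer
```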


Yeah sure, a third party forcing you to use UUIDs is another issue I don't want to get into. I'm assuming you have control over whether to use UUIDs or not.
Well, a 128-bit integer is actually what a UUID is. It's merely shown as a string for human consumption. Sometimes it may be transmitted that way, but for storage and indexing it will certainly be faster in integer form, as you found.
Davy8

Using the version 1 algorithm, it seems that a collision is impossible under the constraint that fewer than 10 UUIDs per millisecond are generated from the same MAC address

Conceptually, the original (version 1) generation scheme for UUIDs was to concatenate the UUID version with the MAC address of the computer that is generating the UUID, and with the number of 100-nanosecond intervals since the adoption of the Gregorian calendar in the West. In practice, the actual algorithm is more complicated. This scheme has been criticized in that it is not sufficiently 'opaque'; it reveals both the identity of the computer that generated the UUID and the time at which it did so.

Someone correct me if I misinterpreted how it works
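The timestamp and node fields described above can be read back out of a version 1 UUID with Python's uuid module, which illustrates the "not sufficiently opaque" criticism; a small sketch:

```python
import uuid
from datetime import datetime, timedelta, timezone

u = uuid.uuid1()

# The 60-bit timestamp counts 100-nanosecond intervals since
# 1582-10-15 UTC, the date of the Gregorian calendar reform.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)
created_at = GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)

# The 48-bit node field is the MAC address (or random bytes if none).
print(created_at, hex(u.node))
```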


There are many versions, and some software systems (Java, for example) can't use version 1, as there is no pure-Java way to access the MAC address.
Regarding Java's inability to obtain the MAC address: Not entirely true. There are work-arounds for this. You can manually set the MAC address used by the generator via a config file. You can also call out to ifconfig and parse the output. The Ruby UUID generator that I wrote uses both approaches.
Also, as mentioned in my answer, if you can't obtain a MAC address for a version 1 UUID, you use 6 random bytes instead, as per section 4.5 of RFC 4122. So even if you don't want to use either of the two workarounds for Java, you can still generate a valid version 1 UUID.
MS GUIDs are just random numbers. They don't have any MAC part anymore, because that made it possible to reverse-engineer the MAC address of the server (which turned out to be very dangerous).
Iain Duncan

To those saying that UUIDs are bad design because they could (at some ridiculously small probability) collide, while your DB-generated keys won't: you know the chance of human error causing a collision on your DB-generated keys, because of some unforeseen need, is FAR, FAR, FAR higher than the chance of a UUID4 collision. We know that if the db is recreated it will start ids at 1 again, and how many of us have had to recreate a table when we were sure we would never ever need to? I'd put my money on UUID safeness when stuff starts going wrong with unknown-unknowns any day.


StephenS

Aside from cases where you have to use someone else's API that demands a UUID, of course there's always another solution. But will those alternatives solve all the problems that UUIDs do? Will you end up adding more layers of hacks, each to solve a different problem, when you could have solved all of them at once?

Yes, it is theoretically possible for UUIDs to collide. As others have noted, it's ridiculously unlikely to the point that it's just not worth considering. It's never happened to date and most likely never will. Forget about it.

The most "obvious" way to avoid collisions is to let a single server generate unique IDs on every insert, which obviously creates serious performance problems and doesn't solve the offline generation problem at all. Oops.

The other "obvious" solution is a central authority that hands out blocks of unique numbers in advance, which is essentially what UUID V1 does by using the MAC address of the generating machine (via the IEEE OUI). But duplicate MAC addresses do happen because every central authority screws up eventually, so in practice this is far more likely than a UUID V4 collision. Oops.

The best argument against using UUIDs is that they're "too big", but a (significantly) smaller scheme will inevitably fail to solve the most interesting problems; UUIDs' size is an inherent side effect of their usefulness at solving those very problems.

It's possible your problem isn't big enough to need what UUIDs offer, and in that case, feel free to use something else. But if your problem grows unexpectedly (and most do), you'll end up switching later--and kick yourself for not using them in the first place. Why design for failure when it's just as easy to design for success instead?


keyser

UUIDs embody all of the bad coding practices associated with global variables, only worse, since they are superglobal variables which can be distributed over different pieces of kit.

Recently hit such an issue when a printer was replaced with an exact replacement model, and found that none of the client software would work.


Glad we live in a society which still focuses on facts as opposed to random opinions, otherwise all of us on stack overflow would be out of jobs. :)
