Generating v5 UUID. What is name and namespace?

C

Community

Type 3 and Type 5 UUIDs are just a technique of stuffing a hash into a UUID:

Type 1: stuffs MAC address+datetime into 128 bits

Type 3: stuffs an MD5 hash into 128 bits

Type 4: stuffs random data into 128 bits

Type 5: stuffs an SHA1 hash into 128 bits

Type 6: unofficial idea for sequential UUIDs

Edit: Unofficial type 6 now has an official rfc

An SHA1 hash outputs 160 bits (20 bytes); the result of the hash is converted into a UUID.

With a 20-byte digest from SHA1:

SHA1 Digest:   74738ff5 5367 e958 1aee 98fffdcd1876 94028007
UUID (v5):     74738ff5-5367-5958-9aee-98fffdcd1876
                             ⭡    ⬑first two bits set to 1 and 0, respectively
                             ╰─low nibble is set to 5, to indicate type 5

What do I hash?

You're probably wondering what is it that I'm supposed to hash. Basically you hash the concatenation of:

sha1(NamespaceUUID+AnyString);

You prefix your string with a so-called namespace to prevent name conflicts.

The UUID RFC pre-defines four namespaces for you:

NameSpace_DNS: {6ba7b810-9dad-11d1-80b4-00c04fd430c8}

NameSpace_URL: {6ba7b811-9dad-11d1-80b4-00c04fd430c8}

NameSpace_OID: {6ba7b812-9dad-11d1-80b4-00c04fd430c8}

NameSpace_X500:{6ba7b814-9dad-11d1-80b4-00c04fd430c8}

So, you could hash together:

StackOverflowDnsUUID = sha1(Namespace_DNS + "stackoverflow.com");
StackOverflowUrlUUID = sha1(Namespace_URL + "stackoverflow.com");

The RFC then defines how to:

take the 160 bits from SHA1

and convert it into 128 bits of a UUID

The basic gist is to only take the first 128 bits, stuff a 5 in the type record, and then set the first two bits of the clock_seq_hi_and_reserved section to 1 and 0, respectively.

More examples

Now that you have a function that generates a so-called Name , you can have the function (in pseudo-code):

UUID NameToUUID(UUID NamespaceUUID, String Name)
{
    //Note: All code on stackoverflow is public domain - no attribution required.

    Byte[] hash = sha1(NamespaceUUID.ToBytes() + Name.ToBytes());
    Uuid result;

    //Copy first 16-bytes of the hash into our Uuid result
    Copy(hash, result, 16);

    //set high-nibble to 5 to indicate type 5
    result[6] &= 0x0F; 
    result[6] |= 0x50; 

    //set upper two bits to "10"
    result[8] &= 0x3F; 
    result[8] |= 0x80; 

    return result;
}

(Note: the endian-ness of your system can affect indices of the above bytes)

Now you can have calls:

uuid = NameToUUID(Namespace_DNS, 'www.stackoverflow.com');
uuid = NameToUUID(Namespace_DNS, 'www.google.com');
uuid = NameToUUID(Namespace_URL, 'http://www.stackoverflow.com');
uuid = NameToUUID(Namespace_URL, 'http://www.google.com/search&q=rfc+4112');
uuid = NameToUUID(Namespace_URL, 'http://stackoverflow.com/questions/5515880/test-vectors-for-uuid-version-5-converting-hash-into-guid-algorithm');

Now back to your question

For version 3 and version 5 UUIDs the additional command line arguments namespace and name have to be given. The namespace is either a UUID in string representation or an identifier for internally pre-defined namespace UUIDs (currently known are "ns:DNS", "ns:URL", "ns:OID", and "ns:X500"). The name is a string of arbitrary length.

The namespace is whatever UUID you like. It can be one of the pre-defined ones, or you can make up your own, e.g.¹:

UUID Namespace_RectalForeignExtractedObject = '8e884ace-bee4-11e4-8dfc-aa07a5b093db'

The name is a string of arbitrary length.

The name is just the text you want to have appended to the namespace, then hashed, and stuffed into a UUID:

uuid = NameToUUID('8e884ace-bee4-11e4-8dfc-aa07a5b093db', 'screwdriver');
uuid = NameToUUID('8e884ace-bee4-11e4-8dfc-aa07a5b093db', 'toothbrush');
uuid = NameToUUID('8e884ace-bee4-11e4-8dfc-aa07a5b093db', 'broomstick');
uuid = NameToUUID('8e884ace-bee4-11e4-8dfc-aa07a5b093db', 'orange');
uuid = NameToUUID('8e884ace-bee4-11e4-8dfc-aa07a5b093db', 'axe handle');
uuid = NameToUUID('8e884ace-bee4-11e4-8dfc-aa07a5b093db', 'impulse body spray');
uuid = NameToUUID('8e884ace-bee4-11e4-8dfc-aa07a5b093db', 'iPod Touch');

Thanks for the thorough explanation. If I could give bonus points for Namespace_RectalForeignExtractedObject I would.

Is it possible to decode the name or the namespace decoded from the UUID?

@Sathesh No, it is not possible to decode a hash; hashes are one-way functions. For example, the entire Star Trek TNG Blu-Ray collection is 81 GB, and has a hash of C5740BBBF2429115276D4AB60A020ED3ADE01192. There is no way to decode that 20-byte hash back into 81 GB. If you really needed to, you can try hashing all possible GUIDs and possible strings until you find the combination that gives the same result. With any luch you'll find it somewhere between forever and eternity.

What encoding should be used for converting the text to byte arrays?

B

BuZZ-dEE

Name and namespace can be used to create a hierarchy of (very probably) unique UUIDs.

Roughly speaking, a type 3 or type 5 UUID is generated by hashing together a namespace identifier with a name. Type 3 UUIDs use MD5 and type 5 UUIDs use SHA1. Only 128-bits are available and 5 bits are used to specify the type, so all of the hash bits don't make it into the UUID. (Also MD5 is considered cryptographically broken, and SHA1 is on its last legs, so don't use this to verify data that needs to be "very secure"). That said, it gives you a way of creating a repeatable/verifiable "hash" function mapping a possibly hierarchical name onto a probabilistically unique 128-bit value, potentially acting like a hierarchical hash or MAC.

Suppose you have a (key,value) store, but it only supports one namespace. You can generate a large number of distinct logical namespaces using type 3 or type 5 UUIDs. First, create a root UUID for each namespace. This could be a type 1 (host+timestamp) or type 4 (random) UUID so long as you stash it somewhere. Alternatively you could create one random UUID for your root (or use the null UUID: 00000000-0000-0000-0000-000000000000 as root) and then create a reproducible UUID for each namespace using "uuid -v5 $ROOTUUID $NAMESPACENAME". Now you can create unique UUIDs for keys within a namespace using "uuid -v5 $NAMESPACEUUID $KEY". These UUIDs can be thrown into a single key-value store with high probability of avoiding collision. This process can be repeated recursively so that if for instance the "value" associated with a UUID key in turn represents some sort of logical "namespace" like a bucket, container or directory, then its UUID can be used in turn to generate more hierarchical UUIDs.

The generated type 3 or type 5 UUID holds a (partial) hash of the namespace id and name-within-namespace (key). It no more holds the namespace UUID than does a message MAC hold the contents of the message it is encoded from. The name is an "arbitrary" (octet) string from the perspective of the uuid algorithm. Its meaning however depends on your application. It could be a filename within a logical directory, object-id within an object-store, etcetera.

While this works well for a moderately large number of namespaces and keys, it eventually runs out of steam if you are aiming for a very large numbers of keys that are unique with very high probability. The Wikipedia entry for the Birthday Problem (aka Birthday Paradox) includes a table that gives the probabilities of at least one collision for various numbers of keys and table sizes. For 128-bits, hashing 26 billion keys this way has a probability of collision of p=10^-18 (negligible), but 26 trillion keys, increases the probability of at least one collision to p=10^-12 (one in a trillion), and hashing 26*10^15 keys, increases the probability of at least one collision to p=10^-6 (one in a million). Adjusting for 5 bits that encode the UUID type, it will run out somewhat faster, so a trillion keys have roughly a 1-in-a-trillion chance of having a single collision.

See http://en.wikipedia.org/wiki/Birthday_problem#Probability_table for the probability table.

See http://www.ietf.org/rfc/rfc4122.txt for more details on UUID encodings.

At a certain level down the hierarchy, can I use a UUIDv5 as the namespace and UUIDv4 as the random key to ensure the collisions in the data itself (that is being identified by this GUID) do not increase the chances of colliding UUIDs? Any performance issues I should know about?

I am new to the concept and was puzzled of what that hierarchy you're talking about is. Where can I see it etc... Some clarity came once I stuck on explanation this might be used to create a reproducible UUID for namespace. I am wondering is there a way to verify that a given UUID (of type either 3 or 5) has been generated using a particular namespace (its UUID)?

E

Eugen Konkov

A name is nothing more than an identifier that is unique within some namespace. The problem is that namespaces are often quite small and the names in one often collide with names in others. For instance, my car's license plate number (name) is unique within my state DMV's namespace, but it's probably not unique in the world; other state DMVs may have used the same name in their own namespaces. Heck, someone else may have a phone number (name) that also matches because that's yet another namespace, etc.

UUIDs can be seen as inhabiting a single namespace so vast that it can provide a unique name for everything; that's what the "universal" means. But how do you map existing names in other namespaces to a UUID?

One obvious solution is to generate a UUID (V1 or V4) for every item to replace the old names in their disjoint namespaces. The downside is that they're a lot bigger, you have to communicate all the new names to everyone who has a copy of your dataset, update all your APIs, etc. Odds are you can't actually get rid of the old names entirely anyway, which means now every item has two names, so did you make things better or worse?

This is where V3/V5 come in. The UUIDs look just as random as V4 but are actually deterministic; anyone who has the right UUID for a namespace can then independently generate the same UUID for any given name within that namespace. You don't need to publish them at all nor even pre-generate them since anyone can create them on the fly as needed!

DNS names and URLs are very commonly used namespaces, so standard UUIDs were published for those; ASN.1 OIDs and X.500 names aren't as common, but standards bodies love them, so they published standard namespace UUIDs for them too.

For all other namespaces, you have to generate your own namespace UUID (V1 or V4) and communicate it to anyone who needs it. If you have several namespaces, having to publish the UUID for each one is clearly not ideal.

This is where the hierarchy comes in: you create one "base" UUID (of whatever type), and then use that as a namespace for naming your other namespaces! That way, you only have to publish the base UUID (or use an obvious one), and everyone can calculate the rest.

For instance, let's stay we wanted to create some UUIDs for StackOverflow; that has an obvious name within the DNS namespace, so the base is obvious:

uuid ns_dns = '6ba7b810-9dad-11d1-80b4-00c04fd430c8';
uuid ns_base = uuidv5(ns_dns, 'stackoverflow.com');

StackOverflow itself has separate namespaces for users, questions, answers, comments, etc., but those are fairly obvious as well:

uuid ns_user     =  uuidv5( ns_base, 'user'     );
uuid ns_question =  uuidv5( ns_base, 'question' );
uuid ns_answer   =  uuidv5( ns_base, 'answer'   );
uuid ns_comment  =  uuidv5( ns_base, 'comment'  );

This particular question is #10867405, so its UUID would be:

uuid here = uuidv5(ns_question, '10867405');

Notice that there's nothing random in this process, so anyone who follows the same logic will get the same answer, yet the UUID namespace is so vast that it will (effectively, given the security of a 122-bit cryptographic hash) never collide with a UUID generated from any other namespace/name pair.

I am wondering why stackoverflow needs to map a uniquely generated big integer to a UUID given that its APIs apparently return only the big integer as a string. Where would the UUID be used if not in the API. It seems we should either select either a UUID or BIGINT ? Why do this hybrid strategy. Yet +1 for the clear explanation in your answer.

UUID V3/V5 were designed for when you need to deterministically convert existing (and likely colliding) namespaces over to one UUID namespace, which is often useful when merging datasets. If that doesn't apply to what you're doing, then go with V1/V4.

This is a fantastic answer. Thank you.

This one should be the accepted answer, thank you.

I think this should be accept answer, thanks

Generating v5 UUID. What is name and namespace?

Follow WeChat

Want to stay one step ahead of the latest teleworks?

相似问题

Platform

Support

Contact US