
UUID or GUID as Primary Keys?

Be Careful!

Tom Harrison
Feb 12, 2017

Nothing says “User friendly” like GUID!

I just read a post on ways to scale your database that hit home with me —
 the author suggests the use of UUIDs (similar to GUIDs) as the primary
key (PK) of database tables.

Reasons UUIDs are Good


There are several reasons using a UUID as a PK would be great
compared to auto-incrementing integers:

1. At scale, when you have multiple databases containing a segment
(shard) of your data, for example a set of customers, using a UUID
means that one ID is unique across all databases, not just the one
you’re in now. This makes moving data across databases safe. Or in
my case where all of our database shards are merged onto our
Hadoop cluster as one, no key conflicts.
2. You can know your PK before insertion, which avoids a round trip DB
hit, and simplifies transactional logic in which you need to know the
PK before inserting child records using that key as their foreign key
(FK).
3. UUIDs do not reveal information about your data, so would be safer
to use in a URL, for example. If I am customer 12345678, it’s easy to
guess that there are customers 12345677 and 12345679, and this
makes for an attack vector. (But see below for a better alternative).
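Point 2 can be sketched roughly as follows. This is only an illustration, assuming Python's standard `uuid` and `sqlite3` modules and two hypothetical tables, `customer` and `address`; none of these names come from the systems I describe in this post.

```python
import sqlite3
import uuid

# Hypothetical schema: keys are stored as 16-byte BLOBs, not strings.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id BLOB PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE address ("
    " id BLOB PRIMARY KEY,"
    " customer_id BLOB NOT NULL REFERENCES customer(id),"
    " city TEXT)"
)

def insert_customer_with_address(conn, name, city):
    """Insert a parent and a child row without asking the DB for the new key."""
    customer_id = uuid.uuid4()  # known before any INSERT runs
    address_id = uuid.uuid4()
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute(
            "INSERT INTO customer (id, name) VALUES (?, ?)",
            (customer_id.bytes, name),
        )
        conn.execute(
            "INSERT INTO address (id, customer_id, city) VALUES (?, ?, ?)",
            (address_id.bytes, customer_id.bytes, city),
        )
    return customer_id

new_id = insert_customer_with_address(conn, "Ada", "London")
```

With an auto-incrementing key, the child insert would have to wait for the database to report the parent's generated ID; here both keys exist before the transaction starts.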

Reasons UUIDs May Not be Good

Don’t be naive
A naive use of a UUID, which might look like
70E2E8DE-500E-4630-B3CB-166131D35C21, would be to treat it as a string,
e.g. varchar(36). Don’t do that!!

“Oh, pshaw”, you say, “no one would ever do such a thing.”

Think twice: in two cases of very large databases I have inherited at
relatively large companies, this was exactly the implementation. Aside
from the 9x cost in size (36 vs. 4 bytes for an int), strings don’t sort as
fast as numbers because they rely on collation rules.
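To make the size comparison concrete, here is a small sketch using Python's standard `uuid` module. It only measures the two representations of the example value above; actual on-disk sizes depend on the database, the column type, and the character set.

```python
import uuid

u = uuid.UUID("70E2E8DE-500E-4630-B3CB-166131D35C21")

text_form = str(u)      # canonical hyphenated hex text
binary_form = u.bytes   # raw 128-bit value

print(len(text_form))                     # 36 characters
print(len(binary_form))                   # 16 bytes
print(len(text_form) / 4)                 # 9.0: text vs. a 4-byte int
print(len(text_form) / len(binary_form))  # 2.25: text vs. the 16-byte form
```

So the 9x figure compares the string form against a 4-byte int; against the 16-byte binary form of the same UUID, the string is 2.25x larger.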

Things got really bad in one company where they had originally decided
to use Latin-1 character set. When we converted to UTF-8 several of the
compound-key indexes were not big enough to contain the larger strings.
Doh!

UUIDs are a pain


Don’t underestimate how annoying it is to have to deal with values that
are too big to remember or verbalize.

Planning for real scaling


If our goal is to scale, and I mean really scale, let’s first acknowledge that
an int is not big enough in many cases, maxing out at around 2 billion,
which needs 4 bytes. We have way more than 2 billion transactions in
each of several databases.
So bigint is needed in some cases and that uses 8 bytes. Meanwhile,
using one of several strategies, databases like PostgreSQL and SQL
Server have a native type that is stored in 16 bytes.
So who cares if it’s twice as large as bigint or four times bigger than int?
It’s just a few bytes, right?

Primary keys get around in normalized databases


If you have a well normalized database, as we do at my current company,
each use of the key as an FK starts adding up.

Not just on disk but during joins and sorts these keys need to live in
memory. Memory is getting cheaper, but whether disk or RAM, it’s
limited. And neither is free.

Our database has plenty of intermediate tables that are mainly
containers for the foreign keys of others, especially in 1-to-many
relations. Accounts have multiple card numbers, addresses, phone
numbers, usernames, and all that. For each of these columns in a set of
tables with billions of accounts, the extra size of foreign keys adds up fast.
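A back-of-the-envelope sketch of how this adds up. The row count and the number of key columns are made-up figures for illustration, not measurements from our databases, and the function ignores index and row overhead.

```python
# Key widths in bytes for the types discussed above.
KEY_BYTES = {"int": 4, "bigint": 8, "uuid": 16}

def key_storage_bytes(rows, key_columns, key_type):
    """Raw bytes spent on key values alone (ignores index and row overhead)."""
    return rows * key_columns * KEY_BYTES[key_type]

# Hypothetical link table: 1 billion rows, each holding its own PK plus two FKs.
rows, key_columns = 1_000_000_000, 3
for key_type in KEY_BYTES:
    gib = key_storage_bytes(rows, key_columns, key_type) / 2**30
    print(f"{key_type:>6}: {gib:.1f} GiB")
```

The absolute numbers matter less than the ratio: every table and every index that carries the key pays the same multiplier.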

It’s really hard to sort random numbers


Another problem is fragmentation — because UUIDs are random, they
have no natural ordering so cannot be used for clustering. This is why
SQL Server has implemented a newsequentialid() function that is
suitable for use in clustered indexes, and is probably the right
implementation for all UUID PKs. It is probable that there are similar
solutions for other databases, certainly PostgreSQL, MySQL and likely
the rest.
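One common idea behind sequential-style identifiers is to put a timestamp in the high-order bytes so that newer values sort after older ones. The sketch below illustrates that idea only; the layout and the name `time_ordered_id` are my own assumptions, and this is not the algorithm newsequentialid() uses.

```python
import os
import time
import uuid

def time_ordered_id(now_ms=None):
    """Build a 128-bit ID: 48-bit millisecond timestamp, then 80 random bits.

    Hypothetical layout for illustration; it is not newsequentialid() and
    does not set the RFC 4122 version/variant bits.
    """
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000
    prefix = (now_ms & 0xFFFFFFFFFFFF).to_bytes(6, "big")
    return uuid.UUID(bytes=prefix + os.urandom(10))

# IDs created at later times compare greater, so an index keyed on them
# grows at its right-hand edge instead of taking inserts at random positions.
older = time_ordered_id(now_ms=1_000)
newer = time_ordered_id(now_ms=2_000)
print(older < newer)  # True
```

Within the same millisecond the order is still random, and the timestamp prefix makes values easier to guess, which is the trade-off discussed in the next section.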

Primary keys should never be exposed, even UUIDs

A primary key is, by definition, unique within its scope. It is, therefore,
an obvious thing to use as a customer number, or in a URL to identify a
unique page or row.
Don’t!

I would argue that using a PK in any public context is a bad idea.

The original issue with simple auto-incrementing values is that they are
easily guessable as I noted above. Botnets will just keep guessing until
they find one. (And they may keep guessing if you use UUIDs, but the
chance of a correct guess is astronomically lower).
Arguably it would be a fool’s errand to try to guess a UUID,
however Microsoft warns against using newsequentialid() because by
mitigating the clustering issue, it makes the key more guessable.

My keys will never change (until they do)


But there’s a far more compelling reason not to use any kind of PK in a
public context: if you ever need to change keys, all your external
references are broken. Think “404 Page Not Found”.

When would you need to change keys? As it happens, we’re doing a data
migration this week, because who knew in 2003 when the company
started that we would now have 13 massive SQL Server databases and
growing fast?

Never say “never”. I have been there and done that, and it has happened
several times just for me. It’s easy to manage up front. It’s way harder to
fix when you’re counting things in the trillions.

Indeed, my current company’s context is a perfect example of why
UUIDs are needed, and why they are costly, and why exposing primary
keys is an issue.

My internal system is external


I manage the Hadoop infrastructure that receives data nightly from all of
our databases. The Hadoop system is linked (bound) to our SQL Server
databases, which is fine — we’re in the same company.

Still, in order to disambiguate colliding sequence keys from our multiple
databases, we generate a pseudo-primary-key by concatenating two
values, the id (PK) of the customer which is unique across databases
(because we planned that), plus the sequence id of the table rows
themselves which may collide.
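The concatenation can be sketched like this. The function name and the `:` separator are assumptions for illustration; I am not showing the exact format we use.

```python
def pseudo_primary_key(customer_id, row_id):
    """Join a globally unique customer ID with a per-database row ID.

    The separator is an assumption; without one, (12, 345) and (123, 45)
    would both produce "12345".
    """
    return f"{customer_id}:{row_id}"

# Row 345 exists in two shards, but belongs to different customers.
print(pseudo_primary_key(1001, 345))  # 1001:345
print(pseudo_primary_key(2002, 345))  # 2002:345
```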

In so doing we have created a tight, and effectively permanent binding
between years of historical customer data. If those primary keys in the
RDBMS change, ours will need to also, or we’ll have some horrifying
before-and-after scenario.

Best of Both? Integers Internal, UUIDs External

One solution used in several different contexts that has worked for me is,
in short, to use both. (Please note: not a good solution — see note about
response to original post below).
Internally, let the database manage data relationships with small,
efficient, numeric sequential keys, whether int or bigint.

Then add a column populated with a UUID (perhaps as a trigger on
insert). Within the scope of the database itself, relationships can be
managed using the usual PKs and FKs.

But when a reference to the data needs to be exposed to the outside
world, even when “outside” means another internal system, they must
rely only on the UUID.

This way, if you ever do have to change your internal primary keys, you
can be sure it’s scoped only to one database. (Note: this is just plain
wrong, as Chris observed)

We used this strategy at a different company for customer data, just to
avoid the “guessable” problem. (Note: avoid is different than prevent,
see below).

In another case, we would generate a “slug” of text (e.g. in blog posts like
this one) that would make the URL a little more human friendly. If we
had a duplicate, we would just append a hashed value.
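A rough sketch of the slug idea, assuming Python's standard `re` and `hashlib` modules. The function names, the `unique_value` argument, and the 12-character suffix are hypothetical choices for illustration, not the implementation we used.

```python
import hashlib
import re

def slugify(title):
    """Lower-case the title and join runs of letters/digits with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))

def unique_slug(title, unique_value, taken):
    """Return a slug; on a duplicate, append a short hash of unique_value."""
    slug = slugify(title)
    if slug in taken:
        suffix = hashlib.sha256(unique_value.encode("utf-8")).hexdigest()[:12]
        slug = f"{slug}-{suffix}"
    taken.add(slug)
    return slug

taken = set()
print(unique_slug("UUID or GUID as Primary Keys? Be Careful!", "post-1", taken))
# uuid-or-guid-as-primary-keys-be-careful
```

A second post with the same title would get the same slug plus a hash suffix. A suffixed slug could in principle still collide, so a real system would also check the result against `taken`.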
Even as a “secondary primary key”, a naive use of UUIDs in string
form is wrong: use the built-in database mechanisms, as values are
stored as 8-byte integers, I would expect.

Use integers because they are efficient. Use the database
implementation of UUIDs in addition for any external reference to
obfuscate.
Chris Russell responded to the original post on this section correctly
noting two important caveats or errors in logic. First, even exposing a
UUID that is effectively an alternate for the actual PK reveals
information, and this is especially true when using the newsequentialid —
 don’t use UUIDs for security. Second, when the relations of a given
schema are internally managed by integer keys, you still have the key-
collision problem of merging two databases, unless all keys are
doubled … in which case, just use the UUID. So, in reality, the right
solution is probably: use UUIDs for keys, and don’t ever expose them.
The external/internal thing is probably best left to things like friendly-
url treatments, and then (as Medium does) with a hashed value tacked
on the end. Thanks Chris!

References and many thanks


Thanks to Ruby Weekly (which I still read wistfully, although Scala is
growing on me), Starr Horne’s great blog from Honeybadger.io on this
topic, the always funny and smart post on Coding Horror by Jeff
Atwood, co-founder of Stack Overflow, and naturally a fine question on
one of Stack Overflow’s sites at dba.stackexchange.com. Also a nice post
from the MySQL Server Team, another from theBuild.com and of course
MSDN which I linked earlier.

Meta: Why I Blog


I learned a lot writing about this.

I started out reading email on a Sunday afternoon.

Then came across an interesting post by Starr, which got me thinking his
advice might have unintended outcomes. So I googled and learned way
more about UUIDs than I knew before, and changed my fundamental
understanding and disposition about how and when to use them.
Halfway through writing this, I sent email to the team leads at my
company wondering if we had considered one of the topics I discussed.
Hopefully we’re ok, but I think we may have avoided at least one
unexpected surprise in code scheduled for release this week.

Note that all of these are entirely selfish reasons :-)

Hope you like it, too!

Tom Harrison
Medium member since Mar 2017
30 Years of Developing Software, 20 Years of Being a Parent, 10 Years of Being Old. (Effective: 2019)


Tom Harrison’s Blog

Make it Work, Make it Beautiful, Make it Fast

Responses
Conversation between Chris Russell and Tom Harrison.

Chris Russell
Jun 10, 2017

Thank you for taking the time to share your experiences… this is a valuable topic.
There are a couple of points which I think should be made.

1. On security: UUIDS are intended to be unique, not secure. They are not
considered `hard-to-guess`. Harder to manually increment? sure… harder for a
script? Maybe. But, depending on the…


Tom Harrison
Jun 11, 2017

Chris — all excellent points. Thank you very much.


I am adding several corrections to my post. I am especially sensitive to how badly and
repeatedly we mess up security in the software business. So while it seems “less
insecure” that’s a whole different thing than “secure”. There was a time when we used
to think that the…

Conversation between Bit-Booster! and Jim E. Rustle.

Bit-Booster!
Jun 9, 2017

The 4th reason UUID / GUID’s are good is really important: it allows users to change
their names, their SIN numbers, their email addresses, and your system can handle
that if you key with UUID / GUID.
Yes, don’t store them as Strings in a DB. Store them as BINARY(16) or VARBINARY if
your db supports those column types.


Jim E. Rustle
Jun 9, 2017

it allows users to change their names, their SIN numbers, their email addresses, and
your system can handle that if you key with UUID / GUID.

Not sure I follow, what do names, email addresses, or in other words, non-primary key
columns, have to do with the primary key?

Conversation between Jonathan Garbee and Tom Harrison.

Jonathan Garbee
Oct 12, 2017

Botnets will just keep guessing until they find one.

If this leads to a data leak, it means the web application isn’t engineered properly for
security in the first place. An access control list should be in place to prevent
unauthorized access to resources. There is no security vulnerability using PKs in
public if your application is securely engineered.

Tom Harrison
Oct 12, 2017

Hi Jonathan —
Thanks for your response. To be sure, hiding a sequence id behind a GUID is merely
one step that is really not much more than “security by obscurity” and proper security
is foundational. I think the security case here is more around information leaking —
 that is, exposing the internals of your application may…

Conversation between Nicolas Grilly and Tom Harrison.

Nicolas Grilly
Jun 9, 2017

the 9x cost

Are you sure about the 9x cost?


A UUID is made of 16 bytes, and its representation is made of 36 characters:
36 / 16 = 2.25 (instead of 9)


Tom Harrison
Jun 11, 2017

I have clarified my post. My comparison was between the 36 characters of a
string-representation of UUID vs. 4 bytes of a bigint which is … well, gosh, that’s
apples-to-oranges, isn’t it. Re-clarifying the post. Thanks :-)

Conversation between Nicolas Froidure and Tom Harrison.

Nicolas Froidure
Feb 19, 2017
By combining UUIDs and Auto Incremented Primary Keys, you lose the advantage of
generating ids on the client side (to perform PUTs instead of POSTs) which is the
main reason why people use UUIDs.


Tom Harrison
Feb 20, 2017

At first this is what I thought, and am quite possibly wrong, but as I looked at it, SQL
Server, at least, implements the sequential UUID as a function, which I assumed
means you could call the function (e.g. select newsequentialid()) then use the
returned value in subsequent inserts. Because it is still a GUID … right? … it should be
guaranteed…


Nicolas Froidure
Feb 20, 2017

By client I meant the HTTP API client, an additional round trip to the server before
generating a resource is probably not an option (apart from maintaining a pool of
directly usable ids in the frontend application but it adds a lot of complexity).
Thank you too for your helpful posts ;).

Applause from Tom Harrison (author)

Alec Zopf
Jun 10, 2017

It is probable that there are similar solutions for other databases, certainly
PostgreSQL, MySQL and likely the rest.

Indeed! The Postgres extension “uuid-ossp” contains a useful function to generate a
UUID, called “uuid_generate_v1mc()”. The first part of the UUID is time-based (i.e.
roughly sequential), and the second part is random.
See: https://www.postgresql.org/docs/9.5/static/uuid-ossp.html

Applause from Tom Harrison (author)

Ricardo Puerto
Sep 8, 2017

2 billion

4 billion, because there is no reason to use SIGNED ints, right?

Applause from Tom Harrison (author)

Jens Alfke
Jun 10, 2017

Aside from the 9x cost in size, strings don’t sort as fast as numbers because they rely
on collation rules.

Where did you get “9x cost in size” from? Hex strings are 2x as large as binary. The
unnecessary hyphens add a little bit, but it’s still only 36 bytes when it could be 16,
which is 2.25x.
I don’t think there’s much overhead for string vs numeric sorting if you’re using a
default collation rule; maybe if your db uses Unicode…

Conversation between Alex Maslakov and Tom Harrison.

Alex Maslakov
Jun 10, 2017

Don’t!

I will!


Tom Harrison
Jun 11, 2017

You’re just a troublemaker, aren’t you :-)


Applause from Tom Harrison (author)

Pierre Phaneuf
May 3, 2018

Another problem is fragmentation: because UUIDs are random, they have no
natural ordering so cannot be used for clustering.

This problem can be an advantage, when using a distributed database, like Cosmos DB
or Cloud Spanner, where you’d want your primary keys to be uniformly distributed, to
avoid a partition becoming a “hot spot”.

Applause from Tom Harrison (author)

Bruno Brant
Jun 10, 2017

no one would ever do such a thing

Sorry.

Applause from Tom Harrison (author)

João Almeida
Sep 8, 2017

I would argue that using a PK in any public context is a bad idea.

Like when people don’t want to move their data to another database, until they do. I’m
a big, big fan of not putting any business meaning on top primary keys. Couldn’t agree
more with this section.

Conversation between 張旭 and Tom Harrison.

張旭
Jun 10, 2017

we would generate a “slug” of text (e.g. in blog posts like this one) that would make
the URL a little more human friendly.
“/uuid-or-guid-as-primary-keys-be-careful-7b2aa3dcb439”


Tom Harrison
Jun 11, 2017

Exactly :-)

Applause from Tom Harrison (author)

Alex Nishikawa
Dec 13, 2017

I keep coming back to this article (and the comments section). Thanks for writing this
Tom.

Conversation between Julien Desrosiers and Tom Harrison.

Julien Desrosiers
Jun 9, 2017

When would you need to change keys? As it happens, we’re doing a data migration
this week, because who knew in 2003 when the company started that we would now
have 13 massive SQL Ser...

I’m curious. What kind of data migration did you do for you to need to change the
columns’ UUIDs in that particular case? Why not just dump the data and re-INSERT
it elsewhere while keeping the same UUIDs?
Thanks


Tom Harrison
Jun 10, 2017

The problem is simple: we don’t use UUIDs. If we had used them, it would be as trivial
as you say. Because we use auto-incremented integer (or bigint) values for keys the
process of moving is impossibly complex. (Sigh)
Conversation between Mingwei Zhang and Tom Harrison.

Mingwei Zhang
Jun 9, 2017

Meta: Why I Blog

I enjoyed this section in particular. Always wondering why should people blog about
anything.
Thank you for sharing!


Tom Harrison
Jun 11, 2017

Most other bloggers do it for the fame, or just the amazing flood of cash it generates.
I haven’t yet achieved these heights of excellence (but I have only been trying for a
decade or so, so I ain’t givin’ up).
For now, I settle for the benefits of forcing me to think through problems that are…

Conversation between Carlos Henrique Romano and Tom Harrison.

Carlos Henrique Romano


Jun 10, 2017

Thanks for posting. I wonder what's the problem of an approach like Youtube.
You would use an internal (as in "not exposed") alphabet and convert from an integer
to a string (which you can make public) and vice-versa. Let's say the integer ID 12345
would be converted to something like aXpct; when you receive a request with…


Tom Harrison
Jun 11, 2017

I think it’s important to distinguish how a value is represented versus how it is stored.
Of the UUID solutions I researched, all store the value internally as an integer which
can be encoded and decoded as a hash-like value, then store the hash as its decimal
numeric value. The utility of hashing is that the algorithm can be forced to produce
an…


Carlos Henrique Romano


Jun 11, 2017

Do read the comments (and today’s clarifications) about the risk of exposing UUIDs,
or even hashed values both from a security or sharding standpoint

No doubt internal information should not be exposed, you explained the reasons why;
but you need someway to map the external value to the internal one and what I'm
saying here is that the Youtube approach sounds like a pretty good (and simple!) one.

Conversation between Clifford Hammerschmidt and Tom Harrison.

Clifford Hammerschmidt
May 3, 2018

Consider using tuids: http://github.com/tanglebones/pg_tuid


A major issue with uuids is the lack of data locality in the indexing, which hurts
caching of index pages.


Tom Harrison
May 4, 2018

I like the approach of pg_tuid — you get natural clustering from the timestamp
component and it’s atomic.
But I wonder if the notion of data locality in the index is really a problem these days? I
think SSD drives changed this equation dramatically.

Conversation between Allan Wind and Tom Harrison.
Allan Wind
Jun 11, 2017

The definition of a primary key is an identity that never changes. If the int
primary key changes, then presumably so would the uuid key, which is what you
expose externally and still problematic. What am I missing? To me it means you have
a schema problem either the wrong key, or possible table layout. One system I used,
has a redirect feature…


Tom Harrison
Jun 11, 2017

Agree on definition of PK — never changes. And yes, systems like WordPress that use
slug-based URLs implement the redirect with a secondary table as you suggest (and so
did a system a company I was at some years ago). This allows the author to change the
title while ensuring all copies still point to the correct page. Most modern systems,
including…

Conversation with Tom Harrison.

Danielo Rodríguez
Sep 9, 2017

I think you are restricting your story to relational SQL databases, but anyway.
MongoDB has its own implementation of UUIDs, the BSON type which is used as
default for the “primary-key” _id. It can be represented as a string, sure, but it is not
saved as a string, and it is easily sortable. In fact, it includes some extra…


Tom Harrison
Sep 10, 2017

Wow, totally great perspective! Yes, this conversation is completely focused on the
relational DB front, and yes, the problems that RDBMS solve create a far different set
of design constraints than document-oriented databases like Mongo.
I wrote this article when I first understood how the company I was with had an…

Conversation between Pierre Phaneuf and Tom Harrison.

Pierre Phaneuf
May 3, 2018

Not just on disk but during joins and sorts these keys need to live in memory.

If you’re really, REALLY going to scale, then these just won’t fit in memory anyway.


Tom Harrison
May 4, 2018

I could perhaps have changed my language to say that there’s a performance benefit to
having the index in memory vs. spilled to disk, even if not always possible.
Today, there’s almost no case now where more memory is not an option — we’re
running jobs on Hadoop that have access to several terabytes of memory as needed…

Conversation with Tom Harrison.

Alexey Migutsky
Jun 11, 2017

And neither is free.

Still cheaper than programmer’s and user’s time.


Tom Harrison
Jun 11, 2017
I am sure I agree — thanks for the video post which I’ll check out later.
I think the size, and the few bytes, and string versus native type all only begin to
matter at scale. We have a database that is highly normalized — pretty much all
repeating data is extracted to a separate table and referenced by FK in the parent…

Conversation between Malcolm Hall and Tom Harrison.

Malcolm Hall
May 8, 2017

I’ve tried combining them but there were issues with join overhead, cascade deletes,
enforcing uniqueness of the identifier. After fighting the database I ended up going
back to using a string key.


Tom Harrison
Sep 10, 2017

Wow, interesting. So do you mean having some strings, some native UUID type, some
integer?
The uniqueness constraint is funny, in a way. Here’s a value whose scope is proximate
to infinity, but also expected to have a trivially small case of key collisions. A trillion or
so ain’t what it used to be!

Conversation with Tom Harrison.

Nate Bessa
Apr 30, 2018

So, in reality, the right solution is probably: use UUIDs for keys, and don’t ever
expose them

So is the final recommendation of your article to have both an internal UUID as a pk
and a second UUID to serve for external references?

Tom Harrison
May 4, 2018

It’s a year or so after I have written the article, but I think my recommendations would
be:

 if you have a large database or some other reason to physically segment or relocate
data to a separate database instance (or will have this), you should use a GUID as
the PK.
 For security it’s best not to…

Conversation with Tom Harrison.

Kayvan Arianpour
May 24, 2018

Although it is true, many times we over-normalize.
Our database has plenty of intermediate tables that are mainly containers for the
foreign keys of others, especially in 1-to-many relations. Accounts have multiple card
numbers, addresses, phone numbers, usernames, and all that. For…


Tom Harrison
May 24, 2018

“over-normalize” … when I was learning the theory and practice of 3rd normal form,
the rule I tended to follow was “normalize first, de-normalize only when needed”. At
my current company we have a database schema that scrupulously followed this
approach, and I can say after 15 years, this strategy holds up well every day for us.

Conversation with Tom Harrison.

Jens Alfke
Jun 10, 2017

because UUIDs are random, they have no natural ordering so cannot be used for
clustering.
This is a very weird statement, considering that using UUIDs as keys is extremely
common in NoSQL databases that use clustering, for example Couchbase Server. It’s
typical in such a database, as in a DHT, to map records to shards by hashing the key,
so there’s no need for any “natural ordering”. In fact, ordering would mess up the
distribution by…


Tom Harrison
Jun 11, 2017

Sharding, yes, but clustering I don’t think so (I could be wrong).


Typically when sharding, the shards are separated based on the modulo of the last
digits of the key or some other rubric based on the key, thus providing a simple
functional method for determining which shard a given record is located on.

Conversation with Tom Harrison.

Fedor Losev
Apr 23, 2018

The article talks about giving up global scalability and development simplicity.
Instead, some small space and performance gains are suggested by adding a non-
trivial layer of complexity. No reason was convincing in our days of very cheap storage
space and very expensive human resources. If anything, it was convincing for exactly
the opposite, go…


Tom Harrison
Aug 19, 2018

I think I generally agree with your points. Most databases will not hit the scale or size
where GUIDs have value, so yes, a premature optimization (and complication, etc.) in
most cases.
I don’t agree that the cost is approximately the same. The data system that I am
working with now has hundreds of tables, over 10’s of…
Conversation with Tom Harrison.

James Van Leuven


Apr 27, 2018

Ok so I’m a little unsure about something, as I want to do what you’ve suggested (I
have a preference for bigint if it’s a one off and we’re not sharding or load balancing.
As well as having also done so with the UUID and agree sequential management tends
to need to be done using the timestamp column, rather than the pk.


Tom Harrison
Aug 19, 2018

If you have the space, a trigger that generates a UUID for storage in a secondary
column is a simple idea. This feels like “future-proofing” which could either turn out
to be an act of prescience (in 5 or 10 years) or an act of premature optimization :-).
Your approach seems efficient and simple, though, so if you think it may be
warranted, it’s probably a good approach.

Conversation with Tom Harrison.

Jeremy Solarz
Jun 29, 2017

the id (PK) of the customer which is unique across databases

Why would there be collisions when the PK is unique across databases?


Tom Harrison
Jun 30, 2017

Thanks for your question. In our case we defined a value for each customer that is
unique across the company, and thus across databases. That was a business choice (a
good one) rather than an auto-increment. Thus, adding it assures uniqueness. I have
clarified the text.
Conversation with 張旭.

張旭
Jun 10, 2017

using a naive use of UUIDs in string form is wrong: use the built-in database
mechanisms.

What are the mechanisms?

Could you give me some keywords to search for?
For example, what does this mean in MySQL?


張旭
Jun 10, 2017

Oh, I think I got it:

https://dev.mysql.com/doc/refman/5.7/en/miscellaneous-functions.html#function_uuid

Conversation with Tom Harrison.

Phil Walsh
Aug 18, 2018

Hi Tom — I keep coming back to this post, but never quite resolving the question I
have in my head — maybe you’d like to comment/advise?
We are currently using uuid’s as our primary keys, but:

 we are generating them in python and passing them into the database (rather than
using mysql’s UUID()…


Tom Harrison
Aug 19, 2018

Hi Phil —
I don’t see any reason why it would be better or different to generate UUIDs
programmatically rather than letting the DB do it. The number of possible values is so
vast that collision is improbable.
The database probably has a UUID or GUID type (which is probably just a 128-bit
number), and…

Conversation with Tom Harrison.

Jonathan Garbee
Oct 14, 2017

Botnets will just keep guessing until they find one.

With *proper security* setup via an access control list. Random guesses won’t matter.
That’s why having proper security built into the application as it is designed is far
superior than any obfuscation techniques you can think to implement to refute access.


Tom Harrison
Oct 15, 2017

Thanks for comment, but I think my point is different. We all agree that security-by-
obscurity is insufficient. Proper security focuses on many other aspects of hardware,
software, database and systems design, and is an entire separate discussion.

Conversation with Tom Harrison.

Tom Lo
May 29, 2018

I have a question. I heard a person argue that using a UUID as a primary key decreases the
chance of database contention, as UUIDs are random enough to make B+ tree inserts
random enough that the tree will not become unbalanced when inserting and
deleting. Is that correct? I have never been able to find any material related to this argument.


Tom Harrison
Aug 19, 2018
I think this is true: the UUID address space is so huge that the hashing algorithm results
in very few clusters in the btree. I have no idea how this might affect performance,
resource usage, or efficiency.
Good question!

Conversation between Bit-Booster! and Tom Harrison.

Bit-Booster!
The 4th reason UUID / GUID’s are good is really important: it allows users to change their
names… Yes, don’t store them as Strings in a DB. Store them as BINARY(16) or VARBINARY if
your db supports those column types.

Tom Harrison
Jun 10, 2017

I think we’re on the same page. We agree that a properly normalized database is a
good thing, and allows the kind of change you refer to. The question is, what value do
you choose for those primary (and thus, foreign) keys. A usual choice is integer, or
perhaps bigint in the case that you’ll have more than a few billion records.

Conversation with Tom Harrison.

Wout Mertens
Mar 19, 2018

I try to use natural keys whenever I can. Natural keys are intrinsic values of an object
that uniquely identify it. In cases where they are almost unique, I “slugify” them like
you do by adding a suffix.
It makes working with database data a lot more intuitive, at the cost of some bytes.
Just thought I’d mention that as an option when selecting a primary key.


Tom Harrison
Aug 19, 2018

Sorry I missed this comment when published.


Natural keys have been out of modern database design for decades, now. Natural keys
seem intuitive, because they are at the start of a database. But as data sizes grow, they
result in a number of undesirable characteristics.

Conversation with Tom Harrison.

optimuspaul
Jun 11, 2018

8-byte integers

A uuid is 16 bytes though, so it would need to be spread across two columns if stored
as bigints.


Tom Harrison
Aug 19, 2018

I don’t think that’s necessarily true — if the database has a type for the internal storage
of UUID, it presumably creates a single column of that type having size large enough
to contain it, in this case, 128 bits (16 bytes).
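For databases that lack a native UUID type, the two-column idea raised above can be sketched as follows. This is a minimal illustration, not anything from the article: the helper names `uuid_to_bigints` and `bigints_to_uuid` are made up, and it assumes signed 64-bit BIGINT columns.

```python
import struct
import uuid
from typing import Tuple


def uuid_to_bigints(u: uuid.UUID) -> Tuple[int, int]:
    """Split the 16 bytes of a UUID into two signed 64-bit integers (high, low)."""
    # ">qq" = big-endian, two signed 8-byte integers, matching a signed BIGINT.
    high, low = struct.unpack(">qq", u.bytes)
    return high, low


def bigints_to_uuid(high: int, low: int) -> uuid.UUID:
    """Rebuild the UUID from the two signed 64-bit halves."""
    return uuid.UUID(bytes=struct.pack(">qq", high, low))


example = uuid.UUID("70e2e8de-500e-4630-b3cb-166131d35c21")
high, low = uuid_to_bigints(example)
assert bigints_to_uuid(high, low) == example
```

A single 16-byte column (a native UUID type or BINARY(16)) avoids the compound key that this split would require.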

Conversation between Carlos Henrique Romano and Tom Harrison.

Carlos Henrique Romano


Thanks for posting. I wonder what's the problem of an approach like Youtube. You would use an
internal (as in "not exposed") alphabet and convert from an integer to a string (which you can
make public) and…

Tom Harrison
I think it’s important to distinguish how a value is represented versus how it is stored. Do read
the comments (and today’s clarifications) about the risk of exposing UUIDs, or even hashed
values both from a security or sharding…

Carlos Henrique Romano


Jun 11, 2017
then store the hash as its decimal numeric value

Well, that's not what I said. The hash would never be stored. It is just a value that,
when passed to a function, returns the ID that you can use to find the record… f(hash)
= ID
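One way to read f(hash) = ID is as a reversible encoding of the integer key over a private alphabet. The sketch below is an assumption for illustration only: it is plain base-62 with a shuffled alphabet, not YouTube's actual scheme, and the seed, `encode_id`, and `decode_id` are hypothetical names. It obscures the integer but is not encryption.

```python
import random
import string

# Hypothetical private alphabet: the 62 alphanumerics in a shuffled order.
# The seed stands in for a secret kept out of public view.
_chars = list(string.digits + string.ascii_letters)
random.Random(20170611).shuffle(_chars)
ALPHABET = "".join(_chars)
BASE = len(ALPHABET)
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def encode_id(n: int) -> str:
    """Turn a non-negative integer ID into a public string."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, BASE)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def decode_id(s: str) -> int:
    """f(hash) = ID: recover the integer ID from the public string."""
    n = 0
    for ch in s:
        n = n * BASE + INDEX[ch]
    return n


assert decode_id(encode_id(12345678)) == 12345678
```

Only the integer is stored; the public string is computed on the way out and decoded on the way in.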

Conversation with Tom Harrison.

Bob Wakefield
Sep 9, 2017

I actually rarely use GUIDs or UUIDs. My challenge has always been MDM issues like
generating customer numbers. They should actually be numbers, for the reasons you
brought up about strings. They also have to be unique enterprise-wide. Do you have any
advice, tips, or algorithms you like to use to generate customer numbers? I’ve written
one that takes…

1 response

Tom Harrison
Sep 10, 2017

Hey Bob — thanks for the response. I am mos’ def’ not an expert on this stuff, but I can
offer what I have seen done. UUIDs are pretty much bad for everything except what
they are designed for, which is to count really high without having to care what your
next number is. Whether UUIDs are represented as a hex value or a decimal, they’re
both numbers…

Conversation with Tom Harrison.

zaraguo
Jun 22, 2017

Thanks for your post.


I have a little question. Is a “secondary primary key” just the same as a “secondary
key”? I can’t find a definition of “secondary primary key”.

1 response
Tom Harrison
Jun 23, 2017

There’s no real thing called “secondary primary key” — it’s just something I made up :-)
What I meant is that you would have a column having the same semantics as a
primary key (unique, immutable, not null, indexed, etc.) but having the UUID value
rather than the sequential integer value. It’s secondary in that it is not…
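A minimal sketch of that layout, using SQLite only because it ships with Python: a sequential integer primary key for internal use, plus a unique, non-null 16-byte UUID column for external use. The table and column names are made up for illustration, and immutability is not enforced here.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer (
        id        INTEGER PRIMARY KEY,   -- internal sequential key, used for joins
        public_id BLOB NOT NULL UNIQUE,  -- 16-byte UUID, the only ID exposed externally
        name      TEXT NOT NULL
    )
    """
)


def add_customer(name: str) -> uuid.UUID:
    """Insert a customer and return the UUID to hand out externally."""
    public_id = uuid.uuid4()
    conn.execute(
        "INSERT INTO customer (public_id, name) VALUES (?, ?)",
        (public_id.bytes, name),
    )
    return public_id


def find_customer(public_id: uuid.UUID):
    """Look up the internal row by its external UUID (served by the UNIQUE index)."""
    return conn.execute(
        "SELECT id, name FROM customer WHERE public_id = ?",
        (public_id.bytes,),
    ).fetchone()
```

Foreign keys in other tables would reference the small integer `id`; the UUID appears in exactly one place.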

1 response
Conversation with Tom Harrison.

Christopher Smith
Jun 11, 2017

There’s a presumption that UUIDs must be random, which isn’t true. Type 1 UUIDs
are particularly useful because they naturally sort by time and also can be grouped by
machine (well, by MAC address). That can be very helpful in a number of situations.
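As a rough illustration of the time and machine information carried by a version 1 UUID, the sketch below reads both fields back with Python's standard `uuid` module. The helper names are hypothetical. Note that the string form puts the low bits of the timestamp first, so ordering by time means sorting on the extracted timestamp rather than on the raw text.

```python
import datetime
import uuid

# 100-nanosecond ticks between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def v1_timestamp(u: uuid.UUID) -> datetime.datetime:
    """Return the creation time embedded in a version 1 UUID, in UTC."""
    if u.version != 1:
        raise ValueError("not a version 1 UUID")
    unix_seconds = (u.time - UUID_EPOCH_OFFSET) / 1e7
    return datetime.datetime.fromtimestamp(unix_seconds, tz=datetime.timezone.utc)


def v1_node(u: uuid.UUID) -> int:
    """Return the 48-bit node field (typically the MAC address)."""
    return u.node


# The node value here is an arbitrary example, not a real MAC address.
sample = uuid.uuid1(node=0x0123456789AB)
print(v1_timestamp(sample), hex(v1_node(sample)))
```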

1 response

Tom Harrison
Sep 10, 2017

Yes, you’re right, this wasn’t clear. Randomness is not important to the idea that you
can use something sufficiently likely to be unique that you don’t have to check the
constraint, even across database instances.
So most definitely having a predictable pattern in a value has huge utility. The use of a
MAC as an identifier…

Conversation with Tom Harrison.

Renan Le Caro
Jun 9, 2017

“values that are **two** big to remember”: shouldn’t it be “too” big?

1 response
Tom Harrison
Jun 11, 2017

Corrected typo. Thanks!

Andrey Zharikov
Oct 5, 2017

UUIDs are much better for log searches during debugging / investigations. Say we
make a todo service, and we have some todo list item object with key “10” and want to
find recent log entries for it. It’s very easy to find a gazillion records, because
many log entries will mention 10 for whatever reason. With UUIDs, a log search only
gives…


Terence Alderson
Nov 2, 2018

We got lazy at my company and just passed out 50 reserved keys to each client, to
be used as primary keys. A lot of people thought we were crazy, but we used a bigint,
and if you do the math on that … it’s pretty safe. If you have a million users who do a
billion transactions that produce a new key every day, that would still last you
something in…


Josef Hovad
Feb 9

Very nice post! As a Postgres user, I guess the pgcrypto function gen_random_uuid() looks
secure enough (at least as far as I investigated). I like your approach of combining a short
int for internal usage and a uuid for any external …

Chris Seufert
Nov 17, 2017

The external/internal thing is probably best left to things like friendly-url treatments,
and then (as Medium does) with a hashed value tacked on the end. Thanks Chris!

The slug part of a Medium URL is not canonical; it’s just there to assist with SEO. You
can delete any part of it and Medium will figure out which article you want from the
id at the end. I would not be surprised if this is just the last 6 bytes (12 chars) of the
UUID for the article.

Milo van der Linden


Apr 30, 2018

Great article!

Dzintars Klavins
Mar 18

Does this apply to event sourced distributed systems where every domain service owns
its own database?
Alexey Migutsky
Jun 11, 2017

It’s just a few bytes, right?

Actually, it is “just a few bytes”. Here is a good argument from Greg Young on the cost
of data: https://youtu.be/JHGkaShoyNs?t=645
It’s worth watching the whole video to get some ideas about the tradeoffs and their
relation to the business domain and the chosen architecture.

Adam Arold
Mar 11

UUIDs are a pain

You might want to add “…in my opinion”. For example, I don’t think they are a pain; I
memorize UUIDs much better than numbers.

Alexey Migutsky
Jun 11, 2017

It’s easy to manage up front

How would you manage this when you need to expose some kind of ID that must be
unguessable by an attacker?

Thomas Riedel
Mar 11

But never, ever use GUIDs as internal references inside the database system.
Apparently, it is recommended a lot.

Zax
Sep 10, 2017

Thank you for this insightful article, Tom.


I would kindly suggest that you revise/rewrite this piece in a way that requires
fewer “jumps”, clarifications and negations in parentheses. It just feels to me as a reader
that you never fully make your point.


Jonathan Garbee
Oct 12, 2017

if you ever need to change keys, all your external references are broken
UUIDs shouldn’t ever require the keys to be changed, so this is puzzling. That aside, it is
entirely feasible to build a system that stores the old IDs for routing and forwards
them to the new ones.

Chris Seufert
Nov 17, 2017

If our goal is to scale, and I mean really scale let’s first acknowledge that an int is not
big enough in many cases, maxing out at around 2 billion, which needs 4 bytes. We
have way...

What about storing the UUID as a BINARY(16)? Then it’s only 2x larger than a
BIGINT. Surely storing it as a pretty-formatted hex string is the worst way possible
in terms of storage size; even removing the dashes would be a significant
improvement (11%).
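The size comparison can be checked quickly with Python's standard `uuid` module. This is a rough sketch that counts raw bytes only and assumes one byte per character for the string forms; per-row overhead and multi-byte character sets are ignored.

```python
import uuid

u = uuid.UUID("70e2e8de-500e-4630-b3cb-166131d35c21")

# Raw payload size in bytes for each way of storing a key.
sizes = {
    "INT": 4,
    "BIGINT": 8,
    "BINARY(16)": len(u.bytes),          # the UUID's 128 bits
    "CHAR(32), no dashes": len(u.hex),   # hex digits only
    "CHAR(36), with dashes": len(str(u)),
}

# Fraction saved by dropping the four dashes from the text form.
dash_saving = 1 - len(u.hex) / len(str(u))

for name, size in sizes.items():
    print(f"{name:<22} {size:>2} bytes")
print(f"removing dashes saves {dash_saving:.0%}")
```

BINARY(16) is twice the size of a BIGINT, and the dashed text form is more than twice the size of BINARY(16).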