Permanent identifiers should not carry data. Embedding data in an ID is, like, the cardinal sin of data management. You always run into situations where the thing you thought "surely this never changes, so it's safe to squeeze into the ID to save a lookup" changes anyway. Then people suddenly find out they have a new gender identity, and they need a different final digit in their ID numbers too.
Even if nothing changes, you can run into trouble. Norwegian PNs have your birth date (in DDMMYY format) as the first six digits. Surely that doesn't change, right? Well, wrong: although the date doesn't change, your knowledge of it might. Immigrants who didn't know their exact date of birth were assigned 1 Jan by default... And then people with actual birthdays on 1 Jan got told, "sorry, you can't have that as your birth date, we've run out of numbers in that series!"
Librarians in the analog age can be forgiven for cramming data into their identifiers to save a lookup. When the lookup is in a physical card catalog, that's somewhat understandable (although I bet they could run into trouble over it too). But when you have a powerful database at your fingertips, use it! Don't make decisions you will regret just to shave off a couple of milliseconds!
I wrote a CRUD app for document storage. It had user IDs and document IDs. I wrote a method GetDocumentForUser(docID, userID) that checked permissions for that user and document and returned the document if permitted. I then, stupidly, called that method as GetDocumentForUser(userID, docID), and it took me a good half hour to work out why it never returned anything.
It never returned anything because a valid userID will never be a valid docID. If I had used integers it would have returned documents, and I probably wouldn't have spotted it while testing, and I would have shipped a change that cheerfully handed people other people's documents.
I will put up with a fairly considerable amount of performance hit to avoid having this footgun lurking. And yes, I know there are other ways around this (e.g. types) but those come with their own trade-offs too.
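For what it's worth, the typed-ID escape hatch can be pretty cheap. A minimal Python sketch (names hypothetical, not from the original app) where swapping the arguments fails loudly instead of silently returning nothing:

```python
from dataclasses import dataclass

# Distinct wrapper types: an accidental swap now raises
# instead of quietly querying with mismatched IDs.
@dataclass(frozen=True)
class UserId:
    value: int

@dataclass(frozen=True)
class DocId:
    value: int

def get_document_for_user(doc_id: DocId, user_id: UserId) -> str:
    # runtime guard for callers without a type checker
    if not isinstance(doc_id, DocId) or not isinstance(user_id, UserId):
        raise TypeError("argument order mixed up: expected (DocId, UserId)")
    return f"document {doc_id.value} for user {user_id.value}"
```

A static checker like mypy flags the swapped call at build time; the isinstance guard catches it at runtime even without one.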
PlanetScale wrote up a really good article on why incrementing primary keys beat randomly generated primary keys when it comes to B-tree performance: https://planetscale.com/blog/btrees-and-database-indexes
It is not just about it being hard to guess a valid individual identifier in a vacuum. Random (or at least random-ish) values, be they UUIDs or undecorated integers, are in this context also about making it hard to guess one identifier from another, or from a selection of others.
Wrt: "it isn't x it is y" form: I'm not an LLM, 'onest guv!
I'm sure every DBA has a war story that starts with a similar decision in the past.
A hash index is ideally suited to UUIDs, but for some reason Postgres hash indexes cannot be unique.
> the impact to inserts and retrieval of individual items or ranges of values from the index.
Classic OLTP vs OLAP.
Even worse, some tools generate random strings and then ADD hyphens to make them look like UUIDs (even though they're not, as the UUID version field is filled randomly as well). I can't wrap my head around why, e.g.:
https://github.com/VictoriaMetrics/VictoriaLogs/blob/v1.41.0...
Wrong. Don't use a B-tree for random indexes; there's a HASH index exactly for this:

CREATE INDEX [index_name] ON [table_name] USING HASH ([column_name]);

And of course we could come up with many ways to generate our own IDs and make them unique, but we have the following requirements.
- It needs to be a string (because we allow composing them to 'derive' keys)
- A client must be able to create them (not just a server) without risk of collisions
- The time order of keys must not be easily guessable (as the ID is often leaked via references, which could 'betray' not just the existence of a document but also its relative creation time wrt others)
- It should be easy to document how any client can safely generate document IDs.
The lookup performance is not really such a big deal for us. Where it is we can do a projection into a more simple format where applicable.
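A sketch of how a client could meet those requirements with nothing but the standard library (function names hypothetical): 128 random bits as a URL-safe string, no embedded timestamp, trivially composable.

```python
import secrets

def new_doc_id() -> str:
    # 128 random bits, URL-safe; any client can mint one with
    # negligible collision risk and no creation-time leak.
    return secrets.token_urlsafe(16)

def derive_key(parent_id: str, suffix: str) -> str:
    # string IDs compose naturally into derived keys
    return f"{parent_id}/{suffix}"
```

Documenting "generate 16 cryptographically random bytes and base64url-encode them" is also about as short as a client-side ID spec gets.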
While that is often a neat solution, do not do it by simply XORing the numbers with a constant. Use a block cipher in ECB mode (if you want the ID to be short, something like the NSA's Speck comes in handy here, as it can be instantiated with a 32- or 48-bit block).

And do not even think about using RC4 for this (I've seen it multiple times), because that is completely equivalent to XORing with a constant.
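To make the Speck suggestion concrete, here is a toy Speck32/64 sketch (16-bit words, 22 rounds, rotations 7 and 2, per the published test vector). It maps a 32-bit sequential ID to an opaque-looking 32-bit value and back; for anything real, use a vetted crypto library rather than hand-rolled round functions.

```python
# Toy Speck32/64 for obfuscating 32-bit sequential IDs.
MASK16 = 0xFFFF

def _ror(x, r):
    return ((x >> r) | (x << (16 - r))) & MASK16

def _rol(x, r):
    return ((x << r) | (x >> (16 - r))) & MASK16

def _round_keys(key):
    # key is a 64-bit int laid out as (l2, l1, l0, k0), high word first
    l = [(key >> 16) & MASK16, (key >> 32) & MASK16, (key >> 48) & MASK16]
    k = [key & MASK16]
    for i in range(21):
        l.append(((_ror(l[i], 7) + k[i]) & MASK16) ^ i)
        k.append(_rol(k[i], 2) ^ l[i + 3])
    return k

def encrypt_id(n, key):
    x, y = (n >> 16) & MASK16, n & MASK16
    for rk in _round_keys(key):
        x = ((_ror(x, 7) + y) & MASK16) ^ rk
        y = _rol(y, 2) ^ x
    return (x << 16) | y

def decrypt_id(n, key):
    x, y = (n >> 16) & MASK16, n & MASK16
    for rk in reversed(_round_keys(key)):
        y = _ror(y ^ x, 2)
        x = _rol(((x ^ rk) - y) & MASK16, 7)
    return (x << 16) | y
```

Because a block cipher is a permutation of the 32-bit space, distinct IDs always encrypt to distinct tokens, which is exactly why ECB is fine here (each ID is encrypted at most once).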
I was using 64-bit snowflake PKs (timestamp+sequence+random+datacenter+node) previously and made the switch to UUIDv7 for sortable, user-facing PKs. I'm more than fine letting the DB handle a 128-bit int instead of a 64-bit int if it means not having to make sure that the latest version of my snowflake function has made it to every DB, or that my snowflake server never hiccups, ever.
Most of the data that's going to be keyed with a uuid7 is getting served straight out of Redis anyway.
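A minimal UUIDv7 sketch following the RFC 9562 layout (48-bit Unix millisecond timestamp, version and variant bits, 74 random bits). Python 3.14 ships uuid.uuid7(); on earlier versions something like this works:

```python
import os
import time
import uuid

def uuid7() -> uuid.UUID:
    ms = time.time_ns() // 1_000_000                              # 48-bit ms timestamp
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF        # 12 random bits
    rand_b = int.from_bytes(os.urandom(8), "big") & 0x3FFFFFFFFFFFFFFF  # 62 random bits
    value = (
        ((ms & 0xFFFFFFFFFFFF) << 80)   # timestamp in the top 48 bits
        | (0x7 << 76)                   # version 7
        | (rand_a << 64)
        | (0x2 << 62)                   # RFC variant bits 10
        | rand_b
    )
    return uuid.UUID(int=value)
```

Because the timestamp occupies the most significant bits, values generated in different milliseconds sort in creation order, both as integers and as hex strings.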
We have all faced issues where we don't know where the data will ultimately live or what's optimal for our access patterns.
Or we have devices and services doing async operations that need to sync.
I’m not working on mission critical “if this fails there’s a catastrophic event” type shit. It’s just rent seeking SaaS type shit.
Oh no, it cost $0.35 extra to make $100. Next year we'll make more money relative to the cost increase.
> Misconceptions: UUIDs are secure
> One misconception about UUIDs is that they’re secure. However, the RFC describes that they shouldn’t be considered secure “capabilities.”
> From RFC 4122, Section 6, Security Considerations:
> Do not assume that UUIDs are hard to guess; they should not be used as security capabilities
This is just wrong, and the citation doesn't support it. You're not guessing a 122-bit long random identifier. What's crazy is that the article, immediately prior to this, even cites the very math involved in showing exactly how unguessable that is.
… the linked citation (to §4.4, which is different from the in-prose citation) is just about how to generate a v4, and completely unrelated to the claim. The prose citation to §6 is about UUIDs generally: the statement "Do not assume that [all] UUIDs are hard to guess" is not logically inconsistent with properly-generated UUIDv4s being hard to guess. A subset of UUIDs have security properties, if the system generating & using them implements those properties, but we should not assume all UUIDs have that property.
Moreover, replacing an unguessable UUID with an (effectively random) 32-bit integer does make it guessable, and the scheme laid out seems completely insecure if it is to be used in the contexts one finds UUIDv4s being an unguessable identifier.
The additional size argument is pretty weak too; at "millions of rows", a UUID column is consuming an additional ~24 MiB.
I would like to hear from others using, for example, Google Spanner: do you have issues with UUIDs? I don't for now; most optimizations happen at the controller level, and data transformation can be slow due to validations. Try to keep service logic as straightforward as possible.
The ability to rapidly shard everything can be extremely valuable. The difference between "we can shard on a dime" and "sharding will take a bunch of careful work" can be expensive. If the company has poor margins, this can be the difference between "can scale easily" and "we're not getting this investment".
I would argue that if your folks have the technical chops to be able to shard while avoiding surrogate guaranteed-unique keys, great. But if they don't... a UUID on every table can be a massive get-out-of-jail-free card, and for many companies this is much, much more important than some minor space and time optimizations on the DB.
Worth thinking about.
But that also adds complexity.
> Random values don’t have natural sorting like integers or lexicographic (dictionary) sorting like character strings. UUID v4s do have "byte ordering," but this has no useful meaning for how they’re accessed.
Might the author mean that random values are not sequential, so ordering them is inefficient? Of course random values can be ordered, and ordering by what he calls "byte ordering" is exactly how all integer ordering is done. And naive string ordering too, like we would do in the days before Unicode.

In our case, we don't want database IDs in an API or in URLs. When IDs are sequential, it enables things like dictionary attacks and provides estimates of how many customers we have.
Encrypting a database ID makes it very obvious when someone is trying to scan, because the UUID won't decrypt. We don't even need a database round trip.
0: https://bsky.app/profile/hugotunius.se/post/3m7wvfokrus2g
1: https://github.com/postgres/postgres/blob/master/src/backend...
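The comment above encrypts IDs so that forged values fail to decrypt. A stdlib-only sketch of the same "reject scans without a DB round trip" property, using an HMAC tag instead of encryption (note the difference: this authenticates the ID but does not hide the underlying number; all names and key material are hypothetical):

```python
import hashlib
import hmac

SECRET_KEY = b"app-secret-from-config"  # hypothetical key material

def encode_id(n: int) -> str:
    body = n.to_bytes(8, "big")
    tag = hmac.new(SECRET_KEY, body, hashlib.sha256).digest()[:8]
    return (body + tag).hex()

def decode_id(token: str):
    try:
        raw = bytes.fromhex(token)
    except ValueError:
        return None
    if len(raw) != 16:
        return None
    body, tag = raw[:8], raw[8:]
    expected = hmac.new(SECRET_KEY, body, hashlib.sha256).digest()[:8]
    if not hmac.compare_digest(tag, expected):
        return None  # forged/guessed ID: rejected before any DB work
    return int.from_bytes(body, "big")
```

Any scanner enumerating tokens gets None back long before the query layer, and the failed validations are easy to alert on.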
If you're using the latest version of PG, there is a plugin for it.
That's it.
Are there valid reasons to use UUIDs (assuming they're used correctly) for primary keys? I know systems have incorrectly exposed primary keys to the public, but assuming that's not the concern, why use a UUID over a big-int?
Aren't people using (big)ints as primary keys, and UUIDs as logical keys for import/export, solving portability across different machines?
(in the scientific reporting world this would be the perennial "in mice")
id => 123, public_id => 202cb962ac59075b964b07152d234b70
There are many ways to generate the public_id. A simple MD5 with a salt works quite well for extremely low effort.
Add a unique constraint on that column (which also indexes it), and you'll be safe and performant for hundreds of millions of rows!
Why do we developers like to overcomplicate things? ;)
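A sketch of the scheme described above; the example public_id shown is in fact just md5("123"), and a salt keeps outsiders from recomputing the mapping (the salt value here is illustrative):

```python
import hashlib

def public_id(internal_id: int, salt: str = "") -> str:
    # low-effort opaque public identifier; pair it with a unique,
    # indexed column so lookups stay fast
    return hashlib.md5(f"{salt}{internal_id}".encode()).hexdigest()
```

MD5 is fine here because the goal is opacity, not cryptographic strength, though anything guessable about the salt weakens it.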
Sometimes I have to talk to legacy systems. All my APIs have str IDs, and I encode int IDs as just the decimal, left-padded with leading zeros up to 26 chars. Technically not a compliant ULID, but practically speaking, if I see a leading `00` I know it's not an actual ULID, since that timestamp would be before Nov 2004, and ULID was invented in 2017. The ORM automatically strips the zeros and the query just works.
I'm just kind of over using sequential int IDs for anything bigger than hobby level stuff. Testing/fixturing/QA are just so much easier when you do not have to care about whether an ID happens to already exist.
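The zero-padding trick described above is tiny enough to sketch in full (helper names hypothetical):

```python
def int_id_to_str(n: int) -> str:
    # legacy integer ID, left-padded with zeros to ULID length (26 chars)
    return str(n).zfill(26)

def looks_like_legacy_id(s: str) -> bool:
    # real ULIDs encode a post-2017 timestamp; a leading "00" would mean
    # a pre-Nov-2004 timestamp, so it marks a padded integer instead
    return len(s) == 26 and s.startswith("00")
```

Any integer below 10^24 keeps the telltale "00" prefix, so the two ID kinds never collide in practice.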
The ability to know ahead of time what a primary key will be (in lieu of persisting it first, then returning) opened up a whole new world of architecting work in my app. It made a lot of previous awkward things feel natural.
The UUID is 128bits. The first 64bits are a java long. The last 64bits are a java long. Let's just combine the Tenant ID long with a Resource ID long to generate a unique id for this on our platform. (worked until it didn't).
Yes, there are different flavors of generating them with their own pros and cons, but at the end of the day it's just so much more elegant than some auto-incrementing crap your database creates. But that is just semantics; you can always change the UUID algorithm for future keys. And honestly, if you treat the UUID as some opaque entity (which you should), why not just pick the random one?
And I just thought of the argument, "but what if you want to sort the UUID..." Say it's used for a list of stories or something? Well, again... if you treat the UUID as opaque, why would you sort it? You should be sorting on some other field, like the date field or title or something. UUIDs are opaque, damn it. You don't sort opaque data. "Well, they get clustered weird," say people. Why are you clustering on a random opaque key? If you need certain data to be clustered, then do it on the right key (a user_id field if your data needs to be clustered by user, say).
Letting the client generate the primary keys is really liberating. Not having to care about PK collisions or leaking information via auto incrementing numbers is great!
In my opinion uuid isn’t used enough!
The power and main purpose of UUIDs is to act as easy to produce, non-conflicting references in distributed settings. Since the scope of TFA is explicitly set to be "monolithic web apps", nothing stops you from having everything work with bigint PKs internally, and just add the UUIDs where you need to provide external references to rows/objects.
Also is it necessary to show uuid at all to customers of an API? Or could it be a valid pattern to hide all the querying complexity behind named identifiers, even if it could cost a bit in terms of joining and indexing?
The context is the classic B2B SaaS, but feel free to share your experiences even if it comes from other scenarios!
"UUIDv4s only have 122 random bits" is a BS argument. Such a made-up problem.
But a fair point is that one should use a sequential uuid to avoid fragmentation. One that has a time part.
I don't agree with the absolute statement, though. Permanent identifiers should not generally carry data. There are situations where you want a way to reconcile, or you have space or speed constraints, so you may accept the trade-off, MD5 your data, and store it in a primary index as a UUID. Your index will fragment and thus you will vacuum, but life will still be good overall.
Definitely an issue, rarely the main one. You can work around integer PKs with composite keys or offset-based sharding schemes. What you can't easily fix is a schema with cross-tenant foreign keys, shared lookup tables, or a tenancy model that wasn't designed for data isolation from day one. Those are architectural decisions that require months of migration work.
UUIDs buy you flexibility, sure. But if your data model assumes everything lives in one database, the PK type is a sub-problem of your problems.
- If you use UUIDs as foreign keys to another table, it's obvious when you screw up a join condition by specifying the wrong indices. With int indices you can easily get plausible-looking results because your join will still return a bunch of data
- if you’re debugging and need to search logs, having a simple uuid string is nice for searching
Values of the same type can be sorted if an order is defined on the type.
It's also strange to contrast "random values" with "integers". You can generate random integers, and they have a "sorting" (depending on what that means though)
Accessing data in totally random locations can be a performance issue.
Depends on lots of things ofc but this is the concern when people talk about UUID for primary keys being an issue.
It's workload-specific, too. If you want to list ranges of them by PK, then of course random isn't going to work. But then you've got competing tensions: listing a range wants the things you list to be on the same shard, but focusing a workload on one shard undermines horizontal scale. So you've got to decide what you care about (or do something more elaborate).
UUIDv7 is very useful for these scenarios since
A: a hash or modulus of the key will be practically random, due to the lower bits being random or pseudo-random (i.e. it distributes well between nodes)
B: the first bits are sortable.. thus the underlying storage on each node won't go bananas.
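Both properties are easy to demonstrate with hand-assembled v7-style values (layout per RFC 9562; the timestamps and random payloads below are illustrative):

```python
import uuid

def make_v7_like(ms: int, rand: int) -> uuid.UUID:
    # 48-bit ms timestamp, version 7, RFC variant, random low 62 bits
    value = (
        ((ms & 0xFFFFFFFFFFFF) << 80)
        | (0x7 << 76)
        | (0x2 << 62)
        | (rand & ((1 << 62) - 1))
    )
    return uuid.UUID(int=value)

def shard_for(u: uuid.UUID, nodes: int) -> int:
    # (A) a modulus lands in the random low bits, spreading keys
    # evenly across nodes regardless of insertion time
    return u.int % nodes
```

Sorting the resulting UUIDs follows the timestamp prefix (property B), while the shard assignment follows the random tail (property A).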
Integer PKs were seen as fine for years - decades, even - before the rise of UUIDs.
Why?
https://cloud.google.com/blog/products/databases/announcing-...
If you take the gender example: for 99% of people it is male/female and it won't change, and you can use that for load balancing. But if, later, you find out that the gender is not the one you expected for that bucket, no big deal; it will cause a branch misprediction, but instead of happening 50% of the time as with a random value, it will only happen 1% of the time: a significant speedup with no loss in functionality.
To me, what your example really shows is the problem with incorrect default values, not a problem with encoding data into a key per se. If they'd chosen a non-date for unknown values, maybe 00 or 99 for the day or month components, then the issue you described would disappear.
But in this case, the intention of encoding a timestamp into a UUID isn't any implied meaning. It's to guarantee uniqueness, with a side effect that IDs are more or less monotonically increasing. Whether this is actually desirable depends on your application, but if the ID is used as an indexed key for insertion into a database, it's usually better for performance than a fully random ID, as it avoids rewriting lots of leaf nodes of B-trees. If you insert a load of such keys, they form a cluster on one side of the tree, which can then rebalance with only the top levels needing to be rewritten.
Specifically, if your database is small, the performance impact is probably not very noticeable. And if your database is large (eg. to the extent primary keys can't fit within 32-bit int), then you're actually going to have to think about sharding and making the system more distributed... and that's where UUID works better than auto-incrementing ints.
Random vs time biased uuids are not a decision to shave off ms that you will regret.
Most likely they will be a decision that shaves off seconds (yes, really - especially when you consider locality effects) and you'll regret nothing.
The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.
This is just another case of keys containing information and is not smart.
The obvious solution is to have a field that drives distribution, allowing rebalancing or whatever.
Now this doesn't work if you actually have enough data that the randomness of the UUIDv4 keys is a practical database performance issue, but I think you really have to think long and hard about every single use of identifiers in your application before concluding that v7 is the solution. Maybe v7 works well for some things (e.g identifiers for resources where creation times are visible to all with access to the resource) but not others (such as users or orgs which are publicly visible but without publicly visible creation times).
(What do you think Youtube video IDs are?)
* How do you manage the key for encrypting IDs? Injected to app environment via envvar? Just embedded in source code? I ask this because I'm curious as to how much "care" I should be putting in into managing the secret material if I were to adopt this scheme.
* Is the ID encrypted using AEAD scheme (e.g. AES-GCM)? Or does the plain AES suffice? I assume that the size of IDs would never exceed the block size of AES, but again, I'm not a cryptographer so not sure if it's safe to do so.
"Recommendation: Stick with sequences, integers, and big integers"
After that then, yes, UUIDv7 over UUIDv4.
This article is a little older; PostgreSQL didn't have native support then, so, yeah, you needed an extension. Today PostgreSQL 18 is released with UUIDv7 support... so the extension isn't necessary, though the extension does make the claim:
"[!NOTE] As of Postgres 18, there is a built in uuidv7() function, however it does not include all of the functionality below."
What those features are and if this extension adds more cruft in PostgreSQL 18 than value, I can't tell. But I expect that the vast majority of users just won't need it any more.
Original answer: because if you don't come up with these ints randomly, they are sequential, which can cause many unwanted situations where people can guess valid IDs and deduce things from that data. See https://en.wikipedia.org/wiki/German_tank_problem
UUIDv7 looks really promising but I'm not likely to redo all of our tables to use it.
Main reason I use it is the German Tank problem: https://en.wikipedia.org/wiki/German_tank_problem
(tl;dr: prevent someone from counting how many records you have in that table)
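The estimation the linked article describes is one line of arithmetic. Given a handful of sequential IDs leaked from a table, the classic minimum-variance unbiased estimator recovers the population size:

```python
def german_tank_estimate(serials):
    # estimate the total count N from k observed sequential serials:
    # N_hat = m + m/k - 1, where m is the largest serial observed
    m, k = max(serials), len(serials)
    return m + m / k - 1
```

With the textbook sample {19, 40, 42, 60}, the estimate is 74, which is exactly the kind of "how many customers do they have?" inference random IDs prevent.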
It would be the equivalent of "if you're a middle-aged man" or "you're an American".
P.S. I think some of the considerations may be true for any system that uses B-Tree indexes, but several will be Postgres specific.
> Norwegian PNs have your birth date (in DDMMYY format) as the first six digits
So presumably the format is DDMMYYXXXXX (for some arbitrary number of X's), where the X's represent e.g. an automatically incrementing number of some kind?
Which means that if it's DDMMYYXXX then you can only have 1000 people born on DDMMYY, and if it's DDMMYYXXXXX then you can have 100,000 people born on DDMMYY.
So in order for there to be so many such entries in common that people are denied use of their actual birthday, then one of the following must be true:
1. The XXX counter must be extremely small, in order for it to run out as a result of people 'using up' those Jan 1 dates each year
2. The number of people born on Jan 1 or immigrating to Norway without knowledge of their birthday must be colossal
If it was just DDMMXXXXX (no year) then I can see how this system would fall apart rapidly, but when you're dealing with specifically "people born on Jan 1 2014 or who immigrated to Norway and didn't know their birthday and were born on/around 2014 so that was the year chosen" I'm not sure how that becomes a sufficiently large number to cause these issues. Perhaps this only occurs in specific years where huge numbers of poorly-documented refugees are accepted?
(Happy to be educated, as I must be missing something here)
As long as you're not in China or India around specific years ...
GP's point stands strong.
* I think there is a wide range in the middle where your database can fit on one machine if you do it well, but it's worth optimizing to use a cheaper machine and/or extend the time until you need to switch to a distributed db. You might hit this middle range soon enough (and/or it might be a painful enough transition) that it's worth thinking about it ahead of time.
* If/when you do switch to a distributed database, you don't always need to rekey everything:
** You can spread existing keys across shards via hashing on lookup or reversing bits. Some databases (e.g. DynamoDB) actually force this.
** Allocating new ids in the old way could be a big problem, but there are ways out. You might be able to switch allocation schemes entirely without clients noticing if your external keys are sufficiently opaque. If you went with UUIDv7 (which addresses some but not all of the article's points), you can just keep using it. If you want to keep using dense(-ish), (mostly-)sequential bigints, you can amortize the latency by reserving blocks at a time.
well, till you run out of numbers for the immigrants that don't have an exact birth date
> Norwegian PNs have your birth date (in DDMMYY format) as the first six digits.
You can already feel the disaster rising because some program always expects the latter.
And it doesn’t fix the problem, it just makes it less likely.
I totally used uuidv7s as "inserted at" in a small project and I had methods to find records created between two timestamps that literally converted timestamps to uuidv7 values so I could do "WHERE id BETWEEN a AND b"
So, random library: https://pkg.go.dev/github.com/google/uuid#UUID.Time
> Time returns the time in 100s of nanoseconds since 15 Oct 1582 encoded in uuid. The time is only defined for version 1, 2, 6 and 7 UUIDs.
Hyrum's Law suggests that someone will.
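Extracting the timestamp and building the BETWEEN bounds described above is mostly bit-shifting (helper names hypothetical; the bounds only need to compare correctly, they aren't meant to be dereferenced as real rows):

```python
import uuid

def uuid7_unix_ms(u: uuid.UUID) -> int:
    # the top 48 bits of a v7 value are a Unix millisecond timestamp
    return u.int >> 80

def uuid7_range(start_ms: int, end_ms: int):
    # inclusive bounds for WHERE id BETWEEN lo AND hi: lo zeroes the
    # random payload, hi saturates it, both keep the v7 version nibble
    lo = uuid.UUID(int=((start_ms & 0xFFFFFFFFFFFF) << 80) | (0x7 << 76))
    hi = uuid.UUID(int=((end_ms & 0xFFFFFFFFFFFF) << 80) | (0x7 << 76) | ((1 << 76) - 1))
    return lo, hi
```

Per Hyrum's Law, once such helpers exist the "IDs double as timestamps" behavior becomes an API someone depends on, for better or worse.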
You need it, because it's maybe the one lone unchangeable thing. Taking a person for example:

* date of birth can be changed, if there was an error and a correction in documents
* any and nearly all existing physical characteristics can change over time, either due to brain things (deciding to change gender), aging, or accidents (fingerprints no longer apply if you burnt your skin enough)
* DNA might be good enough, but that's one fucking long identifier to share and one hard to validate in the field.
So a unique ID attached to a few other parts to identify the current iteration of an individual is the best we have, and the best we will get.
As a consumer of these databases we're stuck with them as designed, which means we have to worry about key distribution.
Of course it's not always the case that this is bad. For example, if you have a lot of relations, you can have only one table with the UUID field (and thus the expensive index), while the relations use the more efficient int key. For instance, a user entity has both int and UUID keys, and user attributes reference the user by the int key, at the expense of a join if you need to retrieve one user attribute without also retrieving the user.
One more reason to stay away from microservices, if possible.
If you use UUIDv7, you can partition your table by the key prefix. Then the bulk of your data can be efficiently skipped when applying updates.
Just the other day I delivered significant performance gains to a client by converting ~150 million UUIDv4 PKs to good old BIGINT. They were using a fairly recent version of MariaDB.
I think IDs should not carry information. Yes, that also means I think UUIDv7 was wrong to squeeze a creation date into their ID.
Isn't that clear enough?
Maybe the answer is to evenly spread the defaults over 365 days.
YY.MM.DD-AAA.BB
In either the AAA or BB component there is something about the gender.
But it does mean that there is a limit of people born per day of a certain gender.
But for a given year, using a moniker will only delay the inevitable. Sure, there are more numbers, but still a limited amount, as there are SOME parts that need to reflect reality: year, gender (if that's still the case?), etc.
You still have that problem from organic birthdays and also the problem of needing to change ids to correct birth dates.
You can choose to never make use of that property. But it's tempting.
The identifier is still connected to the user's data, just through the appropriate other fields in the table as opposed to embedded into the identifier itself.
> So, what happens next is that the real world tries to adjust and the "data-less" identifier becomes a real world artifact. The situation becomes the same but worse (eg. you don't exist if you don't remember your social security id). In extreme cases people are tattooed with their numbers.
Using a random UUID as primary key does not mean users have to memorize that UUID. In fact in most cases I don't think there's much reason for it to even be exposed to the user at all.
You can still look up their data from their current email or phone number, for instance. Indexes are not limited to the primary key.
> The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.
A fully random primary key takes into account that things change - since it's not embedding any real-world information. That said I also don't think there's much issue with embedding creation time in the UUID for performance reasons, as the article is suggesting.
What? The first 48 bits of a UUIDv7 are a Unix timestamp.
Whether or not this is a meaningful problem or a benefit to any particular use of UUIDs requires thinking about it; in some cases it’s not to be taken lightly and in others it doesn’t matter at all.
I see what you’re getting at, that ignoring the timestamp aspect makes them “just better UUIDs,” but this ignores security implications and the temptation to partition by high bits (timestamp).
The surrogate key's purpose isn't to directly store the natural key's information, rather, it's to provide an index to it.
> The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.
There isn't 'another' - there's just one. The surrogate key. The other pieces of information you're describing are not the means of indexing the data. They are the pieces of data you wish to retrieve.
I think artificial and data-less identifiers are the better means of identification that takes into account that things change. They don't have to be the identifier you present to the world, but having them is very useful.
E.g. phone numbers are semi-common identifiers now, but phone numbers change owners for reasons outside of your control. If you use them as an internal identifier, changing them between accounts gets very messy because now you don't have an identifier for the person who used to have that phone number.
It's much cleaner and easier to adapt if each person gets an internal context-less identifier and you use their phone number to convert from their external ID/phone number to an internal ID. The old account still has an identifier, there's just no external identifier that translates to it. Likewise if you have to change your identifier scheme, you can have multiple external IDs that translate to the same internal ID (i.e. you can resolve both their old ID and their new ID to the same internal ID without insanity in the schema).
[0] https://www.cybertec-postgresql.com/en/unexpected-downsides-...
Just to complement this with a point: there isn't any mainstream database management system out there that is distributed in the sense that it requires UUIDs to generate its internal keys.

There exist some you can find on the internet, and some institutions have internal systems that behave this way. But as a near-universal rule, the thing people know as a "database" isn't distributed in this sense, and if column creation is done inside the database, you don't need them.
I have slightly different goals for my version. I want everything to fit in 128 bits so I'm sacrificing some of the random bits, I'm also making sure the representation inside Postgres is also exactly 128 bits. My initial version ended up using CBOR encoding and being 160 bits.
Mine dedicates 16 bits for the prefix allowing up to 3 characters (a-z alphabet).
We're not worried about key compromises.
If the key is lost, we have much bigger problems.
The same way we manage all other secrets in the application. (Summarized below)
> Is the ID encrypted using AEAD scheme (e.g. AES-GCM)? Or does the plain AES suffice? I assume that the size of IDs would never exceed the block size of AES, but again, I'm not a cryptographer so not sure if it's safe to do so.
I don't have the source handy at the moment. It's one of the easier-to-use symmetric algorithms available in .NET. We aren't talking military-grade security here. In general: a 32-bit int encrypts to 64 bits, so we pad it with a few Unicode characters so that the 64 bits encrypt to 128 bits.
---
As far as managing secrets in the application: We have a homegrown configuration file generator that's adapted to our needs. It generates both the configuration files, and strongly-typed classes to read the files. All configuration values are loaded at startup, so we don't have to worry about runtime errors from missing configuration values.
Secrets (connection strings, encryption keys, etc.) are encrypted in the configuration file as base64 strings. The certificate to read/write secrets is stored in Azure Key Vault.
The startup logic in all applications is something like:
1: Determine the environment (production, qa, dev)
2: Get the appropriate certificate
3: Read the configuration files, including decrypting secrets (such as the primary key encryption keys) from the configuration files
4: Populate the strongly-typed objects that hold the configuration values
5: These objects are dependency-injected to runtime objects
Balanced and uniformly scattered. A random index means fetching a random page for every item. Fine if your access patterns are truly random, but that's rarely the case.
> Why are you clustering on a random opaque key?
InnoDB clusters by the PK if there is one, and that can't be changed (if you don't have a PK, you have some options, but let's assume you have one). MSSQL behaves similarly, but you can override it. If your PK is random, your clustering will be too. In Postgres, you'll just get fragmented indexes, which isn't quite as bad, but still slows down vacuum. Whether that actually becomes a problem is also going to depend on access patterns.
One shouldn't immediately freak out over having a random PK, but should definitely at least be aware of the potential degradation it might cause.
- A client used to run our app on-premises and now wants to migrate to the cloud.
- Support engineers want to clone a client’s account into the dev environment to debug issues without corrupting client data.
- A client wants to migrate their account to a different region (from US to EU).
Merging data using UUIDs is very easy because ID collisions are practically impossible. With integer IDs, we'd need complex and error-prone ID-rewriting scripts. UUIDs are extremely useful even when the tables are small, contrary to what the article suggests.
I've read people suggest using a UUIDv7 as the primary key and a UUIDv4 as a user-visible one as a remedy.
My first thought when reading the suggestion was, "well, but you'll still need an index on the v4 IDs, so what does this actually get you?" But the answer is that it makes joins less expensive; you only hit that index once, when resolving the user-supplied ID while constructing the query, and everything else operates on the better-for-performance v7 IDs.
To be clear, in a practical sense, this is a bit of a micro-optimization; as far as I understand it, this really only helps you by improving the data locality of temporally-related items. So, for example, if you had an "order items" table, containing rows of a bunch of items in an order, it would speed up retrieval times because you wouldn't need to do as many index traversals to access all of the items in a particular order. But on, say, a users table (where you're unlikely to be querying for two different users who happen to have been created at approximately the same time), it's not going to help you much. Of course the exact same critique is applicable to integer IDs in those situations.
Although, come to think of it, another advantage of a user-visible v4 with a v7 PK is that you could use a different index type on the v4 ID. Specifically, I would think that a hash index for the user-visible v4 might be a halfway-decent way to go.
I'm still not sure either way if I like the idea, but it's certainly not the craziest thing I've ever heard.
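To make the scheme concrete, here's a hand-rolled minimal UUIDv7 generator following the RFC bit layout, paired with a random v4 public ID. This is an illustrative sketch, not a reference implementation (only very recent Python versions are starting to ship a built-in `uuid7`):

```python
import os
import time
import uuid

def uuid7() -> uuid.UUID:
    """Minimal UUIDv7: 48-bit Unix ms timestamp, version/variant bits, 74 random bits."""
    ts_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")   # 80 random bits
    value = (ts_ms & ((1 << 48) - 1)) << 80        # timestamp in the top 48 bits
    value |= 0x7 << 76                             # version = 7
    value |= (rand >> 68) << 64                    # 12 random bits (rand_a)
    value |= 0b10 << 62                            # RFC 4122 variant
    value |= rand & ((1 << 62) - 1)                # 62 random bits (rand_b)
    return uuid.UUID(int=value)

internal_pk = uuid7()      # time-ordered: friendlier to B-Tree inserts and joins
public_id = uuid.uuid4()   # fully random: reveals nothing to users

earlier = uuid7()
time.sleep(0.002)          # force a later millisecond
later = uuid7()            # earlier < later: v7 values sort by creation time
```

The v4 is what leaves the system; the v7 is what every join and foreign key uses internally.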
But the author does not say timestamp ordering, he says ordering. I think he actually means and believes that there is some problem ordering UUIDv4.
Yes of course everyone should check and unit test that every object is owned by the user or account loading it, but demanding more sophistication from an attacker than taking "/my_things/23" and loading "/my_things/24" is a big win.
See perhaps "UUIDv47 — UUIDv7-in / UUIDv4-out (SipHash‑masked timestamp)":
* https://github.com/stateless-me/uuidv47
* Sept 2025: https://news.ycombinator.com/item?id=45275973
Auto-incrementing keys can work, but what happens when you run out of integers? Also, distributed DBs probably make this hard, and clients can't generate a key themselves.
There must be something in Postgres that wants to store the records in PK order, which could be an okay default, but I'm pretty sure you can change this behavior, as it isn't great for write-heavy workloads.
I actually have no idea. What are they?
(Also what is the format of their `si=...` thing?)
I shared this article a few weeks ago, discussing the problems with this kind of approach: https://notnotp.com/notes/do-not-encrypt-ids/
I believe it can make sense in some situations, but do you really want to implement such crypto-related complexity?
So far, people have talked a lot about UUIDs, so I'm genuinely curious about what's in-between.
I don't think I should ignore what I already know and intentionally pessimize the first draft in the name of avoiding premature optimization.
This is the key point, I think. Searching is not the same as identifying.
If the domain is modeling something like external events (in my case), and that external timestamp is packed into your primary key, and you support receiving events out of chronological order, then it just follows that you might insert stuff earlier than your latest record.
You're gonna have problems "backdating" if you mix up time of insertion with when the event you model actually occurred. Like if you treat those as the same thing when they aren't.
This way you avoid most of the issues highlighted in this article, without compromising your confidential data.
Consider, say, weather hardware: 5 stations all feeding into a central database. They're all creating rows and uploading them. Using sequential integers for that is unnecessarily complex (if it's even possible).
Given the amount of data created on phones and tablets, this affects more situations than first assumed.
It's also very helpful in export/edit/update situations. If I export a subset of the data (let's say to Excel), the user can edit all the other columns and I can safely import the result. With integer IDs they might change the ID field (which would be bad). With UUIDs they can still change it, but I can ignore that row (or the whole file) because whatever they changed it to will be invalid.
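A sketch of that import-side check, with hypothetical row and ID names:

```python
import uuid

# Known-good IDs as exported (in practice, fetched back from the database).
exported_ids = {uuid.uuid4() for _ in range(3)}

def safe_rows(imported_rows):
    """Yield only rows whose ID parses as a UUID we actually exported."""
    for row in imported_rows:
        try:
            row_id = uuid.UUID(row["id"])
        except ValueError:
            continue                      # user mangled the ID column: skip the row
        if row_id in exported_ids:
            yield row

rows = [{"id": str(i), "name": "edited"} for i in exported_ids]
rows.append({"id": "12345", "name": "tampered"})  # integer-style edit: not a UUID
clean = list(safe_rows(rows))                     # tampered row is dropped
```

An edited integer ID would usually still be a valid integer, so the same check can't catch it; an edited UUID almost never parses or matches.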
If you separate them (i.e. microservices) then they no longer try to use one DB.
If they can live with MariaDB, OK, but I wouldn't choose that in the first place these days. Postgres will likely also perform better in most scenarios.
A running number also carries data. Before you know it, someone's relying on the ordering or counting on there not being gaps - or counting the gaps to figure out something they shouldn't.
So, the fact that there is a date in the uuidv7 does not extend any meaning or significance to the record outside of the database. To infer such a relationship where none exists is the error.
If you only know the birth year and keyed 99 as the month for unknown, then your algorithm would determine they were of a correct age on the start of the year after that was true, which I guess would be what you want for legal compliance.
If you don't even know if the birth year is correct, then the correct process depends on policy. Maybe they choose any year, maybe they choose the oldest/youngest year they might be, maybe they just encode that as 0000/9999.
Again, if you don't know the birth year of someone, you would have no way of knowing their age. I'm not sure that means that the general policy of putting a birthday into their ID number is flawed.
Many governments re-issue national IDs to the same person with different numbers, which is far less problematic than the many governments who issue the same national ID to multiple individuals (looking at you, USA, with your SSN). It doesn't seem like a massive imposition for a person who was originally issued an ID without a known birthday to be re-issued a new ID once their birthday was ascertained. Perhaps even give them a choice: keep the old one knowing it will cause problems, or take the new one and accept the responsibility of telling people their number has changed.
Presumably the governments that choose to embed the date into a national ID number do so because it's more useful for their purposes to do so than just assigning everyone a random number.
All you need is a guaranteed non-decreasing 48-bit number. A clock is one way to generate it, but I don't see why a UUIDv7 would become invalid if your clock is biased, runs too fast, too slow, or whatever. I would not count on the first 48 bits being a "real" timestamp.
If you know someone's birth date and gender, the INSZ is almost certainly 1 of 500 numbers, with a heavy skew toward the lower AAA values. Luckily, you can't do much damage with someone's number, unlike a US SSN (but I'd still treat it as confidential).
I guess you can assign 3-4 bits for identifier version number as well.
And yes - for long living data dealing with compatibility issues is inevitable so you have to take that into account from the very beginning.
So what is such an identifier for? Is it only for some technical purposes (like replication etc.)?
Why bother with UUID at all then for internal identifiers? Sequence number should be enough.
If the only reason you need a surrogate key is to introduce indirection in your internal database design then sequence numbers are enough. There is no need to use UUIDs.
The whole discussion is about externally visible identifiers (ie. identifiers visible to external software, potentially used as a persistent long-term reference to your data).
> E.g. phone numbers are semi-common identifiers now, but phone numbers change owners for reasons outside of your control. If you use them as an internal identifier, changing them between accounts gets very messy because now you don't have an identifier for the person who used to have that phone number.
Introducing surrogate keys (regardless of whether UUIDs or anything else) does not solve any problem in reality. When I come to you and say "My name is X, this is my phone number, this is my e-mail, I want my GDPR records deleted", you still need to be able to find all data that is related to me. Surrogate keys don't help here at all. You either have to be able to solve this issue in the database or you need to have an oracle (ie. a person) that must decide ad-hoc what piece of data is identified by the information I provided.
The key issue here is that you try to model identifiable "entities" in your data model, while it is much better to model "captured information".
So in your example there is no "person" identified by "phone number" but rather "at timestamp X we captured information about a person at the time named Y and using phone number Z". Once you start thinking about your database as structured storage of facts that you can use to infer conclusions, there is much less need for surrogate keys.
That's my whole point: either X becomes a "real world artifact" or it is useless as identifier.
In my personal design sense, I have found that keeping away from generality actually helps my code last longer (it rests on more concrete ideas) and makes it easier to change when those days come.
Over the last decade, the databases I've worked with that used UUID Version 4 [1] as the primary key data type have usually had bad performance and excessive IO.
UUID is a native data type in Postgres stored as binary data. Various UUID versions are in the RFC. Version 4 has mostly random bits, obfuscating information like when the value was created or where it was generated.
Version 4 UUIDs are easy to generate in Postgres using the gen_random_uuid() [2] function, available since version 13 (released in 2020).
I’ve learned there are misconceptions about UUID Version 4, and sometimes these are the reasons users pick this data type.
Because of the poor performance, misconceptions, and available alternatives, I’ve come around to a simple position: Avoid UUID Version 4 for primary keys.
My more controversial take is to avoid UUIDs in general, but I understand there are some legitimate reasons for them without practical alternatives.
As a database enthusiast, I wanted to have an articulated position on this classic “Integer v. UUID” debate.
Among database folks, debating this may be tired and clichéd. However, from my consulting work, I can say I still work with databases using UUID v4 in 2024 and 2025, and still see the issues discussed in this post.
Let’s dig in.
[Image: uuid data type in Postgres]
Although previously pulled from Postgres 17, and unreleased as of this writing, UUID v7 support is part of Postgres 18 [4], scheduled for release in the fall of 2025.
What kind of app databases are in scope for this post?
The kinds of web applications I’m thinking of with this post are monolithic web apps, with Postgres as their primary OLTP database. The apps could be in categories like social media, e-commerce, click tracking, or business process automation apps.
The types of performance issues discussed here are related to inefficient storage and retrieval, meaning they happen for all of these types of apps.
What’s the core issue with UUID v4?
The core issue with UUID Version 4, given that the 122 bits they're made up of are "random or pseudo-randomly generated values" [1], is how the values are maintained in indexes. Since primary keys are backed by indexes by default, each insert is less efficient compared with inserts of sequentially ordered values.
For lookups, each update and delete, whether for individual items or for ranges of items, is less efficient, due to increased traversal of non-sequential index pages in Postgres.
Since the randomly generated values aren't inserted sequentially (or into sequential/adjacent pages), it's less efficient to find them later for updates or deletes. Each of these workload types uses the primary key index.
UUID v4s don't have a useful natural ordering that aligns with how they're stored, and thus both storage and retrieval are less efficient.
Later in the post we’ll look at just how many more Postgres pages need to be accessed for equivalent data, and what that means in terms of performance.
Despite the inefficiencies, UUID v4s, and UUIDs in general, remain popular (or at least were over the last decade), based on my experience consulting on Postgres.
Given the popularity, what use cases for UUID are there?
One use case for UUIDs is when there's a need to generate an identifier on a client or from multiple services, which is then passed to Postgres for persistence.
For web apps, generally they instantiate objects in memory and don’t expect an identifier to be used for lookups until after an instance is persisted as a row (where the database generates the identifier).
In a microservices architecture where the apps have their own databases, the ability to generate identifiers from each database without collisions is a use case for UUIDs. Unlike an integer, a UUID could also later identify which database a value came from.
For collision avoidance (see HN discussion [5]), we can't practically make the same guarantee with sequence-backed integers. There are hacks, like generating even and odd integers between two instances, or using different ranges of the int8 range.
There are also alternative identifiers like composite primary keys (CPKs); however, the same set of two values wouldn't uniquely identify a particular table.
The avoidance of collisions is described this way on Wikipedia [6]:
The number of random version-4 UUIDs which need to be generated in order to have a 50% probability of one collision: 2.71 quintillion
This number would be equivalent to:
Generating 1 billion UUIDs per second for about 86 years.
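The 2.71 quintillion figure follows from the birthday bound; a quick back-of-the-envelope check:

```python
import math

# Birthday bound: ~50% collision odds after sqrt(2 * N * ln 2) draws from
# N equally likely values; UUIDv4 has 122 random bits, so N = 2^122.
n_values = 2 ** 122
draws_for_half = math.sqrt(2 * n_values * math.log(2))
years = draws_for_half / 1e9 / (365.25 * 24 * 3600)

print(f"{draws_for_half:.2e} UUIDs")                   # ≈ 2.71e+18, i.e. 2.71 quintillion
print(f"{years:.0f} years at 1 billion UUIDs/second")  # ≈ 86 years
```

Both of Wikipedia's numbers drop out directly.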
Are UUIDs secure?
One misconception about UUIDs is that they’re secure. However, the RFC describes that they shouldn’t be considered secure “capabilities.”
From RFC 4122 [1], Section 6, Security Considerations:
Do not assume that UUIDs are hard to guess; they should not be used as security capabilities
How can we create obfuscated codes from integers?
While UUID V4s obfuscate their creation time, the values can’t be ordered to see when they were created relative to each other. We can achieve those properties with integers with a little more work.
One option is to generate a pseudo-random code from an integer, then use that value externally, while still using integers internally.
To see the full details of this solution, please check out: Short alphanumeric pseudo random identifiers in Postgres [7]
We’ll summarize it here.
The obfuscated ID is stored in a generated column. Reviewing the generated values, they look similar, but they aren't ordered by their creation order.
The values in insertion order were 01Y9I, 01Y9L, then 01Y9K.
With alphabetical order, the last two would be flipped: 01Y9I first, then 01Y9K second, then 01Y9L third, sorting on the fifth character.
If I wanted to use this approach for all tables, I’d try a centralized table that was polymorphic, storing a record for each table that’s using a code (and a foreign key constraint).
That way I’d know where the code was used.
Why else might we want to skip UUIDs?
UUIDs are 16 bytes (128 bits) per value, which is double the space of bigint (8 bytes), or quadruple the space of 4-byte integers. This extra space adds up once many tables have millions of rows, and copies of a database are being moved around as backups and restores.
A more considerable impact to performance though is the poor characteristics of writing and reading random data into indexes.
For random UUID v4s, Postgres incurs more latency for every insert operation.
For integer primary key rows, their values are maintained in index pages with “append-mostly” operations on “leaf nodes,” since their values are orderable, and since B-Tree indexes store entries in sorted order.
For UUID v4s, primary key values in B-Tree indexes are problematic.
Inserts are not appended to the right most leaf page. They are placed into a random page, and that could be mid-page or an already-full page, causing a page split that would have been unnecessary with an integer.
PlanetScale has a nice visualization of index page splits and rebalancing [8].
Unnecessary splits and rebalancing add space consumption and processing latency to write operations. This extra IO shows up in Write Ahead Log (WAL) generation as well.
Buildkite reported a 50% reduction in write IO for the WAL by moving to time-ordered UUIDs.
Given fixed size pages, we want high density within the pages. Later on we’ll use pageinspect to check the average leaf density between integer and UUID to help compare the two.
B-Tree page layout means you can fit fewer UUIDs per 8KB page. Since we have the limitation of fixed page sizes, we at least want them to be as densely packed as possible.
Since UUID indexes are ~40% larger in leaf pages than bigint (int8) for the same logical number of rows, they can't be as densely packed with values. As Lukas says, "All in all, the physical data structure matters as much as your server configuration to achieve the best I/O performance in Postgres." [9]
This means that for individual lookups, range scans, or UPDATES, we will incur ~40% more I/O on UUID indexes, as more pages are scanned. Remember that even to access one row, in Postgres the whole page is accessed where the row is, and copied into a shared memory buffer.
Let’s insert and query data and take a look at numbers between these data types.
Let’s create integer, UUID v4, and UUID v7 fields, index them, load them into the buffer cache with pg_prewarm.
I will use the schema examples from the Cybertec post Unexpected downsides of UUID keys in PostgreSQL by Ants Aasma.
View andyatkinson/pg_scripts PR #20.
On my Mac, I compiled the pg_uuidv7 extension. Once compiled and enabled for Postgres, I could use the extension functions to generate UUID V7 values.
Another extension, pg_prewarm, is used. It's a module included with Postgres, so it just needs to be enabled per database where it's used.
The difference in latency and the enormous difference in buffers from the post was reproducible in my testing.
“Holy behemoth buffer count batman” - Ants Aasma
Cybertec post results:
[Image: bigint column results]
Since these are buffer hits, we're accessing the pages from memory, which is faster than disk. We can then focus only on the difference in latency based on the data types.
How many more pages are accessed for the UUID index? 8,535,628 (8.5 million!) more 8KB pages were accessed, a 31229.4% increase. In terms of MB and MB/s that is:
Calculating a low and high estimate of access speeds for memory:
Accessing 68.3 GB of data from memory (shared_buffers in PostgreSQL) would add:
That’s between ~1 and ~3.4 seconds of additional latency solely based on the data type. Here we used 10 million rows and performed 1 million updates, but the latencies will get worse as data and query volumes increase.
We can inspect the average fill percentage (density) of leaf pages using the pageinspect extension.
The uuid_experiments/page_density.sql (andyatkinson/pg_scripts PR #20) query in the repo gets the indexes for the integer and v4 and v7 uuid columns, their total page counts, their page stats, and the number of leaf pages.
Using the leaf pages, the query calculates an average fill percentage.
After performing the 1 million updates on the 10 million rows mentioned in the example, I got these results from that query:
idxname | avg_leaf_fill_percent
---------------------+-----------------------
records_id_idx | 97.64
records_uuid_v4_idx | 79.06
records_uuid_v7_idx | 90.09
(3 rows)
This shows the integer index had an average fill percentage of nearly 98%, while the UUID v4 index was around 79%.
The Postgres buffer cache is a critical part of good performance.
For good performance, we want our queries to produce cache “hits” as much as possible.
The buffer cache has limited space. Usually 25-40% of system memory is allocated to it, and the total database size including table and index data is usually much larger than that amount of memory. That means we’ll have trade-offs, as all data will not fit into system memory. This is where the challenges come in!
When pages are accessed they're copied into the buffer cache as buffers. When write operations happen, buffers are dirtied before being flushed [10].
Since the UUIDs are randomly located, additional buffers will need to be copied into the cache compared with ordered integers. Buffers that are still needed might be evicted to make space, decreasing hit rates.
Since the tables and indexes are more likely to be fragmented, it makes sense to rebuild the tables and indexes periodically.
Rebuilding tables can be done using pg_repack, pg_squeeze, or VACUUM FULL if you can afford to perform the operation offline.
Indexes can be rebuilt online using REINDEX CONCURRENTLY.
While the data will be newly laid out in pages, it still won't have correlation, and thus won't be smaller. The space formerly occupied by deleted rows will be reclaimed for reuse, though.
If possible, size your primary instance to have 4x the amount of memory of your database size. In other words, if your database is 25GB, try to run a 128GB memory instance.
This gives around 32GB to 50GB of memory for buffer cache (shared_buffers) which is hopefully enough to store all accessed pages and index entries.
Use pg_buffercache [11] to inspect the contents, and pg_prewarm [12] to populate tables into it.
One tactic I’ve used when working with UUID v4 random values where sorting is happening, is to provide more memory to sort operations.
To do that in Postgres, we can change the work_mem setting. This setting can be changed for the whole database, a session, or even for individual queries.
Check out Configuring work_mem in Postgres on PgMustard for an example of setting this in a session.
Since Rails 6, we can control implicit_order_column [13]. The database_consistency gem even has a checker for folks using UUID primary keys.
When ORDER BY is generated in queries implicitly, it may be worth ordering on a different high cardinality field that’s indexed, like a created_at timestamp field.
Cluster on a column that’s high cardinality and indexed could be a mitigation option.
For example, imagine your UUID primary table has a created_at timestamp column that’s indexed with idx_on_tbl_created_at, and clustering on that.
CLUSTER table_with_uuid_ok USING idx_on_tbl_created_at;
I don’t see CLUSTER used ever really though as it takes an access exclusive lock. The CLUSTER is a one-time operation that would also need to be repeated regularly to maintain its benefits.
For new databases that may be small, with unknown growth, I recommend plain old integers and an identity column (backed by a sequence) [14] for primary keys. These are signed 32-bit (4-byte) values. This provides about 2 billion positive unique values per table.
For many business apps, they will never reach 2 billion unique values per table, so this will be adequate for their entire life. I’ve also recommended always using bigint/int8 in other contexts.
I guess it comes down to what you know about your data size and how you can project growth. There are plenty of low-growth business apps out there, in constrained industries, with constrained sets of business users.
For Internet-facing consumer apps with expected high growth, like social media, click tracking, sensor data, telemetry collection types of apps, or when migrating an existing medium or large database with 100s of millions or billions of rows, then it makes sense to start with bigint (int8), 64-bit, 8-byte integer primary keys.
Since Postgres 18 is not yet released, generating UUID V7s now in Postgres is possible using the pg_uuidv7 extension.
If you have an existing UUID v4 filled database and can’t afford a costly migration to another primary key data type, then starting to populate new values using UUID v7 will help somewhat.
Fortunately the binary uuid data type in Postgres can be used whether you’re storing V4 or V7 UUID values.
Another alternative that relies on an extension is sequential_uuids [15].
Do you see any errors or have any suggested improvements? Please contact me. Thanks for reading!
- https://datatracker.ietf.org/doc/html/rfc4122#section-4.4
- https://www.postgresql.org/docs/current/functions-uuid.html
- https://en.wikipedia.org/wiki/Universally_unique_identifier
- https://andyatkinson.com/generating-short-alphanumeric-public-id-postgres
- https://planetscale.com/blog/the-problem-with-using-a-uuid-primary-key-in-mysql
- https://stringintech.github.io/blog/p/postgresql-buffer-cache-a-practical-guide/
- https://www.postgresql.org/docs/current/pgbuffercache.html
- https://www.bigbinary.com/blog/rails-6-adds-implicit_order_column
- https://www.cybertec-postgresql.com/en/uuid-serial-or-identity-columns-for-postgresql-auto-generated-primary-keys/
Edit: just saw your edit, sounds like we're on the same page!
For Aurora MySQL, it just makes it worse either way, since there’s no change buffer.
This doesn’t even rely on your system’s built-in RNG being low quality. It could be audited and known to avoid such issues but you could have a compromised compiler or OS that injects a doctored RNG.
But if you need UUID-based lookup, then you might as well have it as a primary key, as that will save you an extra index on the actual primary key. If you also need a date and the remaining bits in UUIDv7 suffice for randomness, then that is a good option too (though this does essentially amount to having a composite column made up of datetime and randomness).
UUIDv47 might have a space if you need keys generated on multiple backend servers without synchronization. But it feels very niche to me.
That being said, while fine for obfuscation, it should not be used for security, e.g. for hidden/unlisted links, confirmation links, and so on. Those should use actual, long-ish random keys for access, because the inability to enumerate them is a security feature.
Edit: What the article said: > The kinds of web applications I’m thinking of with this post are monolithic web apps, with Postgres as their primary OLTP database.
So you are correct that this does not disqualify distributed databases.
Another example would be a function that sorts the numbers 0 through 999 in a seemingly random (but actually deterministic) order, and then repeats that for each block of 1000 with a slight shift. This discourages casual numeric iteration but isn't as complex or cryptographically secure as a UUID.
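A sketch of that idea (the seed and the shift rule are arbitrary choices of mine):

```python
import random

BLOCK = 1000
# One fixed, seeded shuffle of 0..999; 1729 is an arbitrary hypothetical seed.
_perm = list(range(BLOCK))
random.Random(1729).shuffle(_perm)

def scramble(n: int) -> int:
    """Deterministically scramble n within its block of 1000, rotating the
    permutation by the block index so blocks don't repeat exactly."""
    block, offset = divmod(n, BLOCK)
    return block * BLOCK + _perm[(offset + block) % BLOCK]

# Still a bijection: every ID maps to exactly one scrambled ID in its block.
scrambled = {scramble(n) for n in range(5 * BLOCK)}
```

Each block maps onto itself through a rotated copy of the same permutation, so the mapping stays invertible while breaking obvious n, n+1 adjacency.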
Had the requirements been different, UUIDv7 would have worked well, too, because fragmentation is the biggest problem here.
For one example, say you were making voting-booth software. You really don't want a (hidden) timestamp attached to each vote (much less an incrementing id) because that would break voter confidentiality.
More generally, it's an underlying principle of data management. Not leaking ancillary data is easier to justify than "sure, we leak the date and time of the record's creation, but we can't think of a reason why that matters."
Personally I think the biggest issue is "clever" programmers who treat the UUID as data and start displaying the date and time. This leads to complications ("that which is displayed, the customer wants to change"). It's only a matter of time before someone declares the date "wrong" and demands it be "fixed". Not to mention time zone or daylight savings conversions.
But if you are doing that, why not just use an incrementing integer instead of a uuidv7?
Also, if most of your endpoints require auth, this is not typically a problem.
It really depends on your application. But yes, that's something to be aware of. If you need some ids to be unguessable, make sure they are not predictable :-)
Sometimes it might even be for a good reason.
For example, if https://github.com/pytorch/pytorch/issues/111111 can be seen but https://github.com/pytorch/pytorch/issues/111110 can't, someone might infer the existence of a hidden issue relating to a critical security problem.
Whereas if the URL was instead https://github.com/pytorch/pytorch/issues/761500e0-0070-4c0d... that risk would be avoided.
Let's clarify things.
The author argues against UUIDv4 as primary keys when compared to integers or bigints in large databases for performance reasons.
The examples you give refer to the common mistake of using a non-unique attribute that can be changed for a given entity as a primary key.
I'm not a normalization fanatic, but we're only talking about 1NF here.
Or have the opportunity to scam people into thinking you’re a different person. (E.g. take a $1M loan, go bankrupt, remember your birthday, and take a loan again.)
Estonian isikukood is GYYMMDDNNNC, and is relatively public. You can find mine pretty easily if you know where to look (no spoilers!). It’s relatively harmless.
Kazakh IIN is YYMMDDNNNNNN (where N might have some structure) and is similarly relatively public: e.g. if you’re a sole proprietor, chances are you have to hang your license on the wall, which will have it.
It’s a bit more serious: I’ve got my mail at the post office by just showing a barcode of my IIN to the worker. They usually scan it from an ID, which I don’t have, but I’ve figured out the format and created a .pkpass of my own. Zero questions – here’s your package, no we don’t need your passport either, have a nice day!
(Tangential, but Kazakhs also happen to have the most peculiar post office layout: it looks exactly like a supermarket, where you go in, find your packages (sorted by the tracking number, IIRC), and go to checkout. I’ve never seen it anywhere else.)
This is so needlessly complex that you contradicted yourself immediately. You claim there is no “person” identified but immediately say you have information “about a person”. The fact that you can assert that the information is about a person means that you have identified a person.
Clearly tying data to the person makes things so much easier. I feel like attempting to do what you propose is begging to mess up GDPR erasure.
> “So I got a request from a John Doe to erase all data we recorded for them. They identified themselves by mailing address and current phone number. So we deleted all data we recorded for that phone number.”
> “Did you delete data recorded for their previous phone number?”
> “Uh, what?”
The stubborn refusal to create a persistent identifier makes your job harder, not easier.
Do not expose your internal IDs. As simple as that.
I bet people will extract that date and use it, and it's hard to imagine use which wouldn't be abuse. To take the example of a PN/SSN and the usual gender bit: do you really want anyone to be able to tell that you got a new ID at that time? What could you suspect if a person born in 1987 got a new PN/SSN around 2022?
Leaks like that, bypassing whatever access control you have in your database, are just one reason to use truly random IDs. But it's a pretty good one in itself.
Besides the UUIDv7 specification, that is? Otherwise you have some arbitrary kind of UUID.
> I would not count on the first 48 bits being a "real" timestamp.
I agree; this is the existential hazard under discussion which comes from encoding something that might or might not be data into an opaque identifier.
I personally don't agree as dogmatically with the grandparent post that extraneous data should _not_ be incorporated into primary key identifiers, but I also disagree that "just use UUIDv7 and treat UUIDs as opaque" is a completely plausible solution either.
What most people intuit as random is some sort of noise function that is generally dispersed and doesn't trigger the pattern-matching part of their brain.
UUIDs or other similarly randomized IDs are useful because they don't include any ordering information or imply anything about significance, which is a very safe default despite the performance hits.
There certainly are reasons to avoid them and the article we're commenting on names some good ones, at scale. But I'd argue that if you have those problems you likely have the resources and experience to mitigate the risks, and that true randomly-derived IDs are a safer default for most new systems if you don't have one of the very specific reasons to avoid them.
The UUID would be an example of an external key (for e.g. preventing crawling keys being easy). This article mentions a few reasons why you may later decide there are better external keys.
> When I come to you and say "My name is X, this is my phone number, this is my e-mail, I want my GDPR records deleted", you still need to be able to find all data that is related to me.
How are you going to trace all those records if the requester has changed their name, phone number and email since they signed up if you don't have a surrogate key? All 3 of those are pretty routine to change. I've changed my email and phone number a few times, and if I got married my name might change as well.
> Once you start thinking about your database as structured storage of facts that you can use to infer conclusions, there is much less need for surrogate keys.
I think that spirals into way more complexity than you're thinking. You get those timestamped records about "we got info about person named Y with phone number Z", and then person Y changes their phone number. Now you're going to start getting records from person named Y with phone number A, but it's the same account. You can record "person named Y changed their phone number from Z to A", and now your queries have to be temporal (i.e. know when that person had what phone number). You could back-update all the records to change Z to A, but that breaks some things (e.g. SMS logs will show that you sent a text to a number that you didn't send it to).
Worse yet, neither names nor phone numbers uniquely identify a person, so it's entirely possible to have records saying "person named Y and phone number Z" that refer to different people if a phone number transfers from a John Doe to a different person named John Doe.
I don't doubt you could do it, but I can't imagine it being worth it. I can't imagine a way to do it that doesn't either a) break records by backdating information that wasn't true back then, or b) require repeated/recursive querying that will hammer the DB (e.g. if someone has had 5 phone numbers, how do you get all the numbers they've had without pulling the latest one to find the last change, and then the one before that, and etc). Those queries are incredibly simple with surrogate keys: "SELECT * FROM phone_number_changes WHERE user_id = blah".
May also be that you don't want to leak information like how many orders are being made, as could be inferred from a `/fetch_order?id=123` API with sequential IDs.
Sequential primary keys are still commonly used though - it's a scenario-dependent trade-off.
Rather, users traverse through that object graph, narrowing down a range of keys of interest.
This hacker news article was given a surrogate key, 46272487. From that, you can determine what it links to, the name/date/author of the submission, comments, etc.
46272487 means absolutely nothing to anybody involved. But if you wanted to see submissions from user pil0u, or submissions on 2025-12-15, or submissions pertaining to UUID, 46272487 would be in that result set. Once 46272487 joins out to all of its other tables, you can populate a list that includes the user name, title, domain, etc.
Do not encode identifying information in unique identifiers! The entire world of software is built on surrogate keys and they work wonderfully.
Someone may or may not have used the pattern to get to know the attrition rate through running a simple script every month))
The benefit of uuid in this case is that it allows horizontally scalable app servers to construct PKs on their own without risk of collisions. In addition to just reducing database load by doing the ID generation on the app server (admittedly usually a minor benefit), this can be useful either to simplify insert queries that span multiple tables with FK relationships (potentially saving some round trips in the process) or in very niche situations where you have circular dependencies in non-nullable FKs (with the constraint deferred until the end of the transaction).
I am much more interested in the `si` parameter.. but I am fairly sure nobody outside of Google knows what it is exactly.
Many systems are not sparse, and separately, that's simply wrong. Unguessable names are not a primary security measure, but a passive remediation for bugs or bad code. Broken access control remains an OWASP Top 10 entry, and IDOR is a piece of that. Companies still get popped for this.
See, eg, google having a bug in 2019, made significantly less impactful by unguessable names https://infosecwriteups.com/google-did-an-oopsie-a-simple-id...
Tbf in Postgres, you can declare FKs to be deferrable, so their existence is checked at transaction commit, rather than at insertion time.
If you don’t have the DB enforcing referential integrity, you need to be extremely careful in your application logic; IME, this inevitably fails. At some point, someone writes bad code, and you get data anomalies.
* https://en.wikipedia.org/wiki/Buginese_language
It appears to be a cultural construct.
I am confident in this fact because I learned it in elementary school decades ago, and it is impossible for humanity to discover new information that updates our world model. Every English speaker knows that "plasmas" and "Bose-Einstein condensates" are made up.
However, because the online ordering system assigned order numbers sequentially, it would have been trivial for that company to determine how important their business was.
For example, over the course of a month, they could order something at the start of the month and something at the end of the month. That would give them the total number of orders in that period. They already know how many orders they have placed during the month, so company_orders / total_orders = percentage_of_business
It doesn't even have to be accurate, just an approximation. I don't know if they figured out that they could do that but it wouldn't surprise me if they had.
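The arithmetic is trivial; a sketch of the estimate with entirely made-up order numbers:

```python
# Hypothetical sketch of the estimate described above: two of your own
# orders bracket a month, and the gap between their sequential order
# numbers reveals the vendor's total order volume for that period.
def share_of_business(first_id, last_id, my_orders):
    total_orders = last_id - first_id  # orders placed by everyone in between
    return my_orders / total_orders

# e.g. order #10250 on the 1st, order #10750 on the 30th, 25 of our own:
print(share_of_business(10250, 10750, 25))  # -> 0.05, i.e. 5% of their business
```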
Once you encode the shard number into the ID, you get:
- instantly* knowing which shard to query
- each shard having its own ticker
* programmatically, maybe visually as well depending on implementation
I had IDs that encoded: entity type (IIRC 4 bits?), timestamp, shard, sequence per shard. We even had an admin page where you could paste an ID and it would decode it.
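A rough sketch of such a composite ID in Python. The bit widths here (4/40/8/12) are assumptions for illustration, not the original scheme:

```python
# Pack a 64-bit ID: 4 bits entity type, 40 bits timestamp (ms since a
# custom epoch), 8 bits shard, 12 bits per-shard sequence.
def encode_id(entity_type, ts_ms, shard, seq):
    return (entity_type << 60) | (ts_ms << 20) | (shard << 12) | seq

# The "admin page" decode: split the packed value back into its fields.
def decode_id(packed):
    return {
        "entity_type": (packed >> 60) & 0xF,
        "ts_ms": (packed >> 20) & ((1 << 40) - 1),
        "shard": (packed >> 12) & 0xFF,
        "seq": packed & 0xFFF,
    }
```

With 8 shard bits you can route a query by doing nothing more than a shift and a mask on the ID itself.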
id % n is fine for a cache, because you can just throw the whole thing away and repopulate, or when 'n' never changes, but it usually does.
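The churn when `n` changes is easy to quantify: with plain modulo, most keys land on a different shard after resharding.

```python
# Count how many of 10,000 keys move when going from 4 to 5 shards
# under id % n routing. Fine for a throwaway cache, painful for
# real shards that hold data.
moved = sum(1 for key in range(10_000) if key % 4 != key % 5)
print(moved / 10_000)  # -> 0.8: 80% of keys land on a new shard
```

This is why schemes like consistent hashing exist: they bound the fraction of keys that move to roughly 1/n instead.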
The purpose is to reduce randomness while still preserving the probability of uniqueness. UUIDv4s come with performance issues when used to bucket data for updates, such as when they're used as primary keys in a database.
A database like MySQL or PostgreSQL has sequential IDs and you'd use those instead, but if you're writing something like Iceberg tables using Trino/Spark/etc then being able to generate unique IDs (without using a data store) that tend to be clustered together is useful.
The only promise of Unix timestamps is that they never go backwards; they only increase. This is a property of a sequence of UUIDs, not of any particular instance. At most, one might argue that an "utterly valid" UUIDv7 should not contain a timestamp from the far future. But I don't see why it can't be any time in the past, as long as the timestamp part does not decrease.
The timestamp aspect may be a part of an additional interface agreement: e.g. "we guarantee that this value is UUIDv7 with the timestamp in UTC, no more than a second off". But I assume that most sane engineers won't offer such a guarantee. The useful guarantee is the non-decreasing nature of the prefix, which allows for sorting.
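For reference, RFC 9562 puts the millisecond Unix timestamp in the top 48 bits of a UUIDv7. A minimal, simplified sketch of generating one and reading that prefix back (not production code — real implementations also handle monotonicity within a millisecond):

```python
import os
import time
import uuid

def uuid7():
    # Simplified UUIDv7 layout per RFC 9562: 48-bit ms timestamp,
    # 4-bit version, 12 + 62 random bits, 2-bit variant.
    ts_ms = int(time.time() * 1000)
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF
    rand_b = int.from_bytes(os.urandom(8), "big") & 0x3FFFFFFFFFFFFFFF
    value = (ts_ms & 0xFFFFFFFFFFFF) << 80
    value |= 0x7 << 76        # version 7
    value |= rand_a << 64
    value |= 0x2 << 62        # RFC variant bits
    value |= rand_b
    return uuid.UUID(int=value)

def uuid7_timestamp_ms(u):
    # The "leak" under discussion: anyone holding the UUID can
    # read its claimed creation time with a single shift.
    return u.int >> 80
```

Note that nothing stops a client from putting an arbitrary value in those 48 bits, which is exactly why the prefix should be treated as a sorting hint, not a trustworthy timestamp.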
And when working with very large datasets, there are very significant downsides to large, completely random IDs (which is of course what the OP is about).
The idea behind putting a timestamp prefix was b-tree efficiency, but lots of people use client-side generation, which you can't trust. And it shouldn't matter anyway, because it is just an ID, not a way of registering time.
> Do not encode identifying information in unique identifiers! The entire world of software is built on surrogate keys and they work wonderfully.
The amount of manual work required to manage duplicates is in no small part the result of not thinking enough about the identifiers and simply slapping surrogate keys on the data.
And for those using ints as keys... you'd be surprised how many databases in the wild won't come close to consuming that many IDs or are for workloads where that sort of volume isn't even aspirational.
Now, to be fair, I'm usually in the UUID camp and am using UUIDv7 in my current designs. I think the parent article makes good points, but I'm after a different set of trade-offs where UUIDs are worth their overhead. Your mileage and use-cases may vary.
This is simply not a meaningful statement. Any ID you expose externally is also an internal ID. Any ID you do not expose is internal-only.
If you expose data in a repeatable way, you still have to choose what IDs to expose, whether that’s the primary key or a secondary key. (In some cases you can avoid exposing keys at all, but those are narrow cases.)
Thank you for spelling it out for me. For the readers: it leaks the information that the person is likely not a natural-born citizen. The assumption doesn't have to be a hundred percent accurate; there is a way to make that assumption, and possibly hold it against you.
And there are probably a million ways that a record-creation date could be held against you. If they don't put it in writing, how will you prove they discriminated against you?
Thinking... I don't have a good answer to this. If data exists, people will extract meaning from it whether rightly or not.
Pretty sure sorting and filtering them by date/time range in a database is the purpose.
Not for me :)
"Internal" means "not exposed outside the database" (that includes applications and any other external systems)
So we are talking about "external" keys (ie. visible outside the database). We are back to square one: externally visible surrogate keys are problematic because they are detached from real world information they are supposed to identify and hence don't really identify anything (see my example about GDPR).
It does not matter if they are random or not.
> How are you going to trace all those records if the requester has changed their name, phone number and email since they signed up if you don't have a surrogate key?
And how does surrogate key help? I don't know the surrogate key that identifies my records in your database. Even if you use them internally it is an implementation detail.
If you keep information about the time information was captured, you can at least ask me "what was your phone number the last time we interacted, and when was that?"
> I think that spirals into way more complexity than you're thinking.
This complexity is there whether you want it or not and you're not going to eliminate it with surrogate keys. It has to be explicitly taken care of.
DBMSes provide means to tackle this essential complexity: bi-temporal extensions, views, materialized views etc.
Event sourcing is a somewhat convoluted way to attack this problem as well.
> Those queries are incredibly simple with surrogate keys: "SELECT * FROM phone_number_changes WHERE user_id = blah".
Sure, but those queries are useless if you just don't know user_id.
It won't happen though. 0.00000000% chance it happens even once in a trillion attempts.
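The birthday bound backs this up. For n random values drawn from a space of 2^b, the probability of any collision is roughly n²/2^(b+1), and UUIDv4 has 122 random bits:

```python
# Birthday-bound sketch: approximate probability of at least one
# collision among n uniformly random b-bit values.
def collision_probability(n, bits=122):
    return n * n / (2 ** (bits + 1))

# Even a trillion UUIDv4s collide with probability on the order of 1e-13.
print(collision_probability(10**12))
```

So "once in a trillion attempts" is, if anything, a wild overstatement of the risk.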
> What most people intuit as random is some sort of noise function that is generally dispersed and doesn’t trigger the pattern matching part of their brain
Yes, people intuit the texture of random wrong in a situation where most buckets are empty. But when you have orders of magnitude more events than buckets, that effect doesn't apply. You get pretty even results that people expect.
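A quick simulation bears this out: with far more events than buckets, the counts look as even as people expect.

```python
import random

random.seed(1)  # fixed seed so the run is reproducible
buckets = [0] * 10
for _ in range(1_000_000):
    buckets[random.randrange(10)] += 1

# Relative spread between the fullest and emptiest bucket, as a
# fraction of the mean count (~100,000 per bucket).
spread = (max(buckets) - min(buckets)) / (sum(buckets) / len(buckets))
print(spread)  # well under 1%
```

With only a handful of events per bucket you'd see the lumpy "clumps" that trip people's intuition; at this ratio the law of large numbers smooths them out.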
I'm using EF core which hooks up these relationships and allows me to persist them in a single transaction using MSSQL server.
> If you don’t have the DB enforcing referential integrity
I'm building an electronic medical system. I'm well aware of the benefits of referential integrity.
Gender in the sense of "the social roles and norms on top of biological sex" is indeed a construct, though heavily informed by the biology that they're based on. Biological sex is very much real and not a construct.
But there are cases where it matters. Using UUIDv7 for identifiers means you need to carefully consider the security and privacy implications every time you create a new table identified by a UUID, and you'll possibly end up with some tables where you use v4 and some where you use v7. Worst case, you'll end up with painful migrations from v7 to v4 as security review identifies timestamped identifiers as a security concern.
Public sector it-systems may use the ID and rely on it not changing.
Private sector it-systems can't look up people by their ID; they can only use the social security number for comparisons and lookups, e.g. for wiping records in GDPR "right to be forgotten" situations. Social security numbers are sort-of useful for that purpose because they are printed on passports, driver's licenses and the like. But they are a problem w.r.t. identity theft, and shouldn't ever be used as an authenticator (we have better methods for that). The person ID isn't useful for identity theft, since it's only used between authorized contexts (disregarding Byzantine scenarios with rogue public-sector actors!). You can't social engineer your way to personal data using that ID (save a few movie-plot scenarios).
So what is internal in this case? The person id is indeed internal to the public sector's it-systems, and useful for tracking information between agencies. They're not useful for Bob or Alice. (They ARE useful for Eve, or other malicious inside actors, but that's a different story, which realistically does require a much higher level of digital maturity across the entire society)
All IDs are detached from the real world. That’s the core premise of an ID. It’s a bit of information that is unique to someone or something, but it is not that person or thing.
Your phone number is a random number that the phone company points to your phone. Your house has a street name and number that someone decided to assign to it. Your email is an arbitrary label that is used to route mail to some server. Your social security number is some arbitrary id the government assigned you. Even your name is an arbitrary label that your parents assigned to you.
Fundamentally your notion that there is some “real world” identifier is not true. No identifiers are real. They are all abstractions and the question is not whether the “real” identifier is better than a “fake” one, but whether an existing identifier is better than one you create for your system.
I would argue that in most cases, creating your own ID is going to save you headaches in the long term. If you bake SSN or Email or Phone Number throughout your system, you will make it a pain for yourself when inevitably someone needs to change their ID and you have cascading updates needed throughout your entire system.
> The only rules that really matter are these: what a man can do and what a man can't do.
When evaluating security matters, it's better to strip off the moral valence entirely ("rightly") and only consider what is possible given the data available.
Another potential concerning implication besides citizenship status: a person changed their id when put in a witness protection program.
> > Using a random UUID as primary key does not mean users have to memorize that UUID. [...]
> So what is such an identifier for? [...] Why bother with UUID at all then for internal identifiers?
The context, that you're questioning what they're useful for if not for use by the user, suggests that "internal" means the complement. That is, IDs used by your company and software, and maybe even API calls the website makes, but not anything the user has to know.
Otherwise, if "internal" was intended to mean something stricter (only used by a single non-distributed database, not accessed by any applications using the database, and never will be in the future), then my response is just that many IDs are neither internal in this sense nor intended to be memorized/saved by the user.
The second ID has nothing to do with internal structure of your data. It is just another field.
You can change your structure however you want (or type of your "internal" IDs) and you don't have to worry about an external consumer. They still get their artificial ID.
This assumption that you can query across IDs is exactly what is being cautioned against. As soon as you do that, you are taking a dependency on an implementation detail. The contract is that you get a UUID, not that you get 48 bits of timestamp. There are 8 different UUID versions, and even v7 has more than one variant.
Again, sometimes it does, the article lists a few of them. Making it harder to scrape, unifying across databases that share a keyspace, etc.
> And how does surrogate key help? I don't know the surrogate key that identifies my records in your database. Even if you use them internally it is an implementation detail.
That surrogate key is linked to literally every other record in the database I have for you. There are near infinite ways for me to convert something you know to that surrogate key. Give me a transaction ID, give me a phone number/email and the rough date you signed up, hell give me your IP address and I can probably work back to a user ID from auth logs.
The point isn't that you know the surrogate key, it's that _everything_ is linked to that surrogate key so if you can give me literally any info you know I can work back to the internal ID.
> This complexity is there whether you want it or not and you're not going to eliminate it with surrogate keys. It has to be explicitly taken care of.
Okay, then let's do an exercise here. A user gives you a transaction ID, and you have to tell them the date they signed up and the date you first billed them. I think yours is going to be way more complicated.
Mine is just something like:
SELECT user_id FROM transactions WHERE transaction_id=X;
SELECT transaction_date FROM transactions WHERE user_id=Y ORDER BY transaction_date ASC LIMIT 1;
SELECT signup_date FROM users WHERE user_id=Y;
Could be a single query, but you get the idea.
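For the curious, a single-query version sketched against a toy SQLite schema (the table and column names here are assumed, not from any real system):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, signup_date TEXT);
    CREATE TABLE transactions (
        transaction_id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(user_id),
        transaction_date TEXT
    );
    INSERT INTO users VALUES (7, '2024-01-01');
    INSERT INTO transactions VALUES (100, 7, '2024-01-05'),
                                    (101, 7, '2024-02-05');
""")

# Resolve whatever transaction ID the user hands us back to their
# signup date and first billing date, all via the surrogate key.
row = con.execute("""
    SELECT u.signup_date, MIN(t2.transaction_date)
    FROM transactions t
    JOIN users u  ON u.user_id = t.user_id
    JOIN transactions t2 ON t2.user_id = u.user_id
    WHERE t.transaction_id = ?
    GROUP BY u.user_id
""", (101,)).fetchone()
print(row)  # -> ('2024-01-01', '2024-01-05')
```

Every hop in that join is a simple equality on the surrogate key, which is the whole point.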
> DBMSes provide means to tackle this essential complexity: bi-temporal extensions, views, materialized views etc.
This kind of proves my point. If you need bi-temporal extensions and materialized views to tell a user what their email address is from a transaction ID, I cannot imagine the absolute mountain of SQL it takes to do something more complicated like calculating revenue per user.
It has the same odds as any other specific configuration of randomly assigned dots. The overly active human pattern matching behavior is the only reason it would be treated as special.
https://en.wikipedia.org/wiki/Intersex#Prevalence
https://en.wikipedia.org/wiki/Klinefelter_syndrome
https://en.wikipedia.org/wiki/XXYY_syndrome
https://en.wikipedia.org/wiki/XXXY_syndrome
https://en.wikipedia.org/wiki/XXXYY_syndrome
https://en.wikipedia.org/wiki/XXXXY_syndrome
https://en.wikipedia.org/wiki/Trisomy_X
I assume you will be one of the advocates for my nobel prize
edit: I'm sorry you specifically mentioned gametes, we can talk about diploids and haploids if you wish and how our bodies are such complicated machines that any sort of error that can occur in our growth is guaranteed to at scale
UUIDv4s are great when you add sharding, and UUIDs in general prevent issues with mixing IDs from different tables. But if you reach the kind of scale where you have 2 billion of anything, UUIDs are probably not the best choice either.
That’s a more reasonable statement but I still don’t agree. This feels like one of those “best practices” that people apply without thinking and create pointless complexity.
Don’t expose your primary key if there is a reason to separate your primary key from the externally-exposed key. If your primary key is the form that you want to expose, then you should just expose the primary key. e.g. If your primary key is a UUID, and you create a separate UUID just to expose publicly, you have most likely added useless complexity to your system.
It is easy to have strong opinions about things you are sheltered from the consequences of.
Which doesn't change anything in practice, since "the same odds as any other specific configuration" ignores the fact that more scattered configurations are still far more numerous than it (or even than configurations with more visual order in general) taken all together.
>The overly active human pattern matching behavior is the only reason it would be treated as special.
Nope, it's also the fact that it is ONE configuration, whereas all the rest are a much, much larger set. That's enough to make this specific configuration ultra-rare in comparison (since we don't compare it to each of the others individually but to all the others put together).
I am not arguing against surrogate keys in general. They are obviously very useful _internally_ to introduce a level of indirection. But if they are used _internally_ then it doesn't really matter if they are UUIDs or sequence numbers or whatever - it is just an implementation detail.
What I claim is that surrogate keys are problematic as _externally visible_ identifiers.
> Okay, then lets do an exercise here. A user gives you a transaction ID, and you have to tell them the date they signed up and the date you first billed them. I think yours is going to be way more complicated.
> Mine is just something like:
> SELECT user_id FROM transactions WHERE transaction_id=X; SELECT transaction_date FROM transactions WHERE user_id=Y ORDER BY transaction_date ASC LIMIT 1; SELECT signup_date FROM users WHERE user_id=Y;
I think you are missing the actual problem I am talking about: where does the user take the transaction ID from? Do you expect the users to remember all transaction IDs your system ever generated for them? How would they know which transaction ID to ask about? Are they expected to keep some metadata that would allow them to identify transaction IDs? But if there is metadata that enables identification of transaction IDs then why not use it instead of transaction ID in the first place?
Intersex is a misleading term, the better term is https://en.wikipedia.org/wiki/Disorders_of_sex_development. There are male DSDs and female DSDs. Even in the case of ovotestes, you'll have one gamete produced, and the other tissue will be nonfunctional.
There's actually an ideological movement to try to redefine sex based on sex traits instead of gametes, but this ends up being incoherent and useless for the field of biology. Biologists have had to publish papers explaining the fundamentals of their field to counter the ideological narrative:
Why There Are Exactly Two Sexes
https://link.springer.com/article/10.1007/s10508-025-03348-3
That's why I thought it was worth mentioning. Many people are confused because of the culture wars. To bring it back around to the general topic of this thread, it's fine to store someone's sex as a boolean, because sex is binary and immutable. Storing cultural constructs like gender as anything other than an arbitrary string is asking for trouble, though.
You can have more details here, in the section "Complete invoice":
https://sede.agenciatributaria.gob.es/Sede/en_gb/iva/factura...
https://www.boe.es/buscar/act.php?id=BOE-A-2012-14696#a6 (Spanish only)
While you're at it, you could tell us all what the scientific discovery was that made gender separate from sex, who found it and when, and what the defining difference is. Did they win a Nobel for that?
I request that in any reply, you refrain from spamming me with Wikipedia links to articles you don't understand and probably haven't read.
My exact thought.
A lot else has failed in your system, from access control to API design, if this becomes a problem. Security by obscurity isn’t the answer.
If the only thing between an attacker and your DB is that they can’t guess the IDs you’re already in some serious trouble.
What? This is definitely not the case and can’t be because B-tree nodes change while UUIDs do not.
Your notion that you can avoid sharing internal IDs is technically true, but that doesn't mean it's a good idea. You're trying to force a philosophical viewpoint while disregarding practical concerns, many of which people have already pointed out.
But to answer your question, yes, your customer will probably have some notion of a transaction id. This is why everyone gives you invoice numbers or order numbers. These are indexes back into some system. Because the alternative is that your customer calls you up and says “so I bought this thing last week, maybe on Tuesday?” And it’s most likely possible to eventually find the transaction this way, but it’s a pain and usually requires human investigation to find the right transaction. It’s wasteful for you and the customer to do business this way if you don’t have to.
> Nope, it's also the fact that it is ONE configuration, whereas all the rest are much much larger number.
That is the overactive human pattern matching at play. I compared the single configuration of all dots in one location to any other specific configuration. You are not comparing it to _every other configuration_, because those are not the same thing.
You are assigning specific importance to a single valid set of randomly selected data, because it seems significant to our brains.
If I asked you to give me an array of 1 million items containing x and y coordinates, what are the odds that any single specific set of items is returned?
Based on your answer to that, what are the odds of getting a set where every item has the same exact x and y coordinates, versus a set with differing x and y coordinates?
If you answer anything other than it being the same chance, then you either don't think the selection mechanism is random, or you are falling for the standard fallacies around randomness.
I am laughing at all the people coming out of the woodwork to reply to my original post in this thread misunderstanding randomness and chance.
If you flip a coin a million times and it lands on heads every single time, the million-and-first flip still has a 50/50 chance of landing on heads.
If you are going to step into this argument, please do not move the goalposts
edit: I've triggered the HN censor bot, so editing to apologize to EnergyAmy, they are correct on their point. I am still going to throw back at brigandish that they moved the goalposts
Not all humans are born with the attribute of reproductive sex via gametes.
Hence "biological sex is real and strongly bimodal with outliers" (in humans, it gets odder elsewhere in animal life on earth) it's just not all reproductive sex, nor is all just strictly M or strictly F despite it mostly being one or the other.
> To bring it back around to the general topic of this thread, it's fine to store someone's sex as a boolean, because sex is binary and immutable.
Not in Australia, via a decision that ascended through all levels of the national court system, nor is sex, as you've chosen to define it ("entirely defined by gametes") binary.
Biology is truly messy. It's understandable that not everybody truly grasps this.
Colin Wright is pretty much a prop up cardboard "scientist" for the Manhattan Institute (a political conservative think tank).
I tend to run with people with actual field credentials doing real biology and medicine; Michael Alpers, Fiona Stanley, Fiona Wood, et al were my influences.
If Colin Wright scratches your itch for bad biology then by all means run with the one hit wonder who reinforces a preconception untroubled by empiricism.
That's just how importance works.
It sets some things aside as "significant to our brains". The universe doesn't care; even total heat death is not "important" if one excludes us making a prioritization of things.
Given our classification of orderly configurations as a distinct set, the comparison is between "any of all the random-looking/noise-like configurations" vs "any of all the orderly-looking configurations". And the former are far more numerous.
>if you answer anything other than it being the same chance, then you either don't think the selection mechanism is random, or you are falling to the standard fallacies around randomness
You're confusing the selection mechanism (random) with the classification mechanism that segments the set of possible outcomes into orderly vs not (not random).
As a simpler example, imagine a bag with N lottery numbers on individual cards. If one is picked at random, the chance any given number has is 1/N. But the chance that a number OTHER than ours comes up is (N-1)/N. Our chances are as good as any other single number's, sure. But they're NOT as good as all the other numbers put together.
You argue that "they're all just sets of coordinates" or "all just lottery numbers".
Sure, but some of those coordinate sets have importance to us, and others don't. One of those lottery numbers is important to us; all the others aren't. And since the latter is a much larger group, the probability of a member of it coming up is much larger too.
That we consider one subset of results more special than the other is not negotiable. It's a thing we actually do in the real world, and it's the premise of the whole discussion.
Who said anything about specific configurations?
We started this talking about whether things "clump" or not. The result depends on your definition of "clump" but let's say it involves a standard deviation. Different standard deviations have wildly different probabilities, even when every specific configuration has the same probability.
Nobody responding to you is calculating things wrong. We're talking about the shape of the data. Categories. And those categories are different sizes, because they have different numbers of specific configurations in them.
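Coin flips make the category-size point concrete: every specific sequence of 100 flips is equally likely, yet the categories people actually notice differ enormously in size.

```python
import math

n = 100  # coin flips
# Every specific sequence has probability 2**-100, but the categories
# differ wildly in size: "exactly 50 heads" contains about 1e29
# sequences, while "all heads" contains exactly one.
p_specific = 0.5 ** n
p_half_heads = math.comb(n, 50) * p_specific
p_all_heads = 1 * p_specific

# The half-heads *category* is ~1e29 times more likely than all-heads,
# even though each individual sequence is equiprobable.
print(p_half_heads / p_all_heads)
```

Both sides of the thread are right about different things: equal probability per configuration, wildly unequal probability per category.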
> the millionth and 1 time
I don't see any connection between the above discussion and the gambler's fallacy?
What some call "philosophical viewpoint" I call "essential complexity" :)
> But to answer your question, yes, your customer will probably have some notion of a transaction id. This is why everyone gives you invoice numbers or order numbers.
We are in agreement here: externally visible identifiers are needed for many reasons (mostly technical). The discussion is not about that though but about what information should be included in these identifiers.
> This is why everyone gives you invoice numbers or order numbers.
And there are good reasons why invoice or order numbers are not randomly generated strings but contain information about the invoices and orders they identify.
My claim is that externally visible identifiers should possess a few characteristics:
* should be based on the data they identify (not detached from it)
* should be easy to remember (and that means they should be as short as possible, they should be easy to construct by a human from the data itself - so they cannot be hashes of data)
* should be versioned (ie. they should contain information somehow identifying the actual algorithm used to construct them)
* should be easy to index by database engines (that is highly db implementation dependent unfortunately)
* can be meaningfully sortable (that is not strictly a requirement but nice to have)
Coming up with an identifier having these characteristics is not trivial but is going to pay off in the long run (ie. is essential complexity).
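A minimal sketch of an identifier with those characteristics. The format (`V<version>-<date>-<customer>-<seq>`) and the function name are invented for illustration, not a standard:

```python
from datetime import date

def make_invoice_id(customer_code: str, seq: int, issued: date,
                    version: int = 1) -> str:
    """Build a hypothetical invoice identifier that is:
    - derived from the data it identifies (date, customer, sequence number),
    - short and constructible by a human without hashing anything,
    - versioned (the leading V1 names the construction algorithm),
    - meaningfully sortable as a plain string (date-first layout).
    """
    return f"V{version}-{issued:%Y%m%d}-{customer_code.upper()}-{seq:04d}"

print(make_invoice_id("acme", 42, date(2024, 3, 1)))  # V1-20240301-ACME-0042
```

Because the string is monotone in date and sequence, it also indexes reasonably well in most B-tree-backed databases.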
I look forward to your citation disputing the truth of what he lays out in that paper. In the meantime, feel free to peruse the list here of people affirming the same stance:
https://projectnettie.wordpress.com/
Or someone else:
https://www.nas.org/academic-questions/33/2/in-humans-sex-is...
You should ask the people you run with why no human is born with a body not organized around the production of gametes. You'll notice that when you read about conditions like anorchia or ovarian agenesis, the sex of the person with that condition is not a mystery, it's literally in the name.
Biology is messy indeed, and that's why finding such a universal definition was so useful.
The answer to all of this is to remember that sex is about reproduction, so it must fundamentally be based on gametes.
> you could tell us all what the scientific discovery was that made gender separate from sex, who found it and when, and what the defining difference is. Did they win a Nobel for that?
Take your time, but please avoid making me restate what I've written along with the obvious implications simply because you find it all too inconvenient to address.
* Based on the data they identify - This is a minefield of accidental complexity. Data changes and needs to be redacted for GDPR and other data laws. What do you do when someone demands you delete all personally identifiable data but you’ve burned it into invoice ids that you need to retain for other legal reasons? This is also begging for collisions and very much at odds with making IDs short.
* easy to remember - This is a nice to have. Short is convenient for sharing on the phone. Memorability doesn't matter much; I don't remember any invoice number I've ever received.
* versioned - Versioning is only interesting because you’re trying to derive from real data. Again, accidental complexity.
* easy to index - Sure.
* sortable - Nice to have at best.
So why are you trying to?
> I look forward to your citation disputing the truth of what he lays out in that paper.
Just look to his reputation in the field .. it's up there with Jo Nova on climate physics .. laughable.
> You should ask the people you run with why no human is born with a body not organized around the production of gametes.
So you're implicitly admitting that humans are born without gametes then? You've certainly dodged that question multiple times in your comment history.
You're also not admitting to yourself the existence of those humans born with conflicting organisation re: sexual reproduction - when the physical form, the chromosomes, the gametes, et al don't align.
From an empirical PoV, for people doing field work here, it's simply silly to claim that only two cases cover all variations - it's a mystery why anyone would work so hard to force it.
The question of classifying human births is larger - not all humans born have gametes. Some have two sets.
For people interested in actual observed birth cases there's a lot more going on than a moronically oversimplified two-buckets-covers-all-cases approach when it comes to attributing sex [] .. clearly M or clearly F with everything aligned (physical form + chromosomes + gametes) covers most cases .. and then there's the rest.
It gets even broader when including mammals such as rabbits and pigs as they express cases that are potentially possible in humans but not (as yet) observed or on record.
> so it must fundamentally be based on gametes.
Wishful thinking stemming from a strongly held preconceived idea of how the world must be, rather than field-based observation of what actually occurs.
No you did not.
> Since sex and gender are not different until you are able to provide some reason that they are beyond bare assertion then gametes are relevant.
Sex is a parameter of biology, gender is a parameter of social constructs.
You are also having bare assertions that they are the same. Gametes are not relevant. You are unable to discern between different values.
Also stop bringing up the Nobel prize like it matters for the conversation. You are the one who interjected it into the conversation.
Edit: added after the post. To make sure I am not speaking to a bot, can you tell me who the first person in this thread was that mentioned the word “gamete”
> * easy to remember
(which means human readable and related to the actual information which makes them easier to remember)
These actually are the most important features.
Example: transaction references not related to the actual subject of the transaction (ie. what is being paid for) are an enabler of MITM scam schemes.
> Short is convenient
Nah. Short is crucial for identifiers to be effective for computers to handle (memory and CPU efficiency). Otherwise we wouldn't need any identifiers and would just pass raw data around.
> * versioned - Versioning is only interesting because you’re trying to derive from real data.
Nah. Even UUID formats contain version information.
> * easy to index - Sure.
> * sortable - Nice to have at best.
These are directly related (and in the context of the UUIDv4 vs UUIDv7 discussion, sortable is not enough - we also want them to be "close" to each other when generated so that they can be indexed efficiently).
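A non-spec-compliant sketch of why time-prefixed identifiers like UUIDv7 index well. The helper name and layout are invented for illustration; a real UUIDv7 also sets version and variant bits, and would use the current clock rather than a supplied timestamp:

```python
import os

def uuid7_like(ts_ms: int) -> bytes:
    """Toy UUIDv7-style ID: a 48-bit big-endian millisecond timestamp
    followed by 10 random bytes. A real implementation would take
    int(time.time() * 1000); the timestamp is a parameter here so the
    demo is deterministic."""
    return ts_ms.to_bytes(6, "big") + os.urandom(10)

# Five IDs minted one millisecond apart.
ids = [uuid7_like(1_700_000_000_000 + i) for i in range(5)]

# Because the timestamp leads, the IDs sort in creation order, and IDs
# generated close together share a common prefix -- so new rows land near
# each other in a B-tree index instead of scattering across random pages
# (which is exactly what fully random UUIDv4 keys do).
print(ids == sorted(ids))  # True
```

That prefix locality, not sortability per se, is what makes insert-heavy indexes behave well.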
I continue to look forward to your citation disputing the truth of what he lays out in that paper, or the other links I provided that affirm the same stance. Ad hominems are boring, don't you have anything?
You unfortunately don't really understand the point here, but to reiterate, just because someone is born with nonfunctional/missing gonads doesn't mean their body isn't sexed. As an analogy, if someone is born without a hand, we don't just shrug and say that it could've been a fin, or antlers, or a firetruck. That's the point of saying that their body is organized around the production of one of exactly two gamete types.
There's no conflict, physical form and chromosomes are variations within a sex, which is entirely defined by gametes. Chromosomes are part of how sex is determined, but gametes are how sex is defined.
I look forward to your citations of these people doing field work that support your points.
You keep saying that but you have provided virtually no evidence in support of this. This is why I called your claim philosophical. You are asserting this as fact and arguing from that standpoint rather than considering what is the best based on actual requirements and trade offs.
> Example: transaction references not related to the actual subject of the transaction (ie. what is being paid for) is enabler for MITM scam schemes.
I don’t see how this is true. If anything transaction references based on the actual subject would make scamming slightly easier because a scammer can glean information from the reference.
I’m going to stop here, though. I don’t see that this is going to converge on any shared agreement.
Take care. And if you celebrate the holidays, happy holidays, too.
Do grow up.
> Sex is a parameter of biology, gender is a parameter of social constructs.
So you assert, but until you can show the moment that gender was shown to be different from sex beyond bare assertion then I'm not willing to accept your assertion. Do you see how that works?
> Gametes are not relevant.
They are relevant to sex determination and hence gender, see previous paragraph for why.
That's all your points, if they can be called that, addressed.
> Also stop bringing up the Nobel prize like it matters for the conversation.
You'll need to provide something that can be competitive for it to have any impact <shrug>. I won't hold my breath.
An incomplete one that fails to cover all cases.
You unfortunately don't really understand the point here
There is a Bitcoin seller B, a thief T and a victim V.
T proposes to buy Bitcoin from B. T offers a new iPhone for a very low price to unsuspecting V. V agrees to buy it. B gives T account details and transaction reference so that T can transfer money to B's account. T gives these details to V. V transfers the money. B transfers Bitcoin to T. T disappears.
If the transaction reference had contained information that the transfer was about buying Bitcoin, V would never have paid the money.
The scheme was quite common in UK because banks did not like Bitcoin so Bitcoin sellers and buyers avoided referencing it in bank transfers.
> Do grow up.
When I quote a fragment of someone else's sentence I usually add an ellipsis (…) to show that there's more context and not imply a full statement, but you do you.
Since you dropped the part about me asking you to state how the thread started, I am assuming this is at least a person dropping a prompt into ChatGPT and regurgitating it without editing.
Have a good day
Firstly, I am extremely doubtful that this would actually prevent the issue. A wary buyer would not agree to transfer money account to account like this to pay for a cell phone in the first place. Only gullible people would engage in this scam, and I am doubtful that they would question the transaction ID deeply. “Hey what is this bitcoin thing?” “Oh, don’t worry about it. That’s just internal for our tracking purposes. Do you want me to throw in a free phone case too?!”
Secondly, this seems like a massive privacy concern. Is someone purchasing sex toys supposed to use a transaction ID like purpledildo656 and expose what they are buying to the bank?
I’m sympathetic to people who get scammed, but I don’t think your transaction IDs solve this problem. People have been getting scammed like this forever. “Hey, just send the $600 via Western Union and I’ll totally put your phone in the mail for you tomorrow.”
This isn’t an ID anyway. What you are really asking for is to mandate that the contents of the purchase be burned into the transaction from the seller all the way to the buyer through the bank. I think that’s a terrible idea because of privacy concerns, but regardless, it’s not an ID. This would be much better expressed as a different form of metadata.
There have been multiple definitions put forward; they all fall down at a few (very few out of nine billion) edge cases.
I look forward to your explanation of why you feel that every human on the planet must be assigned as either [M] or [F] at birth with no recognition of the real circumstances in the actual edge cases.
Not even the class of South African hermaphrodites all fall cleanly one way or the other.. there's furious individual-by-individual debate over which of the two potential gamete-producing mechanisms is less mangled than the other - as you should be aware given your apparent singular obsession here.
I'm curious as to why so recently so much money has been spent on pushing Colin Wright as the new prophet of an old idea that doesn't provide a complete classification.
I'm also not sure why you're so focused on Colin Wright when I provided other examples of people affirming the same stance, as previously stated. Take your pick, or provide citations of your own.
Do you have a particular example in mind for the South African hermaphrodites?
I don't really get the point of this dance you're doing. Why not just admit that you don't have anything to back your comments up?
I'm old, I've been around biology for a long time, and it's never been that way.
* https://en.wikipedia.org/wiki/Phases_of_ice is a cracking read. Good science.
We observe one predominant form of solid water most often in our environment - but other forms exist and their frequency varies elsewhere.
There are literal tomes on disorders of sex development; it's a subject with decades of solid research history and centuries of literary reference.
"Strongly bimodal with some outliers" has been an ideal summary for a complex domain until ... recently.
What motivates you to shoehorn every birth into one of two boxes despite the theoretical and practical issues?
It's not about what motivates me, it's about the scientific consensus in the field of biology. The same consensus that has remained unchanged for well over a century. I've provided citations affirming this consensus, and you've refused to back up your comments. Why not just admit that you're wrong?
EDIT: Rate limited by HN, so I guess we are done here. I'm not sure why you think prompting you to get specific about the group means being unaware of it. I'm rather disappointed in your bluster, it makes me wonder in retrospect if I've been baited into arguing with a bot.
What rot, even a casual perusal of the literature will confirm the debate.
> I've provided citations affirming this consensus
You've cited a single faction that have only recently surged across public communications.
The fact that you're claiming to be unaware of the debate, the history, the SA hermaphrodite group to whom I referred tells me a great deal.
I suspect we're done here.
For now I've a cluster of 12 tonne lego pieces to fit together and seal up; I'll check in later to see if you've any reflection on the actual politics and cultural trappings about this matter that are driving the presentation of a factional PoV.
I look forward to some grown up adult comment, not any childish gotcha traps.