git push --force

I decided to make a worse UUID for the pettiest of reasons.

Solving exactly two tiny non-problems by un-solving a bunch of real ones. But it was a fun learning exercise!

Yesterday, I was messing with an API project I’ve been tinkering on for a long time. The kind of project that one keeps rewriting over and over again through the years to maintain that post-refactor dopamine high. You know what I mean. Anyway, totally out of the blue, I realized something. Something needed to be refactored. You see, I use UUID pretty heavily and that meant my resource URLs were, like, really long and ugly.

A UUID has a bunch of things, depending on the version and variant. But mostly it has a whole bunch of random bits that give it that whole “Universally Unique” property that we like so much. That’s awesome for lots of reasons! But if, for some reason, you don’t like the idea of a battle-proven standard with a massive, mature ecosystem of native support… this one’s for you!

So what is it?

Without further ado, let me introduce you to dotvezz/smolid. It’s an ID scheme implemented in Go, which provides some actually useful stuff! An example smolid looks like acpje64aeyez6 and it…

And it all fits into 8 bytes! In Go, it’s all kept in a single uint64! That’s a lot of useful things to fit into a Postgres bigint* column!

* This is foreshadowing. Bonus points if you’ve already figured out the problem my dumb ass missed right until I started bragging to friends.

Implementation Details

Cutting a UUID in half* actually means making some real sacrifices. Most importantly, we’re working with a lot less entropy, so there is not even a remote promise of global uniqueness. The built-in timestamp helps quite a lot for the intended API use cases, and depending on how you use it, a smolid has 13–20 bits of entropy on top of the timestamp. I’ve taken to calling this “unique-enough,” but it has some real caveats to go over later in this post.

* Remember, a UUID is just 16 bytes. The hex encoding just makes it look really long in string form.

Let me walk you through all 64 bits in our 8-byte ID!

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          time_high                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|    time_low     |ver|t| rand  | type or rand|       rand      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 

A Custom Epoch and 41 Bit Timestamp (Un-Solving The Biggest Solved Problem On Purpose)

One of the best things to happen in the very exciting object identifier space is RFC 9562 which was ratified in the middle of 2024 and introduced UUIDv6 and v7. These are both natively sortable by time, but they do it in a slightly different way…

The RFC has a great section all about timestamp considerations for more recommended reading!

When I started working on smolid like 4 hours ago, one of the first challenges was finding a good balance between useful timestamps, versioning, features, and entropy. I really wanted millisecond precision for the timestamp, you see, but UUIDv7’s 48-bit approach would have eaten up a whole lot of my 64-bit budget. You may have heard of The 2038 Problem. So have I! I’ve even heard it called a “good problem to have.” If a 32-bit system survived long enough to overflow in 2038, it must have had good traction.

If it’s good enough for that system, it must be good enough for me! So allow me to introduce the smolid epoch: Milliseconds from 2025-01-01. With 41 bits to play with, that means it will overflow precisely at 2094-09-07 15:47:35. That’s good enough in my book, but I can imagine that problem (among many other reasons) preventing you from adopting smolid.

On that note, though, remember I mentioned this all fits in a Postgres bigint, with an ominous * character next to it? Well, it turns out the ISO/IEC SQL standard does not define unsigned integers, and indeed PostgreSQL does not support them! So that awesome database index locality I bragged about kinda gets thrown out the window when the most significant bit gets flipped at 2059-11-04 19:53:47.

I’ll just leave that to the PostgreSQL project or ISO/IEC to fix before 2059.
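To see why that flipped bit hurts, here’s a tiny demonstration of what happens when the same 8 bytes are read as signed (the way a Postgres bigint would) instead of unsigned:

```go
package main

import "fmt"

func main() {
	// Two IDs generated in order; the second is the first one
	// minted after the top bit flips (the 2059 moment).
	before := uint64(1)<<63 - 1 // top bit clear
	after := uint64(1) << 63    // top bit set

	// As unsigned values, they still sort correctly...
	fmt.Println(before < after) // true

	// ...but bigint is signed, so the same bytes reinterpreted
	// as int64 sort in the wrong order.
	fmt.Println(int64(before) < int64(after)) // false
}
```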

Only 2 bits for versioning?

It’s a narrow focus that I’m dealing with here, so I feel pretty confident that I’ll never need to extend this beyond a v3. But if I do, I will entirely deserve whatever pain comes to me as a result of this decision.

Embedded Type Identifiers

When I got to thinking about how many bytes to allocate to which features, I got to thinking about a minor annoyance that irks me more than it would a healthy person. Every now and then, someone at work will send a request like, “hey Ben, can you get me the info for this ID?” and they copypaste 9f3ac7ee-e5de-4cdf-b08d-f68cbe7f1d56 like I should know what that is. Is it a user ID? A payment method ID? A node ID? A request ID? Did it even come from inside our network? Often, the requester is unable to give provenance to it beyond, “I saw it in an error log and figured you could help.”

Am I the only one this happens to? (No, really, am I?)

Anyway, I decided to set aside enough bits to be useful without impacting the entropy too much. In v1 it’s set at 7 bits, allowing for 128 distinct types to be embedded into the ID. That means if you use that feature, the ID itself will tell you if it’s a user ID, a device ID, a file ID, or whatever else you have in your system (as long as you have fewer than 128 different types of things to ID, at least).
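Here’s a sketch of how a 7-bit type field like that can be embedded and read back with shifts and masks. The bit positions follow the diagram above (7 type bits sitting above the low 9 random bits), but the helper names and the little type registry are invented for this example; this isn’t the smolid API.

```go
package main

import "fmt"

// A hypothetical registry: each kind of record gets one of the
// 128 available type values.
const (
	TypeUser = iota
	TypePost
	TypeComment
)

var typeNames = map[uint64]string{
	TypeUser:    "user",
	TypePost:    "post",
	TypeComment: "comment",
}

// embedType clears the 7 type bits, writes typ into them, and
// sets the t flag to mark the type as present.
func embedType(id, typ uint64) uint64 {
	return id&^(uint64(0x7F)<<9) | (typ&0x7F)<<9 | 1<<20
}

// typeOf reads the 7 type bits back out.
func typeOf(id uint64) uint64 { return (id >> 9) & 0x7F }

func main() {
	id := embedType(0x0123456789ABCDEF, TypeComment)
	// Now anyone holding the bare ID can answer "what is this?"
	fmt.Println(typeNames[typeOf(id)]) // comment
}
```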

“Unique-Enough” for many use-cases

Emphasis on many. UUID is king if you need your ID collision probability to be expressed in scientific notation. And it’s a sane default to use for that and many other reasons. But let’s look at the math here because numbers are fun! This isn’t a scientific exercise as much as it is an illustration, so let’s paint with broad strokes while recognizing that the devil is in the details.

Since there’s a millisecond-precision timestamp component, there’s at least the promise* that you’ll never risk collision with any ID created more than 0.001 seconds in the past or future (until the year 2094). You can have a hundred million users, making a billion blog posts, and ten billion blog comments per year in your database and average about one new record every millisecond. At this scale, smolid will nominally be able to ensure uniqueness from the timestamp alone.

* Assuming that the clock on the system generating these IDs is trustworthy, which isn’t a given.

But that’s a false sense of security; in real life, traffic happens in fits and bursts. Let’s say a famous political figure makes a shitpost about buying your blog platform and you’re suddenly handling 100,000 new comments every second. That puts you in 100-comments-per-millisecond territory, and now we need to calculate the probability of a collision at that generation rate.

Collision probability is pretty easy to calculate with the birthday-problem approximation, where n is the number of IDs generated in a single millisecond and d is the number of possible random values (2 raised to our entropy bits):

P ≈ 1 - e^(-n(n-1) / 2d)

That gives us the collision probability per millisecond at a given generation rate. We can then feed that result in as probability p in the formula below to find the collision probability of a spike lasting m milliseconds.

P = 1 - (1 - p)^m

So what does that look like without embedded types (our best-case scenario, with 20 bits of entropy), at 100 IDs per millisecond over one second? Here we can fill in the variables and combine the two equations. I’ve let the 1 - (1 - …) cancel, leaving e raised to the exponent, but haven’t done any other simplification, just to maintain continuity with the explanations above.

P = 1 - (e^(-100(100-1) / 2(2^20)))^1000

Yay, a 99.1% chance of collision when we handle a hundred thousand new IDs in one second. That’s no good!

Because I’m having so much fun with the <math> tag here, and because comparisons are useful, let’s see what UUIDv7’s 74 bits of entropy give us. Fortunately, UUIDv7’s timestamp also has millisecond precision, which makes this a really straightforward comparison without needing to change any base assumptions.

P = 1 - (e^(-100(100-1) / 2(2^74)))^1000

Ahhh yes. 2×10⁻¹⁴%. A 0.00000000000002% collision probability after generating a hundred thousand IDs in a second. Zero-point-thirteen-zeroes-then-a-two percent. This is the sacrifice I’m making by cutting just eight bytes per record, because I want my URLs to be pretty. Like I said, a worse UUID for the pettiest reason.

So smolid is useful where your peaks are around a thousand new records per second. If this sounds like you, feel free to play with it!

What’s it look like in practice?

The godoc link may have enough information for you. Overall I’ve tried to follow the example of gofrs/uuid which has a good balance of ergonomics and flexibility. I’ve made sure to provide implementations for the usual important interfaces.

And if you think it makes sense to add more support, throw up a PR or an issue on GitHub!

Here’s a simple Go example that illustrates how smolid can be used with the embedded type ID.

package blog

import (
	"context"

	"github.com/dotvezz/smolid"
	"github.com/jackc/pgx/v5/pgxpool"
)

// Each kind of record in the system gets its own type ID.
const (
	TypeUser = iota
	TypePost
	TypeComment
	// . . .
)

type User struct {
	ID    smolid.ID
	Name  string
	Email string
}

// pool is the application's database connection pool,
// initialized elsewhere.
var pool *pgxpool.Pool

// CreateUser assigns a new user-typed ID and inserts the record.
func CreateUser(ctx context.Context, u User) (User, error) {
	u.ID = smolid.NewWithType(TypeUser)

	query := `insert into users (id, name, email) values ($1, $2, $3)`

	_, err := pool.Exec(ctx, query, u.ID, u.Name, u.Email)

	return u, err
}

Trying to Preemptively Answer Your Questions

Q. Are you okay?

This is a natural reaction, all things considered. Yeah. Yeah, I think I am! Thanks for checking.

Q. Are you really gonna use this?

Heck yeah I am! This solves real (albeit petty, as admitted up-front in the title of this post) problems in my projects. I won’t be forcing it on my colleagues at my day job, but for a lot of my personal projects this will be replacing gofrs/uuid for real.

I’m hoping I don’t need to write a follow-up titled, “Worse UUID was a mistake because of course it was.” Time will tell.

Q. Should I use this?

If it works for you… sure? UUID has a lot of real important advantages that you probably benefit from, maybe without even realizing it. Play with it! If you feel so inspired, maybe use this as an excuse to make your own ID scheme for fun and learning!

Q. Will you make a PostgreSQL extension adding native support for this new type?

Maybe? Probably not.
