New Hellō Identifiers

Why we are moving away from UUIDv4

Updated July 1, 2024 - we will continue to provide UUIDv4 identifiers to our existing customers.

Identifiers are a foundational component of any computing system, and key to an identity service. At Hellō, we have:

  1. external identifiers created by other systems (email address, OpenID Connect sub);

  2. internal identifiers used to manage relationships between objects (internal user identifier, publisher identifier); and

  3. identifiers exposed to other systems (client_id, sub, jti, authorization code).

We of course don't have control over external identifiers. Our requirements for our identifiers are:

  • Not Leaky - we don't want the identifiers to leak information about our internal systems, or about our users. This eliminates identifiers tied to machine identity, and identifiers tied to time or being sequential in nature.

  • Distributed Generation - we want each instance to be able to generate its own identifiers and not have a central service bottleneck. This leads to a random identifier that has a large enough entropy that duplicates are extremely improbable.

  • URL Safe - we want to be able to pass the identifiers around in URLs and HTTP headers without worrying about mismatched encoding, and we want them to be easy to identify when they are part of a URL or HTTP header.

  • Widely Used - we don't want to invent our own random numbers. We want a proven implementation.

The popular choice today, which is what we chose, is UUID v4. This is what one looks like in string format:

a9ab46e7-a526-43e7-9e18-458c76c2f5f4

UUIDs use a base 16 (hex) alphabet, and have 32 hex characters plus 4 dashes for readability. The version is indicated by the digit 4 at the start of the 3rd set. Earlier versions generated values using a combination of machine identity and time. Version 4 is random, which is much simpler to do now with modern processors. With 122 bits of entropy (6 bits are reserved for the version and variant), the odds of generating a duplicate identifier for each person on the planet are 6.04 x 10^-19 (thanks ChatGPT -- note the link uses a UUID v4 identifier!).
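To make the layout concrete, here is a minimal sketch that generates a UUID v4 with Node's built-in `crypto.randomUUID()` and picks out the fixed characters. The variable names are only for illustration. The 6 reserved bits are 4 bits for the version and 2 bits for the variant, which is why the first character of the 4th set is always 8, 9, a, or b.

```javascript
const { randomUUID } = require('node:crypto');

// Generate a random (version 4) UUID using Node's built-in generator.
const id = randomUUID();

// Split on the dashes: the five sets are 8-4-4-4-12 hex characters.
const groups = id.split('-');

// The first character of the 3rd set is the version digit.
const version = groups[2][0];

// The first character of the 4th set carries the 2 variant bits,
// so for a UUID v4 it is one of 8, 9, a, or b.
const variant = groups[3][0];

console.log(id, { length: id.length, version, variant });
```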

We chose to use the standard string formatting for UUIDs, which are 36 characters long. While a binary format would save storage and bandwidth, simpler code wins in the era of cloud computing where Hellō is deployed.

While UUID v4 identifiers met our security, privacy, and entropy requirements, we encountered friction working with them.

UUIDs All Look the Same

While it is a feature that they are indistinguishable from each other, when writing and testing code with a bunch of different identifiers in the same system, we can't tell what an identifier represents. We use DynamoDB and the recommended single table architecture, so we have a mix of record types all in the same index. We added a type property to help, but then we are looking in multiple locations. I had seen some implementations prefixing UUIDs with the identifier type, and appreciated how it simplified development.

We chose a set of 3 character prefixes that indicate at a glance the identifier type. Our initial set:

{
  // Wallet created
  usr: 'Hellō internal user identifier',
  hdi: 'Hellō directed identifier - sub value in ID token',
  jti: 'ID Token jti',
  kid: 'Hellō key identifier in ID Token header',
  ses: 'Hellō session identifier',
  dvc: 'Hellō device cookie identifier',
  inv: 'Hellō invitation identifier',
  pky: 'Hellō passkey identifier',
  pic: 'Hellō picture identifier',
  non: 'Hellō nonce identifier',
  cod: 'Hellō authorization code',
  // Admin created
  pub: 'Hellō publisher identifier',
  app: 'Hellō application identifier (client_id)',
}
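As a rough sketch of how a prefix table like this can be used while debugging, the snippet below looks up the description for an identifier from its first 3 characters. The names `IDENTIFIER_TYPES` and `describeIdentifier` are hypothetical and are not taken from our library, and only a few entries of the table are repeated here.

```javascript
// A subset of the prefix table above, keyed by the 3 character prefix.
const IDENTIFIER_TYPES = {
  usr: 'Hellō internal user identifier',
  ses: 'Hellō session identifier',
  pub: 'Hellō publisher identifier',
  app: 'Hellō application identifier (client_id)',
};

// Hypothetical helper: returns the description for an identifier's
// prefix, or undefined when the prefix is not in the table.
function describeIdentifier(id) {
  const prefix = id.slice(0, 3);
  return Object.hasOwn(IDENTIFIER_TYPES, prefix)
    ? IDENTIFIER_TYPES[prefix]
    : undefined;
}

console.log(describeIdentifier('pub_PDOzPRqBuZjBcrfG9oh4M0oN_3qF'));
```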

Dashes Suck When Selecting Text

A common task is to select an identifier to be copied and then pasted somewhere else. When double clicking on a string, editors, browsers, and terminals will select the word being clicked, and most editors consider a dash to be a word separator, so only part of the identifier is selected (try it out):

'a9ab46e7-a526-43e7-9e18-458c76c2f5f4'

Triple clicking will often increase the selection, but then the quotes are included in the selection, which is often not what is desired:

'a9ab46e7-a526-43e7-9e18-458c76c2f5f4'

This led us on a search for an alternative random identifier generator that met our requirements, and we came across nanoids, which are shorter and allow a custom alphabet. We chose 0-9a-zA-Z as the alphabet, and the underscore as a separator. While base62 is not a nice neat power of 2, using only alphanumeric characters as the custom alphabet reduces the cognitive load when glancing at them.
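To illustrate what a random string over that alphabet looks like, here is a minimal sketch using only Node's built-in `crypto.randomInt()`. It is a stand-in for illustration, not the nanoid library and not our production code, and `randomBase62` is a hypothetical name.

```javascript
const { randomInt } = require('node:crypto');

// The 62 character alphabet described above: 0-9, a-z, A-Z.
const ALPHABET =
  '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

// Hypothetical helper: builds a random string of `length` characters.
// randomInt(max) returns a uniformly distributed integer in [0, max),
// so every character of the alphabet is equally likely.
function randomBase62(length) {
  let out = '';
  for (let i = 0; i < length; i++) {
    out += ALPHABET[randomInt(ALPHABET.length)];
  }
  return out;
}

console.log(randomBase62(22));
```

Because there are no dashes, a double click on the output selects the whole value.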

UUIDs are Hard to Visually Correlate

UUIDs start with an 8 char string, and end with a 12 char string. When looking at a set of identifiers, either in a column or across variables while debugging, it is difficult to identify if an identifier is the same at a glance. Using just the first digit has a 1 in 16 chance of collision. Using the first two digits has a 1 in 256 chance of collision - low enough to cause confusion. Visually selecting more than two digits without a separator was challenging.

As we were now prefixing our identifiers, a short suffix with enough entropy for quick visual correlation appeared a good solution. Two digits of base 62 is 1 in 3,844. Three digits is 1 in 238,328. This leads us to an identifier that looks something like:
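The odds quoted above are just the number of possible values for a given alphabet size and number of digits. A tiny sketch of the arithmetic, with `combinations` as a name chosen for illustration:

```javascript
// Number of distinct values for `digits` characters drawn from an
// alphabet of `base` symbols. Two random values match with odds of
// 1 in this number.
function combinations(base, digits) {
  return base ** digits;
}

console.log(combinations(16, 1)); // 16     (one hex digit)
console.log(combinations(16, 2)); // 256    (two hex digits)
console.log(combinations(62, 2)); // 3844   (two base62 digits)
console.log(combinations(62, 3)); // 238328 (three base62 digits)
```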

typ_0123456789abcdefABCDEF_xyz

This jumps out as one of our identifiers, we know at a glance the type, and the suffix is easy to use to match another variable value across contexts.

Other Design Factors

Identifier Validation: Being able to programmatically test if an identifier is valid can be useful in detecting errors in a system. Having a consistent length, format, and character set can detect if something is not an identifier. The type prefix helps detect if identifiers got mixed up. Why not add in a checksum as well?

Identifier Length: A nanoid using a 62 character alphabet only needs to be 22 characters long to exceed the entropy of a UUID. While we were ok with making the computer work harder with a 62 character alphabet instead of 64, we wanted an easy to remember length, and 32 is easy for programmers to remember. After a 4 character prefix (the 3 character type and its underscore) and the underscore separating the suffix, that left us 27 characters. A 24 character random string is also easy to remember, which leaves a 3 character suffix that we can use as a checksum.
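The numbers above can be checked with a few lines of arithmetic. This is only a sketch: `entropyBits` and the constant names are made up for illustration, and the entropy figure assumes every character is chosen uniformly at random.

```javascript
// Bits of entropy in `length` characters drawn uniformly at random
// from an alphabet of `alphabetSize` symbols.
function entropyBits(alphabetSize, length) {
  return length * Math.log2(alphabetSize);
}

const UUID_V4_BITS = 122;

console.log(entropyBits(62, 22).toFixed(1)); // 131.0
console.log(entropyBits(62, 24).toFixed(1)); // 142.9
console.log(entropyBits(62, 22) > UUID_V4_BITS); // true

// Length budget for a 32 character identifier.
const TOTAL_LENGTH = 32;
const PREFIX_LENGTH = 4; // 3 character type plus an underscore
const SEPARATOR_LENGTH = 1; // underscore before the suffix
const RANDOM_LENGTH = 24;

const remaining = TOTAL_LENGTH - PREFIX_LENGTH - SEPARATOR_LENGTH;
const suffixLength = remaining - RANDOM_LENGTH;

console.log(remaining, suffixLength); // 27 3
```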

New Hellō Identifiers

Here is an example of a column of our new identifiers:

app_JbkuwjnRPIxuerq765q4IOXO_rc2
sub_To8aelKK5rOpeLesEJA0VawX_TW7
app_Cd5iWmdENXTYqJw6o07FuRKn_pUM
pub_PDOzPRqBuZjBcrfG9oh4M0oN_3qF
app_Zpa1TgesIRna5nDKtWMp11cV_jlH
sub_76t2ITgp6wRMBcyHhgUOM2pQ_v7A
app_wcmPSIaiPuLtCa8Yp0Iwhwfm_IAC
pub_PDOzPRqBuZjBcrfG9oh4M0oN_3qF
app_FAZ9eZ8NgtauhQp5bnXXE1W1_oi3

We know these are our identifiers at a glance. We can see they are the app, sub and pub types. We can quickly find the two pub identifiers, and see they (likely) have the same value with the 3qF suffix. We are finding it useful, and hope that they also help our customers when working with our identifiers. And a double click on the value selects the whole value!
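The same checks we do by eye can be sketched in code. The pattern below is an assumption inferred from the examples above (3 lowercase letters, 24 base62 characters, 3 base62 characters, separated by underscores), and `parseIdentifier` is a hypothetical helper rather than the API of our library. It only checks the shape of the string and does not verify the checksum, since the checksum calculation is not covered in this post.

```javascript
// Assumed layout, inferred from the examples above.
const IDENTIFIER_PATTERN =
  /^([a-z]{3})_([0-9a-zA-Z]{24})_([0-9a-zA-Z]{3})$/;

// Hypothetical helper: returns the parts of an identifier, or null if
// the string does not have the expected shape. The checksum itself is
// NOT verified here.
function parseIdentifier(id) {
  const match = IDENTIFIER_PATTERN.exec(id);
  if (!match) return null;
  const [, type, random, suffix] = match;
  return { type, random, suffix };
}

// A few rows from the column above.
const column = [
  'app_JbkuwjnRPIxuerq765q4IOXO_rc2',
  'pub_PDOzPRqBuZjBcrfG9oh4M0oN_3qF',
  'sub_76t2ITgp6wRMBcyHhgUOM2pQ_v7A',
  'pub_PDOzPRqBuZjBcrfG9oh4M0oN_3qF',
];

// Group the rows by suffix to find likely repeats.
const bySuffix = new Map();
for (const id of column) {
  const parsed = parseIdentifier(id);
  if (!parsed) continue;
  const group = bySuffix.get(parsed.suffix) ?? [];
  group.push(id);
  bySuffix.set(parsed.suffix, group);
}

console.log(bySuffix.get('3qF')); // the two pub rows
```

An old style UUID such as a9ab46e7-a526-43e7-9e18-458c76c2f5f4 does not match the pattern, so `parseIdentifier` returns null for it.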

Deployment Considerations

All of our identifiers are opaque strings as far as the logic is concerned. The only logic requirement is that they are a string and are unique in the system. This allows us to change the identifier format without any impact on the logic, or our customers.

While we already have UUID identifiers being used in production by our customers, our objective with the new format is to help ourselves when doing development and testing.

How about the change in performance? I did a little test of 1M generations:

  • 45 ms for UUIDv4

  • 160 ms for a standard 21 char nanoid

  • 360 ms for a 27 char nanoid with custom alphabet

  • 470 ms for a 24 char nanoid and 3 digit checksum

I was surprised how much slower nanoid was compared to UUID, but then again, UUID generation is now native code in Node. The custom alphabet and checksum tripled the time over the default nanoid, a 10X slowdown compared to UUID. Worse than I had hoped, but the impact on user experience is negligible as each one takes about half a microsecond.
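For anyone who wants to repeat this kind of measurement, here is a minimal timing sketch. It is not the script I used: `timeGenerations` is a hypothetical helper, it only times Node's built-in `crypto.randomUUID()`, and the numbers will vary by machine and Node version.

```javascript
const { randomUUID } = require('node:crypto');

// Hypothetical helper: calls `generate` `count` times and returns the
// elapsed wall clock time in milliseconds.
function timeGenerations(generate, count) {
  const start = process.hrtime.bigint();
  for (let i = 0; i < count; i++) generate();
  const elapsedNs = process.hrtime.bigint() - start;
  return Number(elapsedNs) / 1e6;
}

const count = 1_000_000;
const ms = timeGenerations(randomUUID, count);

console.log(`${count} UUIDv4 generations: ${ms.toFixed(0)} ms`);
// 1 ms is 1e6 ns, so divide the total nanoseconds by the count.
console.log(`per identifier: ${((ms * 1e6) / count).toFixed(0)} ns`);
```

Any other generator can be passed in place of `randomUUID` to compare it on the same machine.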

Even though we don't expect there to be any customer impact, we will hold off deployment until we have informed our customers and received any feedback.

We have made our library open source and available at:

https://github.com/hellocoop/packages

and available as an npm package as @hellocoop/identifiers.

The build process is somewhat of a hack, but that is another story about wanting to support both CommonJS and ECMAScript modules, and have a single source of truth for our list of identifier types.

Updated July 1, 2024 - we will continue to provide UUIDv4 identifiers to our existing customers.

A few of our customers' implementations take advantage of the sub in the ID Token being a UUID and have configured their DB schema accordingly. Our data model has directed users (the sub in the ID Token) belonging to publishers, and applications (client_id for developer, and aud in the ID Token) belonging to publishers.

Any existing publishers (that have a UUID identifier), or any new ones created that check a box to use UUIDs, will continue to use UUIDs and won't receive the new identifiers. Any publisher that has the new format will have all their user and application identifiers in the new format, providing consistency in the identifier format.