Design a system to generate a Unique ID
A database table with more than one rows needs a unique primary key for each row. Since data are partitioned across different shards therefore we need to generate a globally unique ID for the primary key.
Following are a few ideas to generate the primary key.
Random Id
We can generate a fixed length of a random number and use it as a primary key. However, due to high collision, this is not all a great fit for UUID.
import random
# Generate a random integer between a specified range
random_number = random.randint(0, 10**20) # Adjust the range as needed
print("Random Number:", random_number)
Random Number: 73245678901234567890
The secrets
module is recommended for generating secure random numbers, especially for cryptographic purposes.
import secrets
# Generate a random integer with a specified number of bits (e.g., 128 bits)
random_number = secrets.randbits(128) # 128 bits = 16 bytes
print("Secure Random Number:", random_number)
Secure Random Number: 30973298740912837409128374091283740912
Hashing
We can take input for each row and create a hash out of it. While hashing mechanisms are not the primary way to generate UUIDs, they can be used in conjunction with other data to create unique identifiers.
1. MD5 Hashing
MD5 is a widely used cryptographic hash function that produces a 128-bit hash value.
import hashlib
import uuid
def generate_uuid_from_md5(input_data):
# Generate MD5 hash
md5_hash = hashlib.md5(input_data.encode('utf-8')).hexdigest()
# Convert MD5 hash to UUID
return uuid.UUID(hex=md5_hash)
# Example usage
input_data = "example_data"
uuid_md5 = generate_uuid_from_md5(input_data)
print(f"UUID from MD5: {uuid_md5}")
UUID from MD5: 1f3870be-274f-67e0-9a3d-4e80a5a6b7b1
2. SHA-1 Hashing
SHA-1 is another cryptographic hash function that produces a 160-bit hash value.
import hashlib
import uuid
def generate_uuid_from_sha1(input_data):
# Generate SHA-1 hash
sha1_hash = hashlib.sha1(input_data.encode('utf-8')).hexdigest()
# Convert SHA-1 hash to UUID (using only the first 32 characters)
return uuid.UUID(hex=sha1_hash[:32])
# Example usage
input_data = "example_data"
uuid_sha1 = generate_uuid_from_sha1(input_data)
print(f"UUID from SHA-1: {uuid_sha1}")
UUID from SHA-1: 5a105e8b-9d40-5f1d-9a36-9a8c9a3d9f8b
3. SHA-256 Hashing
SHA-256 is a more secure cryptographic hash function that produces a 256-bit hash value. It can be used to generate a UUID by truncating the hash.
import hashlib
import uuid
def generate_uuid_from_sha256(input_data):
# Generate SHA-256 hash
sha256_hash = hashlib.sha256(input_data.encode('utf-8')).hexdigest()
# Convert SHA-256 hash to UUID (using only the first 32 characters)
return uuid.UUID(hex=sha256_hash[:32])
# Example usage
input_data = "example_data"
uuid_sha256 = generate_uuid_from_sha256(input_data)
print(f"UUID from SHA-256: {uuid_sha256}")
UUID from SHA-256: 5e884898-da28-8b28-8b28-8b28da280000
Since hash functions generate long id therefore generally not suitable for UUID. Though it reduces collision it's not 100% collision-free therefore not a recommended UUID. This is sometimes used if there is a need for deterministic UUID.
UUIDv1(Timestamp + Random number + host)
A UUIDv1 is a 128-bit value, typically represented as a 36-character string in the format:
xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx
Following is one example for reference.
f81d4fae-7dec-11d0-a765-00a0c91e6bf6
UUIDv1 is designed as below.
+-----------------------------+-----------------------------+
| Time Low (32) | Time Mid (16) |
+-----------------------------+-----------------------------+
| Time Hi & Version (16) | Clock Seq (14) | Variant (2)|
+-----------------------------+-----------------------------+
| Node (48 bits) |
+-----------------------------------------------------------+
- Time Low (32 bits):
- Represents the lower 32 bits of the 60-bit timestamp.
- Example:
0xf81d4fae
2. Time Mid (16 bits):
- Represents the middle 16 bits of the 60-bit timestamp.
- Example:
0x7dec
3. Time Hi & Version (16 bits):
- The upper 12 bits represent the high part of the timestamp.
- The lower 4 bits are reserved for the version number (
1
for UUIDv1). - Example:
0x11d0
(where1
in the 4th nibble indicates version 1).
4. Clock Sequence (14 bits):
- A random or incrementing value to handle clock adjustments or duplicates.
- Example:
0xa765
5. Variant (2 bits):
- Indicates the UUID variant. For UUIDv1, this is
10
in binary. - Example:
0x8
(binary10
in the top 2 bits).
6. Node (48 bits):
- Typically the MAC address of the host machine.
- If a MAC address is unavailable, a random or pseudo-random value is used.
- Example:
0x00a0c91e6bf6
Following is python code to generate UUIDV1.
import uuid
import time
import random
def generate_uuidv1():
# Get the current time in 100-nanosecond intervals since October 15, 1582
nanoseconds = time.time_ns()
timestamp = (nanoseconds // 100) + 0x01B21DD213814000 # Offset for UUID epoch
# Split the timestamp into low, mid, and high parts
time_low = timestamp & 0xFFFFFFFF
time_mid = (timestamp >> 32) & 0xFFFF
time_hi_and_version = ((timestamp >> 48) & 0x0FFF) | 0x1000 # Set version to 1
# Generate a random clock sequence (14 bits)
clock_seq = random.randint(0, 0x3FFF)
# Generate a node identifier (48 bits), typically the MAC address
# For simplicity, we'll use a random value here
node = random.randint(0, 0xFFFFFFFFFFFF)
# Combine all parts into a UUID
uuid_bytes = (
(time_low << 96) |
(time_mid << 80) |
(time_hi_and_version << 64) |
(clock_seq << 48) |
node
)
# Convert to a UUID object
return uuid.UUID(int=uuid_bytes)
# Generate a UUIDv1
uuidv1 = generate_uuidv1()
print(uuidv1)
Pros of UUIDV1
- Uniqueness Across Time and Space: UUIDv1 is designed to be unique across different machines (due to the node identifier) and across time (due to the timestamp). This makes it highly reliable for distributed systems.
- Time-Ordered: The inclusion of a timestamp ensures that UUIDv1s are generated in a time-ordered manner. This can be useful for sorting or indexing records by creation time.
- Predictable Structure: The structure of UUIDv1 is well-defined (timestamp, clock sequence, node identifier), making it easier to debug or analyze.
- No Centralized Authority: UUIDv1 does not require a centralized authority to generate unique IDs, making it suitable for distributed systems.
- Low Collision Probability: The combination of timestamp, clock sequence, and node identifier ensures an extremely low probability of collisions (duplicate UUIDs).
- Wide Support: UUIDv1 is supported by most programming languages and libraries, making it easy to implement and use.
Cons of UUIDv1
- Privacy Concerns: The node identifier in UUIDv1 is typically derived from the MAC address of the machine generating the UUID. This can expose hardware information, raising privacy and security concerns.
- Predictability: Since UUIDv1 is based on a timestamp and MAC address, it is somewhat predictable. This can be a security risk in scenarios where unpredictability is required.
- Clock Dependency:UUIDv1 relies on the system clock. If the clock is adjusted backward (e.g., due to NTP synchronization or manual changes), it can lead to duplicates unless the clock sequence is properly managed.
- Limited Uniqueness in High-Frequency Scenarios:In systems where UUIDs are generated at an extremely high frequency (e.g., millions per second), the 14-bit clock sequence may not be sufficient to guarantee uniqueness.
- Larger Storage Requirements: UUIDs are 128 bits (16 bytes) long, which is larger than simpler integer-based IDs. This can increase storage and indexing overhead in databases.
- No Semantic Meaning: Unlike some other UUID versions (e.g., UUIDv3 or UUIDv5), UUIDv1 does not embed any semantic information (e.g., namespace or hash-based content).
- MAC Address Dependency: If the MAC address is not available (e.g., in virtualized environments), the node identifier may need to be randomly generated, which reduces the uniqueness guarantee.
Other UUID Versions
There are total 8 versions available to handle the weakness of v1. Following are details on these.
UUID Generation service using database auto-sequence
We can leverage the database-supported auto sequencing to generate the unique identifiers. These identifiers can then be combined with other data or transformed into UUIDs. Following are high level steps.
- Database Sequence Creation: Create a sequence in your database that will generate unique numbers.
- Service Layer: Implement a service layer that fetches the next value from the sequence and formats it into a UUID.
- UUID Formatting: Combine the sequence number with other data (like a timestamp or machine ID) to create a UUID.
1. Create a Sequence in PostgreSQL
CREATE SEQUENCE uuid_sequence START 1;
2. Python Service to Generate UUID
# Fetch the next value from the sequence
cursor.execute("SELECT nextval('uuid_sequence')")
sequence_value = cursor.fetchone()[0]
# Combine the sequence value with a timestamp and a namespace to create a UUID
namespace = uuid.NAMESPACE_DNS
name = f"example.com/{sequence_value}"
generated_uuid = uuid.uuid5(namespace, name)
Use two databases to improve fault tolerant
A single database is not fault-tolerant therefore we can have two standalone databases to improve it.
By distributing the sequence generation across two databases, you can ensure that if one database fails, the other can continue to generate unique IDs. This approach also helps in load balancing and scalability.
Following are key points.
- The first database will generate the odd sequence.
- The second will generate the even sequence.
- The load balancer will distribute traffic to these two databases.
- If one database is down then the load balancer will send traffic to the active database.
- We can also use a replica set for each odd/even to improve the fault tolerance.
CREATE SEQUENCE even_sequence START 2 INCREMENT BY 2; #DB1
CREATE SEQUENCE odd_sequence START 1 INCREMENT BY 2; #DB2
To increase the fault tolerance we can extend this design to use n
databases.
Cons
- Managing two databases increases operational complexity.
- If both databases are unavailable, the service cannot generate UUIDs.
UUID Generation service using pre-generated blocks
Local counter
Every app server can maintain a local counter=0
and simply keep incrementing it to generate the ID. But this has a problem as multiple App servers will generate the same ID as everyone starts with counter=0
.
Blocks of counter powered by database auto-sequencing
- Every time the app server restarts will ask the database to give the new UUID block.
- The app server will simply keep incrementing it to generate the UUID.
- If the app server exhausts the ID block then it again asks the database to return the next block.
Key Design Points
- Pre-Generated Blocks:
- Generate a block of unique IDs (e.g., 1000 IDs at a time) and store them in memory or a fast-access storage system.
- Use a database or distributed system to ensure uniqueness across multiple instances of the service.
2. Block Management:
- When a block is exhausted, generate a new block.
- Use a locking mechanism to prevent race conditions when multiple instances try to generate a new block.
3. Fault Tolerance:
- If the primary storage (e.g., database) is unavailable, use a fallback mechanism to generate IDs locally.
4. Performance:
- Retrieve IDs from memory or fast storage to minimize latency.
Problem with sequential UUID as Primary key
If we generate a monotonic increasing integer as the primary key then the row will always be inserted at the end of the same shard. This will create a hotspot to handle high write volume. Therefore it is not a good idea to use monotonic increasing integer as a primary key.
Approach #1: Hash the ID and use key=hash+id
A common technique to spread the load across the multiple shards is to hash the actual unique key and use primary key = hash + unique id
.
Pros:
- Helps to spread the rows across the shards to avoid hotspots on high volume.
Cons:
- Can’t do efficient range reads of the original unique key.
Approach #2: Hash the ID and use two columns HASH and ID as composite primary key
We can write hash and ID as separate columns and use these two columns as a composite primary key order as (hash, ID)
.
For example, to spread writes for different users to the timestamp-ordered UserLogs
the table across 100 shards, we can use a composite primary key as <ShardId, Timestamp, UserId>
. Here ShardId
is acting as a hash.
Note: Based on the database solution ShardId
can be automatically created by the Database as per the available shards therefore it might not be known to you at the time of read.
Now we can run the query as below if you know the hash algorithm.
SELECT *
FROM UserLogs
WHERE Timestamp > X AND ShardId = MOD(UserId, 100);
If not then you can run blindly on all the shards as below.
SELECT *
FROM UserLogs U, (GENERATE_ARRAY(0,99) S
WHERE Timestamp > X AND ShardId = S.VALUE
Approach #3: Bit reversed sequence
The following are steps.
- Generate monotonic increasing ID, let’s call it X (preferably odd sequence)
- Reverse the bit and get Y (one option is to reverse the order of the bits in the 2’s complement binary representation of X).
- Use Y as a primary key.
- User-visible IDs (which we want to increase in value) can be generated from the same bit-reversed sequence by bit-reversing the primary key again.
Following are examples using 4 bits to show how bit reversal helps to break monotonicity which helps to shard data.
Sequence Bitreversal
0001 (1) 1000 (8)
0010 (2) 0100 (4)
0011 (3) 1100 (12)
0100 (4) 0010 (2)
0101 (5) 1010 (10)
0110 (6) 0110 (6)
0111 (7) 1110 (14)
1000 (8) 0001 (1)
Primary Key using Commit timestamp
If the database provides the commit timestamp after a transaction is committed then we might want to use the CommitTimestamp as a primary key.
Since the timestamp is monotonic increasing the number therefore it will lead to a hotspot on high load.
We can solve this problem as similar to what we discussed above. Following is a summary.
Approach # 1: Hash the key and spread writes among N Shards
To divide the work among N servers, create N shards and maintain shard details in a separate table.
ShardId (PrimaryKey=<ShardId>)
1
2
3
.
.
N
Generate ShardId using hash(userId)%N
or any other suitable data. Then create UserLogs
as below.
ShardId Timestamp UserId PrimaryKey = <ShardId, Timestamp>
1 12345 1
2 12346 1
3 12347 1
1 12348 2
3 12349 2
4 12310 1
4 12311 2
Approach # 2: Swap the order for timestamp with leading key with high cardinality
Let’s say if we know that UserId has high cardinality then we can swap the primary key as <UserId, Timestamp>
and drop the ShardId
.
UserId Timestamp (Primary Key = <UserId, Timestamp>)
1 12350
1 12351
1 12352
2 12310
2 12311
2 12312
1 12353
Now the insertion to shard is based on UserId, not the timestamp. Now writing will spread among N
shards.
If the leading Primary key is low cardinality then it will lead to a hotspot on high load.
Another disadvantage of this approach over #1 is that it’s harder to read a sequence of records in order of commit timestamp. For example, find all the records that were added since the time T
.
Happy learning :-)