
Generate Identifiers using UUID type7 with the capture timestamp as the 48 most significant bits #50

Open: lfoppiano wants to merge 10 commits into cc from features/use-uuid-type7


@lfoppiano


This PR introduces type 7 UUIDs (UUIDv7), using the CC0-licensed implementation from https://github.com/belief-driven-design/blog-uuidv7/.

@lfoppiano

Author

I have a few questions:

  1. I've overloaded getRecordId() with getRecordId(long timestamp), which derives the UUID from the capture-time timestamp, as requested. What should we do with getRecordId()? Should I mark it as @deprecated? Its use will guarantee both uniqueness and monotonicity, but mixing calls to the two methods will result in a mess.
  2. UUIDv7.fromTimestamp(timestamp) calls the underlying library using sequence = 0, so the ordering depends only on the timestamp. As far as I understand, the timestamp will be the same for the four records, so their ordering won't be guaranteed. One option is to add a sequence component that provides ordering within the record sequence.

@sebastian-nagel


> Should I mark it as @deprecated?

Yes. Also deprecate getUUID()?

@sebastian-nagel


> using sequence = 0

We should use a random value, not a sequence or a sub-millisecond clock. RFC 9562 allows this and even mentions it as the first option. That's also how the Java 26 type 7 UUID is created, cf. implementation / PR.

The timestamp has millisecond precision. Collisions on the timestamp are likely, given:

  • 3 billion fetched pages per crawl
  • 100 segments, i.e. 30 million pages per segment
  • 3h fetch time per segment = 10.8 million milliseconds
  • about 3 pages fetched per millisecond
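
The arithmetic behind this estimate can be reproduced with a short sketch. The class and method names are hypothetical and exist only for illustration; the input figures are the ones listed above.

```java
import java.util.Locale;

/** Hypothetical helper reproducing the back-of-the-envelope estimate from the list above. */
final class TimestampCollisionEstimate {

    /** Average number of pages fetched per millisecond within one segment. */
    static double pagesPerMillisecond(long pagesPerCrawl, int segments, double fetchHoursPerSegment) {
        double pagesPerSegment = (double) pagesPerCrawl / segments;    // 3e9 / 100 = 30 million
        double millisPerSegment = fetchHoursPerSegment * 3600 * 1000;  // 3 h = 10.8 million ms
        return pagesPerSegment / millisPerSegment;
    }

    public static void main(String[] args) {
        double rate = pagesPerMillisecond(3_000_000_000L, 100, 3.0);
        // Prints: 2.78 pages per millisecond
        System.out.println(String.format(Locale.ROOT, "%.2f pages per millisecond", rate));
    }
}
```

With an average of almost three pages per millisecond, many records necessarily share the same 48-bit timestamp, so uniqueness has to come from the remaining bits.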

Because it's a distributed crawler, there is no way to get a reliable sequence without collisions.
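
To illustrate the random-only approach, here is a minimal sketch. It is not the code from this PR nor the JDK implementation. It assumes the RFC 9562 field layout (48-bit Unix timestamp in milliseconds, 4-bit version, 12 random bits, 2-bit variant, 62 random bits). The class name `UuidV7Sketch` is hypothetical, and the method name only mirrors the `fromTimestamp` mentioned earlier in this thread.

```java
import java.security.SecureRandom;
import java.util.UUID;

/** Hypothetical sketch of a UUIDv7 built from a given timestamp plus random bits only. */
final class UuidV7Sketch {

    private static final SecureRandom RANDOM = new SecureRandom();

    private UuidV7Sketch() {}

    /** Builds a version 7 UUID from a Unix timestamp in milliseconds and 74 random bits. */
    static UUID fromTimestamp(long epochMillis) {
        if (epochMillis < 0 || epochMillis > 0xFFFFFFFFFFFFL) {
            throw new IllegalArgumentException("timestamp does not fit into 48 bits");
        }
        long randA = RANDOM.nextLong() & 0x0FFFL;              // 12 random bits
        long randB = RANDOM.nextLong() & 0x3FFFFFFFFFFFFFFFL;  // 62 random bits

        // Most significant 64 bits: 48-bit timestamp, version nibble 7, 12 random bits.
        long msb = (epochMillis << 16) | (0x7L << 12) | randA;
        // Least significant 64 bits: variant bits "10", then 62 random bits.
        long lsb = 0x8000000000000000L | randB;
        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {
        UUID id = fromTimestamp(System.currentTimeMillis());
        System.out.println("urn:uuid:" + id);
    }
}
```

In this sketch, IDs created for the same millisecond sort in arbitrary order relative to each other; only the timestamp prefix is ordered.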

@lfoppiano

Author

@sebastian-nagel sure, I can add an additional random component. However, how about the sorting/ordering of the generated identifiers?

Am I understanding correctly that within the same record fetch we will write the warc components in order so they might be naturally sorted?

@sebastian-nagel

sebastian-nagel commented Apr 2, 2026


> Am I understanding correctly that within the same record fetch we will write the warc components in order so they might be naturally sorted?

Of course, we could use 2 bits (maybe reserve 4 bits for that purpose) for a sequence number (0: request, 1: response, 2: metadata). However, the records are already linked via WARC-Concurrent-To: the response is linked to the request, the metadata to the response. So there is no real benefit. It would also be kind of a hidden feature.

And later on, when we switch to the Java UUID implementation in Java 26 (or a backport of it), the hidden feature would disappear. So I'd prefer to use only random bits.

@lfoppiano marked this pull request as ready for review April 2, 2026 12:05
@lfoppiano

Author

@sebastian-nagel I've implemented the random component. I believe we can start iterating on the review. Meanwhile, I'll keep working on adding more unit tests; so far I've added only one, which tests uniqueness and sorting. I'm not sure the unpacking of the records is actually done correctly. To check, I need to run Nutch and generate a new segment, but this takes more time.

@lfoppiano changed the title from "Features/use UUID type7" to "Generate Identifiers using UUID type7 with the capture timestamp as the 48 most significant bits" on Apr 2, 2026
@lfoppiano linked an issue on Apr 2, 2026 that may be closed by this pull request

@sebastian-nagel left a comment


Hi @lfoppiano, thanks! Looks good.

Tested the code using Fetcher (few pages) and WarcExport (37k pages):

  • compared WARC-Date and timestamp from UUID:
    paste <(echo WARC-Date; fastwarc index -fwarc-date warc/*/*.warc.gz | jq -r '."warc-date"' | tr T " ") <(fastwarc index -fwarc-record-id warc/*/*.warc.gz | jq -r '."warc-record-id"' | cut -c11-46 | TZ=UTC uuidparse) | cut -c1-19,75-97 | perl -a -lne 'print if $F[0] ne $F[2] || $F[1] ne $F[3]'
    
    • all timestamps are the same
    • including that of the warcinfo record (timestamp indicates start of fetching)
  • verified that all UUIDs are unique
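
The same two checks can also be sketched in Java. This is only an illustration under stated assumptions, not the test code from this PR: the class `RecordIdCheck` and its methods are hypothetical, record IDs are assumed to have the form `<urn:uuid:...>`, and the comparison is done at second precision, as in the shell pipeline above. The sample record ID in `main` is constructed for the example and is not real crawl data.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

/** Hypothetical checks: UUIDv7 timestamp vs. WARC-Date, and uniqueness of record IDs. */
final class RecordIdCheck {

    /** Parses a WARC-Record-ID such as "<urn:uuid:...>" into a UUID. */
    static UUID parseRecordId(String recordId) {
        String s = recordId.trim();
        if (s.startsWith("<") && s.endsWith(">")) {
            s = s.substring(1, s.length() - 1);
        }
        String prefix = "urn:uuid:";
        if (!s.startsWith(prefix)) {
            throw new IllegalArgumentException("not a urn:uuid record ID: " + recordId);
        }
        return UUID.fromString(s.substring(prefix.length()));
    }

    /** Extracts the 48-bit millisecond timestamp stored in the top bits of a version 7 UUID. */
    static Instant embeddedTimestamp(UUID uuid) {
        if (uuid.version() != 7) {
            throw new IllegalArgumentException("not a version 7 UUID: " + uuid);
        }
        return Instant.ofEpochMilli(uuid.getMostSignificantBits() >>> 16);
    }

    /** Compares the UUID timestamp with the WARC-Date header at second precision. */
    static boolean matchesWarcDate(String recordId, String warcDate) {
        Instant fromUuid = embeddedTimestamp(parseRecordId(recordId)).truncatedTo(ChronoUnit.SECONDS);
        Instant fromHeader = Instant.parse(warcDate).truncatedTo(ChronoUnit.SECONDS);
        return fromUuid.equals(fromHeader);
    }

    /** Returns true if no record ID occurs more than once. */
    static boolean allUnique(List<String> recordIds) {
        Set<String> seen = new HashSet<>();
        for (String id : recordIds) {
            if (!seen.add(id)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up sample: a version 7 UUID whose timestamp is 2026-04-02T12:05:00.123Z.
        long millis = Instant.parse("2026-04-02T12:05:00Z").toEpochMilli() + 123;
        UUID sample = new UUID((millis << 16) | 0x7000L | 0x0ABCL, 0x8000000000000000L | 42L);
        String recordId = "<urn:uuid:" + sample + ">";

        System.out.println(matchesWarcDate(recordId, "2026-04-02T12:05:00Z")); // true
        System.out.println(allUnique(List.of(recordId, recordId)));            // false
    }
}
```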

@lfoppiano

lfoppiano commented Apr 2, 2026

Author

Great! We'll merge this later on, together with the rest of https://github.com/commoncrawl/issues/issues/630


Development

Successfully merging this pull request may close these issues.

WARC writer: consider usage of Type 7 UUIDs as WARC-Record-ID
