Generate Identifiers using UUID type7 with the capture timestamp as the 48 most significant bits by lfoppiano · Pull Request #50 · commoncrawl/nutch

lfoppiano · 2026-04-02T08:47:35Z

This PR introduces type 7 UUIDs using the CC0-licensed implementation from https://github.com/belief-driven-design/blog-uuidv7/.

lfoppiano · 2026-04-02T08:54:14Z

I have a few questions:

I've overloaded getRecordId() with getRecordId(long timestamp), which obtains the UUID based on the timestamp taken from the capture time, as requested. What should we do with getRecordId()? Should I mark it as @deprecated? Using a single method consistently will guarantee both uniqueness and monotonicity, but mixing calls to the two methods will result in a mess.
UUIDv7.fromTimestamp(timestamp) will call the underlying library using sequence = 0, so the ordering will depend only on the timestamp. As far as I understand, the timestamp will be the same for the four records, so their ordering won't be guaranteed. One option is to add a sequence component that provides ordering within the record sequence.

sebastian-nagel · 2026-04-02T09:30:57Z

Should I mark it as @deprecated?

Yes. Also deprecate getUUID()?

sebastian-nagel · 2026-04-02T10:18:05Z

using sequence = 0

We should use a random value, not a sequence or a sub-millisecond clock. RFC 9562 allows this and even mentions it as the first option. That's also how the Java 26 type 7 UUID is created, cf. implementation / PR.
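The "timestamp plus random bits" layout described here can be sketched as follows. This is only an illustration under stated assumptions: the class name `UuidV7Sketch`, the method name, and the choice of `SecureRandom` are hypothetical and are neither the code of this PR nor the API of the blog-uuidv7 library. The bit layout follows the version 7 format of RFC 9562 (48-bit millisecond timestamp, 4-bit version, 12 random bits, 2-bit variant, 62 random bits).

```java
import java.security.SecureRandom;
import java.util.UUID;

// Hypothetical sketch, not the implementation used in this PR.
class UuidV7Sketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Builds a version 7 UUID from a capture time given in milliseconds since the Unix epoch. */
    static UUID fromTimestamp(long epochMillis) {
        if (epochMillis < 0 || epochMillis > 0xFFFFFFFFFFFFL) {
            throw new IllegalArgumentException("timestamp does not fit into 48 bits: " + epochMillis);
        }
        long msb = (epochMillis << 16)              // timestamp in the 48 most significant bits
                | 0x7000L                           // version 7
                | (RANDOM.nextLong() & 0x0FFFL);    // 12 random bits
        long lsb = (RANDOM.nextLong() & 0x3FFFFFFFFFFFFFFFL) // 62 random bits
                | 0x8000000000000000L;              // variant bits "10"
        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {
        UUID id = fromTimestamp(System.currentTimeMillis());
        System.out.println("<urn:uuid:" + id + ">");
    }
}
```

With 74 random bits per identifier, records captured in the same millisecond share only the timestamp prefix, so no coordination between fetcher tasks is needed.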

The timestamp has millisecond precision. Collisions on the timestamp are likely, given

3 billion fetched pages per crawl
100 segments, i.e. 30 million pages per segment
3h fetch time per segment = 10.8 million milliseconds
about 3 pages fetched per millisecond

Because it's a distributed crawler there is no way to get a reliable sequence without collisions.
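The estimate above can be reproduced with a few lines of arithmetic. The sketch below only restates the numbers from this comment; the class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper that recomputes the pages-per-millisecond estimate.
class TimestampCollisionEstimate {
    static double pagesPerMillisecond(long pagesPerCrawl, int segments, double fetchHoursPerSegment) {
        double pagesPerSegment = (double) pagesPerCrawl / segments;   // 30 million
        double millisPerSegment = fetchHoursPerSegment * 3600 * 1000; // 10.8 million
        return pagesPerSegment / millisPerSegment;
    }

    public static void main(String[] args) {
        double rate = pagesPerMillisecond(3_000_000_000L, 100, 3.0);
        // prints "2.78 pages per millisecond", i.e. about 3
        System.out.println(String.format(Locale.ROOT, "%.2f pages per millisecond", rate));
    }
}
```

Each page yields several WARC records with the same capture time, so the number of identifiers sharing a millisecond timestamp is higher still.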

lfoppiano · 2026-04-02T11:28:52Z

@sebastian-nagel sure, I can add an additional random component. However, what about the sorting/ordering of the generated identifiers?

Am I understanding correctly that within the same record fetch we will write the warc components in order so they might be naturally sorted?

sebastian-nagel · 2026-04-02T11:41:11Z

Am I understanding correctly that within the same record fetch we will write the warc components in order so they might be naturally sorted?

Of course, we could use 2 bits (maybe reserve 4 bits for that purpose) for a sequence number (0: request, 1: response, 2: metadata). However, the records are already linked via WARC-Concurrent-To: the response is linked to the request, and the metadata to the response. So there is no real benefit. It would also be a kind of hidden feature.

And later on, when we switch to the Java UUID implementation in Java 26, or if it's backported, the hidden feature would disappear. So I'd prefer to use only random bits.

lfoppiano · 2026-04-02T12:12:03Z

@sebastian-nagel I've implemented the random sequence. I believe we can start iterating over the review. Meanwhile, I'll keep working on adding more unit tests. So far I've added only one, which tests uniqueness and sorting. I'm not sure the unpacking of the records is actually done correctly. To check, I need to run Nutch and generate a new segment, but this takes more time.

sebastian-nagel

Hi @lfoppiano, thanks! Looks good.

Tested the code using Fetcher (a few pages) and WarcExport (37k pages):

compared WARC-Date and timestamp from UUID:

paste <(echo WARC-Date; fastwarc index -fwarc-date warc/*/*.warc.gz | jq -r '."warc-date"' | tr T " ") <(fastwarc index -fwarc-record-id warc/*/*.warc.gz | jq -r '."warc-record-id"' | cut -c11-46 | TZ=UTC uuidparse) | cut -c1-19,75-97 | perl -a -lne 'print if $F[0] ne $F[2] || $F[1] ne $F[3]'

all timestamps are the same
including that of the warcinfo record (timestamp indicates start of fetching)

verified that all UUIDs are unique
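The same comparison between WARC-Date and the timestamp embedded in the record ID could also be sketched in Java, for example as part of a unit test. The class and method names below are hypothetical; the sketch assumes record IDs of the form `<urn:uuid:...>` and compares at second precision, as the shell pipeline above does.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.UUID;

// Hypothetical check, not part of this PR.
class UuidTimestampCheck {
    /** Returns the instant stored in the 48 most significant bits of a version 7 UUID. */
    static Instant timestampOf(UUID uuid) {
        if (uuid.version() != 7) {
            throw new IllegalArgumentException("not a version 7 UUID: " + uuid);
        }
        return Instant.ofEpochMilli(uuid.getMostSignificantBits() >>> 16);
    }

    /** True if the UUID timestamp and the WARC-Date agree when truncated to seconds. */
    static boolean matchesWarcDate(String warcRecordId, String warcDate) {
        // Strip the "<urn:uuid:" prefix and the ">" suffix.
        String uuidText = warcRecordId.substring("<urn:uuid:".length(), warcRecordId.length() - 1);
        Instant fromUuid = timestampOf(UUID.fromString(uuidText)).truncatedTo(ChronoUnit.SECONDS);
        Instant fromHeader = Instant.parse(warcDate).truncatedTo(ChronoUnit.SECONDS);
        return fromUuid.equals(fromHeader);
    }

    public static void main(String[] args) {
        // Build a sample version 7 UUID with fixed (non-random) filler bits for the demonstration.
        long ms = Instant.parse("2026-04-02T08:47:35.123Z").toEpochMilli();
        UUID id = new UUID((ms << 16) | 0x7000L | 0x0ABCL, 0x8000000000000000L | 0x1234L);
        System.out.println(matchesWarcDate("<urn:uuid:" + id + ">", "2026-04-02T08:47:35Z"));
    }
}
```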

lfoppiano · 2026-04-02T13:38:05Z

Great! We'll merge this later on, together with the rest of https://github.com/commoncrawl/issues/issues/630

lfoppiano added 2 commits April 1, 2026 21:17

feat: integrate the CC0 library from @belief-driven-design blog-uuidv7

3881d2c

feat: use the timestamp for generating UUID of type 7

15ba592

lfoppiano added 2 commits April 2, 2026 11:02

feat: overload method supplying the sequence

77e9c29

feat: unit tests

969ea44

lfoppiano added 2 commits April 2, 2026 11:44

fix: unit tests naming convention

3c04c74

fix: deprecate non-timestamp provided UUID generated methods

cf7331e

lfoppiano added 2 commits April 2, 2026 12:39

feat: update unit tests

c6a462e

fix: make unit tests work locally

8031e8d

lfoppiano added 2 commits April 2, 2026 13:58

fix: use random sequence

3e53715

tests: add test on segment for warc writer

7bc68cc

lfoppiano marked this pull request as ready for review April 2, 2026 12:05

lfoppiano changed the title from "Features/use UUID type7" to "Generate Identifiers using UUID type7 with the capture timestamp as the 48 most significant bits" Apr 2, 2026

lfoppiano linked an issue Apr 2, 2026 that may be closed by this pull request

WARC writer: consider usage of Type 7 UUIDs as WARC-Record-ID #41

Open

sebastian-nagel approved these changes Apr 2, 2026

lfoppiano wants to merge 10 commits into cc from features/use-uuid-type7
