Generate Identifiers using UUID type7 with the capture timestamp as the 48 most significant bits #50
lfoppiano wants to merge 10 commits into
I have a few questions:
|
Yes. Also deprecate |
We should use a random value, not a sequence or a sub-millisecond clock. RFC 9562 allows this and even mentions it as the first option. That's also how the Java 26 Type-7 UUID is created, cf. implementation / PR. The timestamp has millisecond precision. Collisions on the timestamp are likely, given
Because it's a distributed crawler there is no way to get a reliable sequence without collisions. |
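A rough estimate of what "only random" buys, assuming the RFC 9562 layout in which 74 bits remain after the 48-bit timestamp, the 4 version bits and the 2 variant bits, and assuming $n$ identifiers generated within the same millisecond ($n$ is an illustrative parameter, not a figure from this thread). The birthday approximation for at least one collision among them is:

$$
P_{\text{collision}} \approx \frac{n(n-1)}{2 \cdot 2^{74}} \approx \frac{n^{2}}{2^{75}}
$$

For example, $n = 1000$ gives roughly $10^{6} / (3.8 \times 10^{22}) \approx 2.6 \times 10^{-17}$ per millisecond, which is why no coordination between the distributed fetchers should be needed.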
|
@sebastian-nagel sure, I can add an additional random component. However, what about the sorting/ordering of the generated identifiers? Am I understanding correctly that within the same record fetch we will write the WARC components in order, so they might be naturally sorted?
Of course, we could use 2 bits (maybe reserve 4 bits for that purpose) for a sequence number (0: request, 1: response, 2: metadata). However, the records are already linked per And later on, when we switch to the Java UUID implementation in Java 26 or if it's backported, the hidden feature would disappear. So I'd prefer to use only random bits.
|
@sebastian-nagel I've implemented the random sequence. I believe we can start iterating over the review. Meanwhile, I'll keep working on adding more unit tests. So far I've added only one, which tests uniqueness and sorting. I'm not sure the unpacking of the records is actually done correctly. I need to run Nutch and generate a new segment, but this takes more time.
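For readers following the thread, here is a minimal sketch of the layout being discussed: 48 bits of capture timestamp, the version and variant fields, and random bits everywhere else. It is not the code in this PR (which uses the CC0 implementation linked in the description); the class name `UuidV7Sketch` and the methods `fromTimestamp` and `timestampOf` are hypothetical names chosen for illustration.

```java
import java.security.SecureRandom;
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

// Hypothetical sketch, not the PR's implementation.
final class UuidV7Sketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    // Builds a version 7 UUID whose 48 most significant bits are the
    // given timestamp (milliseconds since the Unix epoch).
    static UUID fromTimestamp(long epochMillis) {
        byte[] rnd = new byte[10];
        RANDOM.nextBytes(rnd);

        long msb = (epochMillis & 0xFFFFFFFFFFFFL) << 16;   // 48-bit timestamp
        msb |= 0x7000L;                                     // version 7
        msb |= ((rnd[0] & 0x0FL) << 8) | (rnd[1] & 0xFFL);  // 12 random bits

        long lsb = 0L;
        for (int i = 2; i < 10; i++) {
            lsb = (lsb << 8) | (rnd[i] & 0xFFL);
        }
        // Keep 62 random bits and set the variant field to binary 10.
        lsb = (lsb & 0x3FFFFFFFFFFFFFFFL) | 0x8000000000000000L;
        return new UUID(msb, lsb);
    }

    // Recovers the millisecond timestamp from the 48 most significant bits.
    static long timestampOf(UUID uuid) {
        return uuid.getMostSignificantBits() >>> 16;
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        Set<UUID> seen = new HashSet<>();
        UUID previous = null;
        for (int i = 0; i < 100_000; i++) {
            // One identifier per (simulated) millisecond.
            UUID current = fromTimestamp(start + i);
            if (!seen.add(current)) {
                throw new IllegalStateException("duplicate: " + current);
            }
            if (previous != null && previous.compareTo(current) >= 0) {
                throw new IllegalStateException("not sorted by timestamp");
            }
            previous = current;
        }
        System.out.println("checked " + seen.size() + " identifiers");
    }
}
```

Note that with random-only filling, ordering is guaranteed only between identifiers with different timestamps; two identifiers from the same millisecond sort in arbitrary order, which matches the trade-off accepted above.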
sebastian-nagel left a comment
Hi @lfoppiano, thanks! Looks good.
Tested the code using Fetcher (a few pages) and WarcExport (37k pages):
- compared WARC-Date and timestamp from UUID:
  paste <(echo WARC-Date; fastwarc index -fwarc-date warc/*/*.warc.gz | jq -r '."warc-date"' | tr T " ") <(fastwarc index -fwarc-record-id warc/*/*.warc.gz | jq -r '."warc-record-id"' | cut -c11-46 | TZ=UTC uuidparse) | cut -c1-19,75-97 | perl -a -lne 'print if $F[0] ne $F[2] || $F[1] ne $F[3]'
  - all timestamps are the same
  - including that of the warcinfo record (timestamp indicates start of fetching)
- verified that all UUIDs are unique
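The same comparison can be sketched in Java for use in a unit test. This is an illustration only: the class `WarcUuidCheck` and its methods are hypothetical names, and it assumes the record ID has the form `<urn:uuid:...>` (which is what the `cut -c11-46` above relies on) and that WARC-Date is an ISO 8601 UTC timestamp. Like the shell pipeline, it compares at second precision.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.UUID;

// Hypothetical helper, not part of this PR.
final class WarcUuidCheck {
    private static final String PREFIX = "<urn:uuid:";

    // Extracts the capture time encoded in a version 7 WARC-Record-ID.
    static Instant timestampOf(String warcRecordId) {
        String id = warcRecordId.trim();
        if (!id.startsWith(PREFIX) || !id.endsWith(">")) {
            throw new IllegalArgumentException("unexpected record ID: " + id);
        }
        UUID uuid = UUID.fromString(id.substring(PREFIX.length(), id.length() - 1));
        if (uuid.version() != 7) {
            throw new IllegalArgumentException("not a version 7 UUID: " + uuid);
        }
        // The 48 most significant bits hold milliseconds since the epoch.
        return Instant.ofEpochMilli(uuid.getMostSignificantBits() >>> 16);
    }

    // True if the UUID timestamp and the WARC-Date agree to the second.
    static boolean matchesWarcDate(String warcRecordId, String warcDate) {
        Instant fromUuid = timestampOf(warcRecordId).truncatedTo(ChronoUnit.SECONDS);
        Instant fromHeader = Instant.parse(warcDate.trim()).truncatedTo(ChronoUnit.SECONDS);
        return fromUuid.equals(fromHeader);
    }

    public static void main(String[] args) {
        // Build a sample ID for 2023-11-14T22:13:20Z (1700000000000 ms).
        UUID sample = new UUID((1700000000000L << 16) | 0x7000L, 0x8000000000000000L);
        String recordId = "<urn:uuid:" + sample + ">";
        System.out.println(recordId + " -> " + timestampOf(recordId));
        System.out.println(matchesWarcDate(recordId, "2023-11-14T22:13:20Z"));
    }
}
```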
|
Great! We'll merge this later on, together with the rest of https://github.com/commoncrawl/issues/issues/630
This PR introduces type 7 UUIDs, using the CC0-licensed implementation from https://github.com/belief-driven-design/blog-uuidv7/.