WARC writer: consider usage of Type 7 UUIDs as WARC-Record-ID #41

@sebastian-nagel

Common Crawl's WarcWriter uses a type 4 (pseudo-randomly generated) UUID obtained from Java's UUID.randomUUID() method.

RFC 9562 (published in 2024 and obsoleting RFC 4122) defines a new Type 7 (UUIDv7) combining a 48-bit Unix timestamp (epoch milliseconds) with random bits. This would allow encoding the capture time (WARC-Date) in the UUID used as the WARC-Record-ID:

  • adding usable information to the record ID (e.g., for verification of the WARC-Date)
    • this requires that the capture time is used as the timestamp, not the time at which the WARC record is created
  • while reducing the entropy required to store the WARC-Record-ID, at least in the columnar index (see Add column to hold WARC-Record-ID cc-index-table#42)
  • record IDs sort by timestamp
  • uniqueness should still be guaranteed by the 74 random bits (or 62 random bits plus a 12-bit sub-millisecond timestamp)
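
To make the bit layout concrete, here is a minimal sketch of how a Type 7 UUID could be built from the capture time in Java. It assumes the RFC 9562 layout (48-bit millisecond timestamp, 4-bit version, 12 random bits, 2-bit variant, 62 random bits). The class and method names (`UuidV7Sketch`, `fromCaptureTime`, `captureTimeOf`, `toWarcRecordId`) are hypothetical and not part of the existing WarcWriter. It is an illustration, not a proposed implementation.

```java
import java.security.SecureRandom;
import java.time.Instant;
import java.util.UUID;

/** Hypothetical helper: builds RFC 9562 version 7 UUIDs from a capture time. */
final class UuidV7Sketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Builds a version 7 UUID whose 48-bit timestamp field is the capture time. */
    static UUID fromCaptureTime(Instant captureTime) {
        // unix_ts_ms: milliseconds since the Unix epoch, truncated to 48 bits
        long millis = captureTime.toEpochMilli() & 0xFFFFFFFFFFFFL;
        // rand_a: 12 random bits (could instead carry a sub-millisecond fraction)
        long randA = RANDOM.nextInt(1 << 12);
        // most significant 64 bits: timestamp | version 7 | rand_a
        long msb = (millis << 16) | (0x7L << 12) | randA;
        // rand_b: 62 random bits, prefixed with the two variant bits "10"
        long randB = RANDOM.nextLong() & 0x3FFFFFFFFFFFFFFFL;
        long lsb = 0x8000000000000000L | randB;
        return new UUID(msb, lsb);
    }

    /** Recovers the embedded timestamp, e.g. to cross-check it against WARC-Date. */
    static Instant captureTimeOf(UUID id) {
        return Instant.ofEpochMilli(id.getMostSignificantBits() >>> 16);
    }

    /** Formats the UUID in the URN form commonly used for WARC-Record-ID. */
    static String toWarcRecordId(UUID id) {
        return "<urn:uuid:" + id + ">";
    }
}
```

With this layout the timestamp occupies the first 12 hex digits of the string form, so record IDs compared as strings order by capture time at millisecond granularity. Note that WARC-Date is often written with second precision, so a verification step would compare the recovered timestamp at the precision of the stored date.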

Notes and links:
