Add column to hold WARC-Record-ID #42

The WARC-Record-ID has been used in several datasets derived from Common Crawl data as the ID field / column to reference records and establish provenance. See for example the [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) or [GneissWeb](https://huggingface.co/datasets/ibm-granite/GneissWeb) datasets. To establish the provenance link between the source and a derived dataset, a relation table is currently required for join operations. This table must include the WARC-Record-ID as well as the URL and capture timestamp. See the blog post [Announcing GneissWeb Annotations](https://commoncrawl.org/blog/announcing-gneissweb-annotations) for further information.

Adding the WARC-Record-ID directly to the columnar index would allow for faster joins without the need for the relation table.
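
As a rough sketch of what such a direct join could look like, the following uses Python's built-in `sqlite3` as a stand-in for engines such as Athena, Trino or Spark. The table names `derived` and `ccindex`, the reduced column set and the sample row are assumptions made purely for illustration; only `warc_record_id` is the column proposed here.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Hypothetical, heavily simplified stand-ins for the real tables.
    CREATE TABLE derived (id TEXT, text TEXT);
    CREATE TABLE ccindex (
        url TEXT,
        fetch_time TEXT,
        warc_filename TEXT,
        warc_record_offset INTEGER,
        warc_record_id TEXT  -- the proposed new column
    );
""")

# Made-up sample record, stored in the full <urn:uuid:...> form.
rid = "<urn:uuid:00000000-0000-4000-8000-000000000001>"
con.execute("INSERT INTO derived VALUES (?, ?)", (rid, "some extracted text"))
con.execute(
    "INSERT INTO ccindex VALUES (?, ?, ?, ?, ?)",
    ("https://example.com/", "2024-01-01T00:00:00Z",
     "crawl-data/example.warc.gz", 1234, rid),
)

# Direct join on the record ID, without an intermediate relation table.
rows = con.execute("""
    SELECT d.id, c.url, c.warc_filename, c.warc_record_offset
    FROM derived AS d
    JOIN ccindex AS c ON c.warc_record_id = d.id
""").fetchall()
print(rows)
```

If the index stored a condensed representation instead, the join condition would need a conversion on one side, which is part of the trade-off discussed in the task list.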

Because the estimated size of the record ID column is large, exhaustive testing of the variant implementations is required.
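
To make the size concern concrete, here is a back-of-the-envelope calculation. It assumes an index of 3 billion records and the two per-record sizes mentioned in the task list (22.5 and 15.5 bytes after compression); the function name `column_footprint_gib` is made up for this sketch.

```python
GIB = 2**30  # bytes per GiB


def column_footprint_gib(num_records: int, bytes_per_id: float) -> float:
    """Estimated size of the record ID column in GiB."""
    return num_records * bytes_per_id / GIB


num_records = 3_000_000_000  # assumed index size
for bytes_per_id in (22.5, 15.5):
    size = column_footprint_gib(num_records, bytes_per_id)
    print(f"{bytes_per_id:>5} bytes/ID: {size:.1f} GiB")
```

The results are of the same order as the rounded estimates given in the task list.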

See also:
- [WARC 1.0 guidelines: record identification](https://iipc.github.io/warc-specifications/guidelines/warc-implementation-guidelines/#record-identification)


- [ ] Proposed column name: `warc_record_id` (analogous to `warc_record_offset` etc.)
- [ ] Decide on the representation:
  - Include surrounding `<urn:uuid:...>`
    - This is the form used in the [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) and [GneissWeb](https://huggingface.co/datasets/ibm-granite/GneissWeb) datasets.
    - Using this representation would simplify writing join queries, but may decrease performance because the uniform leading 10 bytes (`<urn:uuid:`) must be compared needlessly.
  - Strip the surrounding angle brackets `<...>`. Note that URL indexes may strip `<>`, e.g., for the WARC-Target-URI.
  - Keep only the bare UUID (in whichever data type is chosen)
- [ ] Decide on the data type to store the WARC-Record-ID
  - The record ID is a [UUID](https://en.wikipedia.org/wiki/Universally_unique_identifier), i.e. a 128-bit integer
  - Parquet does not have a 128-bit integer [data type](https://parquet.apache.org/docs/file-format/types/), so the options are:
    - FIXED_LEN_BYTE_ARRAY (used for the UUID [logical type](https://parquet.apache.org/docs/file-format/types/logicaltypes/))
    - variable-length BYTE_ARRAY
    - value length (depending on the representation):
      1. 16 bytes: the bare 128-bit integer in big-endian encoding
      2. 32 bytes: hex digits only
      3. 36 bytes: hex digits plus the four hyphens used for grouping
      4. 47 bytes: the full form including `<urn:uuid:...>`
- [ ] Evaluate compression given the representation and data type:
  - Compression will hopefully bring the variant representations and data types down to a similar size.
  - Common Crawl's [WARC writer](https://github.com/commoncrawl/nutch/blob/23e02c10e78ebeb233072def26e84b61db487f20/src/java/org/commoncrawl/util/WarcWriter.java#L458) uses a version 4 (pseudo-randomly generated) UUID.
  - Entropy is high. Only the 6 bits representing the UUID version and variant are fixed and can be saved by compression.
  - So the lower bound for the compressed size is 15-16 bytes per UUID.
  - The GneissWeb annotations use the full representation (`<urn:uuid:...>`) as a variable-length byte array and spend 22.5 bytes on average to hold a UUID.
  - Using a condensed representation and data type may reduce the storage footprint of the new column considerably. Because a larger column causes row groups to hold fewer rows, this also affects the storage and query performance of other columns.
  - Estimated footprint for an index of 3 billion records: roughly 60 GiB for a 22.5-byte representation and roughly 40 GiB for a 15.5-byte one.
- [ ] Consider using version 7 UUIDs as `WARC-Record-ID`
  - see commoncrawl/nutch#41
- [ ] Exhaustively test querying and processing using Athena, Presto, Trino, Spark, DuckDB, Hive, etc.
  - see the UUID data type of [Presto](https://prestodb.io/docs/current/language/types.html#uuid) and [Trino](https://trino.io/docs/current/language/types.html#uuid)
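
The four candidate representations and their compressibility can be explored with a small experiment. The sketch below uses only Python's standard library; the helper names `representations` and `parse_record_id` are made up for illustration, the UUIDs are synthetic version 4 values rather than real record IDs, and `zlib` is only a rough proxy for the encodings and codecs Parquet would actually apply.

```python
import random
import uuid
import zlib


def representations(u: uuid.UUID) -> dict[str, bytes]:
    """Return the four candidate encodings of one UUID."""
    return {
        "binary (16)": u.bytes,                       # big-endian 128-bit integer
        "hex (32)": u.hex.encode("ascii"),
        "hyphenated (36)": str(u).encode("ascii"),
        "urn (47)": f"<{u.urn}>".encode("ascii"),     # <urn:uuid:...>
    }


def parse_record_id(record_id: str) -> uuid.UUID:
    """Parse a WARC-Record-ID given with or without the <urn:uuid:...> wrapper."""
    return uuid.UUID(record_id.strip("<>"))


# Synthetic version 4 UUIDs from a seeded generator, so runs are repeatable.
rng = random.Random(42)
uuids = [uuid.UUID(int=rng.getrandbits(128), version=4) for _ in range(10_000)]

# Build one "column" per representation by concatenating all values.
columns: dict[str, list[bytes]] = {}
for u in uuids:
    for name, value in representations(u).items():
        columns.setdefault(name, []).append(value)

# Compare raw and zlib-compressed size per UUID.
per_id: dict[str, float] = {}
for name, values in columns.items():
    raw = b"".join(values)
    per_id[name] = len(zlib.compress(raw, 9)) / len(values)
    print(f"{name:<16} raw {len(raw) / len(values):5.1f}  "
          f"compressed {per_id[name]:5.1f} bytes/UUID")
```

The exact numbers depend on the codec and on how the Parquet writer encodes the column, so measurements on real index files would still be needed.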

 
