The WARC-Record-ID has been used in several datasets derived from Common Crawl data as the ID field / column to reference records and establish provenance. See for example the FineWeb or GneissWeb datasets. To establish the provenance link between the source and the derived dataset, a relation table is required for join operations. This table must include both the WARC-Record-ID and the URL plus capture timestamp. See "Announcing GneissWeb Annotations" for further information.
Adding the WARC-Record-ID directly to the columnar index would allow for faster joins without the need for the relation table.
Because the estimated size of the record ID column is large, exhaustive testing of the variant implementations is required.
Decide on the representation:
Keep the full <urn:uuid:...> form. This representation would simplify writing the join queries, but may decrease performance because the leading 10 bytes (<urn:uuid:), which are identical for every record, need to be compared needlessly.
Strip the surrounding angle brackets <...>. Note that URL indexes may strip <>, e.g., for the WARC-Target-URI.
Keep only the bare UUID (in whichever data type is chosen).
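As an illustration of the three representation options, the following minimal sketch normalizes any of them to a bare UUID so that records can be matched regardless of which form a dataset stores. The function names `parse_record_id` and `full_form` and the example ID are hypothetical and not part of any Common Crawl tooling.

```python
import uuid

URN_PREFIX = "urn:uuid:"


def parse_record_id(value: str) -> uuid.UUID:
    """Accept '<urn:uuid:...>', 'urn:uuid:...' or a bare UUID string."""
    text = value.strip()
    # Option 2: strip the surrounding angle brackets, if present.
    if text.startswith("<") and text.endswith(">"):
        text = text[1:-1]
    # Option 3: strip the uniform prefix and keep only the bare UUID.
    if text.lower().startswith(URN_PREFIX):
        text = text[len(URN_PREFIX):]
    return uuid.UUID(text)


def full_form(record_uuid: uuid.UUID) -> str:
    """Rebuild the full representation as written in the WARC header."""
    return f"<urn:uuid:{record_uuid}>"


if __name__ == "__main__":
    # Hypothetical record ID, used for illustration only.
    example = "<urn:uuid:12345678-1234-4678-9234-567812345678>"
    parsed = parse_record_id(example)
    print(parsed, full_form(parsed) == example)
```

Normalizing on read keeps join queries simple even if the index stores a condensed form, at the cost of a conversion step on one side of the join.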
Decide on the data type to store the WARC-Record-ID
Parquet does not have a 128-bit integer data type, so options are:
FIXED_LEN_BYTE_ARRAY (used for the UUID logical type)
variable-length BYTE_ARRAY
Size of the value (depending on the representation):
16 bytes: the raw 128-bit integer in big-endian encoding
32 bytes: hex digits only
36 bytes: hex digits including the four hyphens used for grouping
47 bytes: the full form including <urn:uuid:...>
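The four sizes listed above can be checked with Python's standard `uuid` module. This is only a sketch of the encodings themselves; the helper name `representations` is hypothetical, and writing the values into a Parquet column is not shown.

```python
import uuid


def representations(record_uuid: uuid.UUID) -> dict[str, bytes]:
    """Encode one UUID in each candidate representation."""
    return {
        # Raw 128-bit integer, big-endian (fits FIXED_LEN_BYTE_ARRAY(16)).
        "binary": record_uuid.bytes,
        # Hex digits only.
        "hex": record_uuid.hex.encode("ascii"),
        # Hex digits grouped by four hyphens.
        "hyphenated": str(record_uuid).encode("ascii"),
        # Full form as written in the WARC header.
        "urn": f"<urn:uuid:{record_uuid}>".encode("ascii"),
    }


if __name__ == "__main__":
    for name, value in representations(uuid.uuid4()).items():
        print(f"{name:>10}: {len(value)} bytes")
```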
Evaluate compression given the representation and data type:
Compression will hopefully bring the variant representations and data types down to a similar size.
Common Crawl's WARC writer uses a type 4 (pseudo-randomly generated) UUID.
Entropy is high. Only the 6 fixed bits representing the UUID version and variant can be saved by compression.
So the lower bound for the compressed size is 15-16 bytes per UUID (the remaining 122 random bits are about 15.25 bytes).
The GneissWeb annotations use the full representation (<urn:uuid:...>) as a variable-length byte array and spend 22.5 bytes on average to hold a UUID.
Using a condensed representation and data type may reduce the storage footprint of the new column non-trivially. Because a larger column means fewer rows fit into a row group, this also affects the storage and query performance of the other columns.
Estimated footprint for an index of 3 billion records: roughly 60 GiB for a 22.5-byte representation and roughly 40 GiB for a 15.5-byte one.
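A rough way to explore the compression question is to encode a batch of random type 4 UUIDs in each representation and compress the concatenated values. The sketch below uses `zlib` from the standard library as a stand-in; that is an assumption, since the Parquet codecs and encodings actually used for the index will behave differently, so the numbers are only indicative. The names `compressed_bytes_per_uuid` and `footprint_gib` are hypothetical.

```python
import random
import uuid
import zlib

ENCODERS = {
    "16-byte binary": lambda u: u.bytes,
    "32 hex digits": lambda u: u.hex.encode("ascii"),
    "36 with hyphens": lambda u: str(u).encode("ascii"),
    "47 full urn": lambda u: f"<urn:uuid:{u}>".encode("ascii"),
}


def compressed_bytes_per_uuid(n: int = 20000, seed: int = 0) -> dict[str, float]:
    """Average compressed size per UUID for each representation."""
    rng = random.Random(seed)
    # version=4 sets the 6 fixed version/variant bits; 122 bits stay random.
    ids = [uuid.UUID(int=rng.getrandbits(128), version=4) for _ in range(n)]
    result = {}
    for name, encode in ENCODERS.items():
        raw = b"".join(encode(u) for u in ids)
        result[name] = len(zlib.compress(raw, 9)) / n
    return result


def footprint_gib(bytes_per_id: float, records: int = 3_000_000_000) -> float:
    """Column footprint in GiB for a given average size per record ID."""
    return bytes_per_id * records / 2**30


if __name__ == "__main__":
    for name, size in compressed_bytes_per_uuid().items():
        print(f"{name:>16}: {size:5.2f} bytes/UUID, "
              f"{footprint_gib(size):5.1f} GiB for 3 billion records")
```

With the figures from the text, `footprint_gib(22.5)` evaluates to about 62.9 and `footprint_gib(15.5)` to about 43.3, in line with the rounded 60 GiB and 40 GiB estimates.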
See also:
WARC 1.0 guidelines: record identification
Proposed column name: warc_record_id (analogous to warc_record_offset etc.)
Consider using type 7 UUID as WARC-Record-ID.
Exhaustively test querying and processing using Athena, Presto, Trino, Spark, DuckDB, Hive, etc.