WAT: Duplicated payload metadata values for "Actual-Content-Length" and "Trailing-Slop-Length" #43

@sebastian-nagel

With multi-valued metadata (#38), the payload metadata fields "Actual-Content-Length" and "Trailing-Slop-Length" are duplicated in WAT records derived from WARC metadata and WARC response records. Here is one example:

  "Envelope": {
    "Format": "WARC/1.0",
    "Payload-Metadata": {
      "Actual-Content-Length": [
        "418",
        "418"
      ],
      "Actual-Content-Type": "application/warc-fields",
      "Trailing-Slop-Length": [
        "4",
        "0"
      ]
    },
    "WARC-Header-Metadata": {
      "WARC-Type": "warcinfo"
    }

The reason is that these values are set (or appended) twice: once in the class WARCResource and once in one of the factory classes, WARCMetaDataResourceFactory or HTTPHeadersResourceFactory.

  • The value of "Actual-Content-Length" is simply duplicated.
  • "Trailing-Slop-Length" has two values: 4 is set in WARCResource, while 0 is set in the factory classes. Before #38 (Make MetaData multi-valued to preserve values of repeating WARC and HTTP headers), only the last value, 0, made it into the WAT record.
    • This is also inconsistent: other WARC record types, e.g. WARC request, carry the first value (4), because there the value is set only once and is never overwritten.
    • Unfortunately, the documentation of "Trailing-Slop-Length" in the WAT spec ("Number of trailing slop bytes") does not help to decide which of the two values is correct. Both make sense: 4 bytes (\r\n\r\n) used as the WARC record separator, or zero superfluous bytes.

Anyway, the duplicated values should be dropped.
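
To illustrate the mechanism, here is a minimal sketch of why a field written from two places ends up with two values once metadata is multi-valued. The class `MultiValuedMetaDataSketch` and its methods `append`, `set` and `get` are hypothetical and are not the project's actual MetaData API. The sketch only contrasts append semantics (needed for repeating headers) with replace semantics (what single-valued fields such as these presumably want). Which of the two writes should win for "Trailing-Slop-Length" remains an open question.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a multi-valued metadata container.
// This is NOT the project's real MetaData class.
public class MultiValuedMetaDataSketch {
    private final Map<String, List<String>> fields = new LinkedHashMap<>();

    /** Appends a value and keeps earlier ones (needed for repeating headers). */
    public void append(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Replaces any earlier values, so the field stays single-valued. */
    public void set(String name, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        fields.put(name, single);
    }

    /** Returns all values recorded for a field (empty list if none). */
    public List<String> get(String name) {
        return fields.getOrDefault(name, new ArrayList<>());
    }

    public static void main(String[] args) {
        MultiValuedMetaDataSketch m = new MultiValuedMetaDataSketch();

        // Two code paths append to the same field: both values survive.
        m.append("Trailing-Slop-Length", "4");
        m.append("Trailing-Slop-Length", "0");
        System.out.println(m.get("Trailing-Slop-Length")); // [4, 0]

        // With replace semantics the second write overwrites the first.
        m.set("Actual-Content-Length", "418");
        m.set("Actual-Content-Length", "418");
        System.out.println(m.get("Actual-Content-Length")); // [418]
    }
}
```

Under this reading, the duplicates could be dropped either by writing each field from only one place or by using replace semantics for fields that are single-valued by definition.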
