WAT extractor: envelope structure does not conform to the WAT specification #44

@saraaubry

According to the WAT specification (https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Metadata+File+Specification), the envelope structure should be:
    "Envelope": {
        "Format": "WARC",
        "Payload-Metadata": {},
        "WARC-Header-Length": "298",
        "WARC-Header-Metadata": {}
    }

In the WAT files generated with the extractor, we have the following structure:
    "Envelope": {
        "Format": "WARC",
        "WARC-Header-Length": "298",
        "Actual-Content-Length": "1343",
        "WARC-Header-Metadata": {},
        "Block-Digest": "sha1:XW7VSE74YCSE6AIJNT5AVSELMVBCIYYN",
        "Payload-Metadata": {}
    }
Block-Digest and Actual-Content-Length are not supposed to be in this section.
The Payload-Metadata section also contains an Actual-Content-Length and an Entity-Digest.
The content and computation of these four metadata fields need to be clarified.
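
To make the mismatch easy to reproduce, here is a minimal Python sketch that lists the envelope keys that fall outside the structure quoted from the specification above. The function name `unexpected_envelope_keys` and the constant `SPEC_ENVELOPE_KEYS` are hypothetical, chosen for illustration only; they are not part of the extractor. The allowed key set is taken from the example in this issue, not from a full reading of the specification.

```python
import json

# Keys of the "Envelope" object as quoted from the WAT specification in this
# issue (assumption: the quoted example lists every allowed key).
SPEC_ENVELOPE_KEYS = {
    "Format",
    "Payload-Metadata",
    "WARC-Header-Length",
    "WARC-Header-Metadata",
}


def unexpected_envelope_keys(envelope: dict) -> list[str]:
    """Return, sorted, the envelope keys that are not in the quoted spec structure."""
    return sorted(set(envelope) - SPEC_ENVELOPE_KEYS)


# Envelope as observed in the generated WAT files (values copied from this issue).
observed = json.loads("""
{
  "Format": "WARC",
  "WARC-Header-Length": "298",
  "Actual-Content-Length": "1343",
  "WARC-Header-Metadata": {},
  "Block-Digest": "sha1:XW7VSE74YCSE6AIJNT5AVSELMVBCIYYN",
  "Payload-Metadata": {}
}
""")

print(unexpected_envelope_keys(observed))
# prints ['Actual-Content-Length', 'Block-Digest']
```

The same check could be run on the Payload-Metadata object once the expected keys for that section are settled.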
