WAT generator: do not fail on missing WARC-Filename in warcinfo record #23

@sebastian-nagel

(reported by @Xue-Alex, see the discussion in the Common Crawl group)

The WAT/WET generator fails on WARC files which contain a "warcinfo" record with no "WARC-Filename" header:

java.io.IOException: No Envelope.WARC-Header-Metadata.WARC-Filename found.
        at org.archive.extract.WATExtractorOutput.extractOrIO(WATExtractorOutput.java:152)
        at org.archive.extract.WATExtractorOutput.writeWARC(WATExtractorOutput.java:170)
        at org.archive.extract.WATExtractorOutput.output(WATExtractorOutput.java:85)

The first 60 WARC files of the CC-NEWS dataset (written Aug - Oct 2016) lack this field in their warcinfo records.

However, the WAT/WET extractor should not fail because the WARC-Filename header field is optional ("may be used in ‘warcinfo’ type records").

The WARC-Filename is used to fill the WARC-Target-URI header of the corresponding metadata record. Again, this field is optional (a "‘metadata’ record may have a WARC-Target-URI field"), so it seems natural to simply omit the WARC-Target-URI for metadata records corresponding to a warcinfo record without WARC-Filename.
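
A minimal sketch of the proposed behaviour, to make the idea concrete. It is not the actual WATExtractorOutput code: the class `MetadataHeaderSketch`, the helper `buildMetadataHeaders`, and the use of a plain `Map` for header fields are all assumptions made for illustration. The point is only that a missing WARC-Filename leads to a metadata record without WARC-Target-URI instead of an IOException.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetadataHeaderSketch {

    // Hypothetical helper, not part of the real WATExtractorOutput class.
    // Builds the header fields of the metadata record that corresponds
    // to a warcinfo record, given the warcinfo header fields.
    static Map<String, String> buildMetadataHeaders(Map<String, String> warcinfoHeaders) {
        Map<String, String> metadataHeaders = new LinkedHashMap<>();
        metadataHeaders.put("WARC-Type", "metadata");

        // WARC-Filename is optional in warcinfo records, so a missing
        // value is not an error: WARC-Target-URI is simply left out.
        String filename = warcinfoHeaders.get("WARC-Filename");
        if (filename != null && !filename.isEmpty()) {
            metadataHeaders.put("WARC-Target-URI", filename);
        }
        return metadataHeaders;
    }

    public static void main(String[] args) {
        // warcinfo record that carries a filename (made-up value)
        Map<String, String> withFilename = new LinkedHashMap<>();
        withFilename.put("WARC-Type", "warcinfo");
        withFilename.put("WARC-Filename", "example.warc.gz");

        // warcinfo record without WARC-Filename, as in the early CC-NEWS files
        Map<String, String> withoutFilename = new LinkedHashMap<>();
        withoutFilename.put("WARC-Type", "warcinfo");

        System.out.println(buildMetadataHeaders(withFilename));
        System.out.println(buildMetadataHeaders(withoutFilename));
    }
}
```

In the real extractor the equivalent change would presumably replace the strict lookup that currently throws in `extractOrIO` with an optional one for this field only.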
