WAT extractor: do not fail on missing WARC-Filename in warcinfo record #88

@sebastian-nagel

(see commoncrawl#23)

The WAT resource extractor fails on WARC files that contain a "warcinfo" record with no "WARC-Filename" header (e.g., CC-NEWS-20160827132735-00002.warc.gz):

$> java -cp ... org.archive.extract.ResourceExtractor -wat CC-NEWS-20160827132735-00002.warc.gz
Exception in thread "main" java.io.IOException: No Envelope.WARC-Header-Metadata.WARC-Filename found.
        at org.archive.extract.WATExtractorOutput.extractOrIO(WATExtractorOutput.java:136)
        at org.archive.extract.WATExtractorOutput.writeWARC(WATExtractorOutput.java:154)
        at org.archive.extract.WATExtractorOutput.output(WATExtractorOutput.java:74)
        at org.archive.extract.ResourceExtractor.run(ResourceExtractor.java:139)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.archive.extract.ResourceExtractor.main(ResourceExtractor.java:62)

The simplest solution would be to still extract a metadata record (concurrent to the warcinfo record that lacks WARC-Filename), but to write it without a WARC-Target-URI header.
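A minimal sketch of that idea, not the actual WATExtractorOutput code: it assumes the metadata record's WARC-Target-URI is normally taken from the warcinfo's WARC-Filename, as the stack trace suggests. The class `WatMetadataHeaders` and the methods `extractOrNull` and `buildMetadataHeaders` are hypothetical names used only for illustration, and the headers are modelled as a plain map rather than the extractor's JSON envelope.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, NOT part of the real extractor: shows a lenient lookup
// that skips the optional header instead of throwing an IOException.
class WatMetadataHeaders {

    /** Returns the trimmed header value, or null if it is missing or blank. */
    static String extractOrNull(Map<String, String> headers, String name) {
        String value = headers.get(name);
        if (value == null || value.trim().isEmpty()) {
            return null;
        }
        return value.trim();
    }

    /** Builds the headers of the metadata record that accompanies a warcinfo record. */
    static Map<String, String> buildMetadataHeaders(Map<String, String> warcinfoHeaders) {
        Map<String, String> out = new LinkedHashMap<>();
        out.put("WARC-Type", "metadata");

        // Link the metadata record to the warcinfo record it describes.
        String recordId = extractOrNull(warcinfoHeaders, "WARC-Record-ID");
        if (recordId != null) {
            out.put("WARC-Concurrent-To", recordId);
        }

        // Assumption: the target URI comes from WARC-Filename.
        // If the filename is absent, omit WARC-Target-URI rather than fail.
        String filename = extractOrNull(warcinfoHeaders, "WARC-Filename");
        if (filename != null) {
            out.put("WARC-Target-URI", filename);
        }
        return out;
    }
}
```

With a warcinfo that has only a WARC-Record-ID, the resulting map contains WARC-Type and WARC-Concurrent-To but no WARC-Target-URI. When WARC-Filename is present, the record is built as before.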
