CompressedWARCReader does not work for Common Crawl WARC files. #81

@YossiTamari

When reading a Common Crawl WARC file (e.g. crawl-data/CC-MAIN-2018-34/segments/1534221208676.20/warc/CC-MAIN-20180814062251-20180814082251-00000.warc.gz), parsing fails when iterating to the second record. In cleanupCurrentRecord(), close() moves to the end of the current record, but gotoEOR(this.currentRecord) then expects one of the next 4 characters to be -1. None of them is: the reader is already getting the start of the next record.
This results in "unexpected extra data after record" being written to stderr, after which no further records are parsed.
Removing gotoEOR seems to solve the problem, but I'm not sure I understand the logic behind this code, so a smarter fix may be needed.
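
To illustrate the symptom outside the library, here is a minimal sketch using only java.util.zip. It assumes (this is my guess, not something I have confirmed in the reader's code) that the file is a sequence of gzip members, one per record, and that the decompressed stream runs straight from one member into the next. The class and method names (MultiMemberGzipDemo, gzipMember, byteAfter, concat) and the placeholder record strings are hypothetical and not part of the library.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; it does not use or reproduce the library's reader.
class MultiMemberGzipDemo {

    // Compress one payload as a standalone gzip member.
    static byte[] gzipMember(String payload) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(payload.getBytes(StandardCharsets.US_ASCII));
        }
        return buf.toByteArray();
    }

    // Concatenate two byte arrays, as if two members were stored back to back.
    static byte[] concat(byte[] a, byte[] b) {
        byte[] out = new byte[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    // Decompress `skip` bytes, then return the next byte, or -1 at end of stream.
    static int byteAfter(byte[] gzipped, int skip) throws IOException {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            for (int i = 0; i < skip; i++) {
                if (in.read() == -1) {
                    return -1;
                }
            }
            return in.read();
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder payloads standing in for WARC records (assumption).
        String first = "record-one\r\n\r\n";
        String second = "record-two\r\n\r\n";

        byte[] single = gzipMember(first);
        byte[] both = concat(single, gzipMember(second));

        // With a single member, the byte after the record is end of stream.
        System.out.println("single member: " + byteAfter(single, first.length()));
        // With two members, the byte after the first record belongs to the second.
        System.out.println("two members:   " + byteAfter(both, first.length()));
    }
}
```

If the reader's stream behaves like this, a check that requires -1 right after a record can only pass on the last record of the file, which would match what I am seeing.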
