CompressedWARCReader does not work for Common Crawl WARC files.

When reading a Common Crawl WARC file (e.g. crawl-data/CC-MAIN-2018-34/segments/1534221208676.20/warc/CC-MAIN-20180814062251-20180814082251-00000.warc.gz), iteration fails on moving to the second record. In cleanupCurrentRecord(), close() moves to the end of the record, but then gotoEOR(this.currentRecord) expects one of the next 4 characters to be -1. None of them is, because the reader is simply seeing the start of the next record.
This results in "unexpected extra data after record" being written to stderr, followed by failing to parse any more records.
It seems like removing gotoEOR will solve the problem, but I'm not sure I understand the logic behind this code, so maybe a smarter fix is needed.
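
To illustrate the symptom, here is a minimal stdlib-only sketch. It is not the library's code. It assumes the file is a sequence of gzip members, one per record, as is conventional for .warc.gz files, and it looks at the raw compressed bytes, which may not be the layer gotoEOR actually reads from. The class and method names (RecordBoundaryDemo, gzipMember, byteAfterFirstMember) are made up for the example. It shows that the byte right after the first record is -1 only when there is no second record; otherwise it is the first byte of the next gzip member.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class RecordBoundaryDemo {

    /** Compress one (fake) record as its own gzip member. */
    public static byte[] gzipMember(String record) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(record.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    /**
     * Concatenate one gzip member per record, skip past the first member,
     * and return the next raw byte (or -1 if the data ends there).
     */
    public static int byteAfterFirstMember(String... records) throws IOException {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        int firstLen = -1;
        for (String r : records) {
            byte[] member = gzipMember(r);
            if (firstLen < 0) {
                firstLen = member.length; // remember where record one ends
            }
            file.write(member);
        }
        ByteArrayInputStream in = new ByteArrayInputStream(file.toByteArray());
        in.skip(firstLen);
        return in.read();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder payloads, not valid WARC records.
        int single = byteAfterFirstMember("record one\r\n\r\n");
        int multi = byteAfterFirstMember("record one\r\n\r\n", "record two\r\n\r\n");
        System.out.println("single-record file, next byte: " + single);      // -1
        System.out.printf("two-record file, next byte: 0x%02x%n", multi);    // 0x1f
    }
}
```

If gotoEOR is effectively checking for end-of-stream at this level, that check can only pass on the last record of a multi-record file, which would match the behaviour described above. That is a guess from the symptom rather than from reading the code.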

CompressedWARCReader does not work for Common Crawl WARC files. #81
