When reading a Common Crawl WARC file (e.g. crawl-data/CC-MAIN-2018-34/segments/1534221208676.20/warc/CC-MAIN-20180814062251-20180814082251-00000.warc.gz), iteration breaks when moving to the second record. In cleanupCurrentRecord(), close() moves to the end of the current record, but gotoEOR(this.currentRecord) then expects one of the next 4 characters to be -1 (end of stream). None of them is: the reader is simply seeing the start of the next record.
As a result, "unexpected extra data after record" is written to stderr, and no further records are parsed.
It seems like removing gotoEOR will solve the problem, but I'm not sure I understand the logic behind this code, so maybe a smarter fix is needed.
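
To make the idea of a "smarter fix" concrete, here is a minimal sketch of a more tolerant end-of-record step. It is not the library's actual code: the class EorSketch and the method skipRecordTrailer are hypothetical names, and it assumes the underlying stream supports mark/reset and that a record is followed only by CR/LF bytes before the next record begins. Instead of requiring end of stream, it consumes the trailing line breaks and stops, without consuming anything, at the first byte of whatever follows.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class EorSketch {

    /**
     * Hypothetical helper (not part of the library): skips the CR/LF bytes
     * that trail a record and leaves the stream positioned at the first
     * byte of the next record, or at end of stream.
     *
     * @return the number of CR/LF bytes that were skipped
     */
    public static int skipRecordTrailer(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        int skipped = 0;
        while (true) {
            in.mark(1);                 // remember the position before peeking
            int c = in.read();
            if (c == '\r' || c == '\n') {
                skipped++;              // part of the record trailer, keep going
                continue;
            }
            if (c != -1) {
                in.reset();             // start of the next record: un-read it
            }
            return skipped;             // end of stream or next record reached
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulated bytes right after a record body: the trailer, then the next header.
        byte[] data = "\r\n\r\nWARC/1.0\r\n".getBytes(StandardCharsets.US_ASCII);
        InputStream in = new ByteArrayInputStream(data);

        int skipped = skipRecordTrailer(in);
        System.out.println(skipped + " " + (char) in.read()); // prints "4 W"
    }
}
```

Whether something like this fits depends on what gotoEOR is meant to guarantee for compressed input, which is exactly the part of the logic I'm unsure about.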