Skip to content

ARCRecord entered inconsistent state for some ARC files#40

Open
tokee wants to merge 8 commits into
iipc:masterfrom
tokee:arc-read-http-header
Open

ARCRecord entered inconsistent state for some ARC files#40
tokee wants to merge 8 commits into
iipc:masterfrom
tokee:arc-read-http-header

Conversation

@tokee

@tokee tokee commented Feb 18, 2015

Copy link
Copy Markdown

If no status could be found, ARCReader#readHttpHeader read forward until next status in the originating ARC-Stream, thereby reading past the current record and into the next one. This was combined with a faulty exclusion of status lines containing colons.

This pull request contains a sample ARC file that triggered the two bugs as well as unit tests for the bugs.

@anjackson

Copy link
Copy Markdown
Member

This 'off spec' hook looks like an ARC format variation edge case to me (see https://github.com/iipc/webarchive-commons/pull/40/files#diff-1a20f5668b210061c0267c7f079c8830L600), so I'd really like @nlevitt or @kngenie to take a look at this and see if they are happy with the change.

@tokee

tokee commented Feb 18, 2015

Copy link
Copy Markdown
Author

With the check for empty line (i.e. no more HTTP-response lines), keeping the check for colon just means that individual records are wrongly processed instead of the rest of the ARC file. But I do find it curious to discard well-formed status lines. If the colon-check it kept, maybe a rationale could be given as a comment in the code?

@anjackson

Copy link
Copy Markdown
Member

@tokee And, just to check, the removal of that colon-test is the only material change you're making right? All the other stuff is in the /test/ area, right?

@tokee

tokee commented Feb 18, 2015

Copy link
Copy Markdown
Author

Removing of colon and adding of newline-check are the only real changes, yes. We are in the process of testing with 95 known problem-ARCs (~100MB each) locally and I am happy to say that the first 9 has passed without ARC-parse-showstoppers with the new ARCReader.

@tokee

tokee commented Feb 18, 2015

Copy link
Copy Markdown
Author

Out test run finished with 94/95 previously partly-skipped ARC-files being fully processed. The last one seems to fail due to the HTTP-header-parser being too greedy (I'll update issue #41 with a description).

@thomasegense informs me that about 3% of our ARC files gets marked as problematic when indexing without the patch, so it seems that colon is not uncommon in HTTP-status lines.

@johnerikhalse

Copy link
Copy Markdown
Contributor

Seems to me that this change is all about supporting real arcs which actually conforms to the spec. But since it is possible for this change to let through records to OpenWayback which today would fail, I propose it to be included in a minor release and not the next bugfix release.

@johnerikhalse johnerikhalse added this to the 1.2.0 Release milestone Apr 26, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants