Make MetaData multi-valued to preserve values of repeating WARC and HTTP headers #98
Merged
…TTP headers - code cleanup: fix indentation, remove unneeded return statements
Thanks. Released as 1.2.0.
MetaData objects, which hold (among other things) the headers of WARC records and HTTP captures, should be multi-valued so that the values of a repeated header are stored as a list.
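To illustrate the idea, here is a minimal sketch of a multi-valued header container. The class `MultiValuedHeaders` and its methods are hypothetical and are not the `MetaData` class changed in this PR. The sketch assumes that a header seen once stays a plain string, that a repeated header is promoted to a list, and that header names are matched exactly (no case folding).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a multi-valued metadata container (not the PR's MetaData class). */
public class MultiValuedHeaders {
    // Each value is either a String (header seen once) or a List<String> (header repeated).
    private final Map<String, Object> fields = new LinkedHashMap<>();

    /** Adds a value; a second value for the same name turns the entry into a list. */
    @SuppressWarnings("unchecked")
    public void add(String name, String value) {
        Object existing = fields.get(name);
        if (existing == null) {
            // First occurrence: keep the single-valued representation.
            fields.put(name, value);
        } else if (existing instanceof List) {
            // Already multi-valued: append in order of appearance.
            ((List<String>) existing).add(value);
        } else {
            // Second occurrence: promote the single string to a list of two values.
            List<String> values = new ArrayList<>();
            values.add((String) existing);
            values.add(value);
            fields.put(name, values);
        }
    }

    /** Returns a String, a List<String>, or null if the header was never added. */
    public Object get(String name) {
        return fields.get(name);
    }

    public static void main(String[] args) {
        MultiValuedHeaders headers = new MultiValuedHeaders();
        headers.add("Content-Type", "text/html");
        headers.add("Set-Cookie", "a=1");
        headers.add("Set-Cookie", "b=2");
        System.out.println(headers.get("Content-Type"));
        System.out.println(headers.get("Set-Cookie"));
    }
}
```

Keeping single values as plain strings means that consumers of records without repeated headers see the same structure as before; only repeated headers change shape.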
The core objective is to have repeated WARC and HTTP headers extracted into WAT files; see also commoncrawl#18. The WAT specification says nothing about repeated headers, and the examples it gives do not include any repeated header.
Apart from the ubiquitous "Set-Cookie" HTTP header, more and more HTTP headers are repeated within a single response. As an example, the number of WARC response records (out of 31498) from a single Common Crawl WARC file where an HTTP header was repeated:
See also the WARC response record included in this PR and used as test resource.
In addition, proposed WARC headers are allowed (or desired) to occur multiple times, e.g. iipc/warc-specifications#42.
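As a rough illustration of how repeated headers in a single record can be detected, the following sketch counts header names that occur more than once in a list of raw header lines. The class `RepeatedHeaderCounter` is hypothetical and is not part of this PR. It assumes one header per line (no folded continuation lines) and compares names case-insensitively.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: finds header names that are repeated within one record. */
public class RepeatedHeaderCounter {

    /** Returns lower-cased header names occurring more than once, mapped to their counts. */
    public static Map<String, Integer> repeated(List<String> headerLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : headerLines) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // skip lines without a "name: value" shape, e.g. the status line
            }
            String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            counts.merge(name, 1, Integer::sum);
        }
        // Keep only the names that were seen at least twice.
        counts.values().removeIf(count -> count < 2);
        return counts;
    }
}
```

Summing the results over all response records of a WARC file would give per-header totals of the kind referred to above.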