Make MetaData multi-valued to preserve values of repeating WARC and HTTP headers #98

Merged: ato merged 2 commits into iipc:master from sebastian-nagel:metadata-multivalued on Nov 27, 2024

@sebastian-nagel (Collaborator)

MetaData objects, which hold (among other things) the headers of WARC records and HTTP captures, should be multi-valued so that the values of repeated headers are stored as a list.
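To illustrate the idea of multi-valued metadata, here is a minimal sketch in Python. It is not the code of this PR: the class `MultiValuedMetaData` and its methods `add`, `get` and `get_all` are hypothetical names chosen for illustration, and the choice to keep a single value as a plain string and switch to a list only on repetition is an assumption about one reasonable design.

```python
class MultiValuedMetaData:
    """Hypothetical container: a header seen once keeps a plain string value,
    a header seen several times gets a list of all values in order."""

    def __init__(self):
        self._data = {}

    def add(self, name, value):
        # First occurrence: store the bare value.
        if name not in self._data:
            self._data[name] = value
        # Third and later occurrences: append to the existing list.
        elif isinstance(self._data[name], list):
            self._data[name].append(value)
        # Second occurrence: promote the single value to a list.
        else:
            self._data[name] = [self._data[name], value]

    def get(self, name):
        # Returns a string, a list of strings, or None if absent.
        return self._data.get(name)

    def get_all(self, name):
        # Always returns a list, which is easier for callers to iterate over.
        value = self._data.get(name)
        if value is None:
            return []
        return value if isinstance(value, list) else [value]


headers = MultiValuedMetaData()
headers.add("Content-Type", "text/html")
headers.add("Set-Cookie", "a=1")
headers.add("Set-Cookie", "b=2")
```

With this layout, consumers that only ever saw single values keep working for non-repeated headers, while `get_all` gives a uniform list view. The sketch does not normalize header names, although HTTP header names are case-insensitive.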

The core objective is to get repeated WARC and HTTP headers extracted into WAT files; see also commoncrawl#18. The WAT specification says nothing about repeated headers, and the examples it gives do not include any repeated header.

Apart from the ubiquitous "Set-Cookie" HTTP header, more and more headers are repeated within the HTTP header section. As an example, here is the number of WARC response records (out of 31498) from a single Common Crawl WARC file in which a given HTTP header was repeated:

8356    set-cookie
4959    link
2022    server-timing
1321    vary
 983    x-powered-by
 592    cache-control
 361    x-frame-options
 285    x-content-type-options
 246    strict-transport-security
 155    x-xss-protection
  88    content-security-policy
  84    referrer-policy
  42    simplycom-server
  37    server
  31    x-permitted-cross-domain-policies
  28    pragma
 ...
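The PR does not say how these counts were obtained. A tally of this kind could be computed along the following lines, assuming each record's HTTP headers are already available as a list of `(name, value)` pairs. The function name `count_repeated_headers` and the sample data are made up for illustration.

```python
from collections import Counter


def count_repeated_headers(records):
    """Count, per header name, the records in which that header occurs
    more than once. `records` is an iterable of lists of (name, value)."""
    repeated = Counter()
    for headers in records:
        # Header names are case-insensitive, so compare them lowercased.
        per_record = Counter(name.lower() for name, _ in headers)
        # Each record contributes at most 1 to a header's tally.
        repeated.update(name for name, n in per_record.items() if n > 1)
    return repeated


# Made-up sample input: three records with their header lists.
sample_records = [
    [("Set-Cookie", "a=1"), ("set-cookie", "b=2"), ("Vary", "Accept")],
    [("Set-Cookie", "c=3"), ("Set-Cookie", "d=4"),
     ("Link", "<x>"), ("Link", "<y>")],
    [("Server", "nginx")],
]

counts = count_repeated_headers(sample_records)
for name, n in counts.most_common():
    print(f"{n:5d}    {name}")
```

Counting records rather than header occurrences matches the wording above: a record with five "Set-Cookie" headers still adds only one to the tally.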

See also the WARC response record included in this PR and used as test resource.

In addition, proposed WARC headers are allowed (or desired) to occur multiple times, e.g. iipc/warc-specifications#42.

…TTP headers

- code cleanup: fix indentation, remove unneeded return statements

@ato (Member) commented Nov 29, 2024

Thanks. Released as 1.2.0.
