WARC writer (CDX writer): new optional CDX JSON fields "redirect" and "truncated" #15

Merged: sebastian-nagel merged 1 commit into cc from warc-cdx-mark-truncation-and-redirects on Nov 12, 2019

Conversation

sebastian-nagel commented Nov 7, 2019

  • add key "truncated" if the record payload is truncated, indicating the reason for the truncation, cf.
    WARC-Truncated in the WARC 1.1 spec
  • add key "redirect" containing the redirect target
    • taken from the HTTP header field "Location" if the HTTP status code indicates an HTTP redirect
    • relative paths are converted to absolute URLs using the page URL as base/context
    • absent if the "Location" value is missing or is neither a valid URL nor a valid relative URL path
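
The rules above for the "redirect" key can be sketched roughly as follows. This is only an illustration in Python, not the code of this PR: the function name `redirect_target`, the set of status codes treated as redirects, and the validity check (an http/https URL with a host) are all assumptions.

```python
from typing import Optional
from urllib.parse import urljoin, urlsplit

# Assumption: which status codes count as "HTTP redirect" is not stated in the PR.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def redirect_target(page_url: str, status: int, location: Optional[str]) -> Optional[str]:
    """Return the absolute redirect target, or None if the key should be absent."""
    if status not in REDIRECT_STATUSES:
        return None
    if location is None or not location.strip():
        return None  # "Location" header missing or empty
    try:
        # Resolve relative paths against the page URL (base/context).
        target = urljoin(page_url, location.strip())
        parts = urlsplit(target)
    except ValueError:
        return None  # not parseable as a URL
    # Assumption: "valid" here means an http(s) URL with a host.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    return target
```

With this sketch, a relative `Location` such as `/big-picture/frequently-asked-questions/` on the page `https://commoncrawl.org/faq/` resolves to `https://commoncrawl.org/big-picture/frequently-asked-questions/`, while a non-redirect status or a missing header yields `None`, i.e. no "redirect" key.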

Example CDX snippets (multi-line JSON):

  • redirect target/location
org,commoncrawl)/faq 20191107134157 {
  "url": "https://commoncrawl.org/faq/",
  ...,
  "status": "301",
  ...,
  "redirect": "http://commoncrawl.org/big-picture/frequently-asked-questions/"
}
  • truncation because of overlong content
es,remax,inmomas)/robots.txt 20191107134158 {
  "url": "http://www.inmomas.remax.es/robots.txt",
  ...,
  "status": "200",
  ...,
  "truncated": "length"
}
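
A consumer of the index might read the two optional keys along these lines. This is a minimal sketch, assuming each CDX entry is stored on a single line as key, timestamp, and JSON object separated by single spaces; the helper `parse_cdx_line` is hypothetical, and the sample line contains only the fields visible in the snippet above (the elided ones are left out).

```python
import json
from typing import Any, Dict, Tuple


def parse_cdx_line(line: str) -> Tuple[str, str, Dict[str, Any]]:
    """Split one CDX line into (key, timestamp, JSON fields)."""
    # Split on the first two spaces only; the JSON part may itself contain spaces.
    key, timestamp, payload = line.rstrip("\n").split(" ", 2)
    return key, timestamp, json.loads(payload)


line = (
    'org,commoncrawl)/faq 20191107134157 '
    '{"url": "https://commoncrawl.org/faq/", "status": "301", '
    '"redirect": "http://commoncrawl.org/big-picture/frequently-asked-questions/"}'
)
key, timestamp, fields = parse_cdx_line(line)

# Both keys are optional, so use .get() rather than indexing.
redirect = fields.get("redirect")    # redirect target, or None if absent
truncated = fields.get("truncated")  # truncation reason, or None if absent
```

Because both fields are optional, readers that index the JSON directly (`fields["truncated"]`) would fail on records that are neither redirects nor truncated; `.get()` keeps older and newer index lines readable by the same code.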

- add key "truncated" if the record payload is truncated,
  indicating the reason for the truncation, cf.
  http://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-truncated
- add key "redirect" containing the redirect target
  - from HTTP header field "Location"
  - relative paths converted to absolute URLs
    using the page URL as base/context
  - absent if the "Location" string is not a valid URL
    or relative URL path
sebastian-nagel changed the title from 'WARC writer (CDX writer): new CDX fields/keys in JSON data' to 'WARC writer (CDX writer): new optional CDX JSON fields "redirect" and "truncated"' on Nov 8, 2019
sebastian-nagel merged commit adfcc45 into cc on Nov 12, 2019
sebastian-nagel deleted the warc-cdx-mark-truncation-and-redirects branch on November 12, 2019 13:23