If the HTTP header "Location" is not written in exactly this form (e.g. in all lowercase letters), it is not extracted and added to the CDX as the field "redirect". The extraction should, of course, be case-insensitive.
Seen in CC-MAIN-2020-24, to be fixed in CC-MAIN-2020-33 (August crawl). Partially fixed (not for all segments) in CC-MAIN-2020-29.
Example:
{
"status": "301",
"url": "http://1001moviespodcast.libsyn.com/episode-25-the-great-escape-1963",
"filename": "crawl-data/CC-MAIN-2020-24/segments/1590347417746.33/crawldiagnostics/CC-MAIN-20200601113849-20200601143849-00536.warc.gz",
"offset": "18536",
"length": "3691"
}
HTTP/1.1 301 Moved Permanently
server: Apache/2.2.15 (CentOS)
location: https://1001moviespodcast.libsyn.com/episode-25-the-great-escape-1963
...
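As an illustration of the case-insensitive extraction called for above, here is a minimal sketch. It is not the code of the actual indexer: the function name `extract_redirect` and the plain-string input are assumptions made for this example only.

```python
from typing import Optional


def extract_redirect(raw_response_head: str) -> Optional[str]:
    """Return the value of the Location header, or None if it is absent.

    Hypothetical helper: the header name is compared case-insensitively,
    so "Location", "location" and "LOCATION" are all accepted.
    """
    lines = raw_response_head.splitlines()
    for line in lines[1:]:  # skip the status line
        if not line.strip():
            break  # a blank line ends the header block
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "location":
            return value.strip()
    return None


# The response from the example above, with a lowercase header name.
head = (
    "HTTP/1.1 301 Moved Permanently\r\n"
    "server: Apache/2.2.15 (CentOS)\r\n"
    "location: https://1001moviespodcast.libsyn.com/episode-25-the-great-escape-1963\r\n"
    "\r\n"
)
redirect = extract_redirect(head)
```

Normalizing the header name with `lower()` before comparing is enough here, because HTTP field names are ASCII and defined to be case-insensitive.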