Thanks to @gerhardgossen's pull request #36, the most important problems with the `redir` field are now fixed. Investigating one of our crawls in more depth, I found further `redir` values that break the CDX file format due to spaces (I anonymized the mail addresses):
mailto: john.doe@uni-oldenburg.de
mailto:john.doe @Informatik.Uni-Oldenburg.DE
mailto:john.doe@blicher Tbinger Anhang
mailto:john.doe@uni-oldenburg.de?subject=Antrag auf SAP Zugang
E:/SmartSource Data Collector/util/content/wt_dcs.gif
ttp://find.galegroup.com/bncn/infomark.do?serQuery=Locale%28en%2C%2C%29%3AFQE%3D%28JX%2CNone%2C16%29%22Dublin Gazette%22%24&queryType=PH&type=pubIssues&prodId=BBCN&version=1.0&source=library
So the main reasons I found are
- spaces in e-mail addresses (in all parts),
- links to local files (without protocol), and
- broken protocol names
which can be summarized as: broken URIs can cause broken CDX files, which I think should not happen.
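One possible way to keep such values from breaking the space-delimited format would be to percent-encode whitespace in the `redir` value before the line is written. The following is only a minimal Python sketch of that idea, not the tool's actual code: `escape_cdx_field` is a hypothetical helper name, and using `-` for an empty value is an assumption based on the usual CDX placeholder convention.

```python
import re


def escape_cdx_field(value: str) -> str:
    """Return `value` with all whitespace percent-encoded.

    Hypothetical helper: the result contains no whitespace, so it stays
    a single column in a space-delimited CDX line. An empty value becomes
    "-" (assumed placeholder for a missing field).
    """
    if not value:
        return "-"

    def encode(match: re.Match) -> str:
        # Percent-encode every UTF-8 byte of the matched whitespace character.
        return "".join(f"%{byte:02X}" for byte in match.group().encode("utf-8"))

    return re.sub(r"\s", encode, value)
```

With this sketch, `mailto: john.doe@uni-oldenburg.de` would be written as `mailto:%20john.doe@uni-oldenburg.de`. The broken URI is preserved as found in the crawl instead of being repaired, so the CDX line stays parseable without hiding the original problem.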
Another issue I found was a CDX line that did not contain a MIME type column, which causes similar problems: the line has too few columns instead of too many.
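Both kinds of breakage (extra columns from unescaped spaces, missing columns such as an absent MIME type) show up as a column count that differs from the header. A rough Python sketch for locating such lines follows; it assumes a space-delimited file whose header line starts with ` CDX ` followed by the field letters, and `find_malformed_lines` is a hypothetical name used only for illustration.

```python
def find_malformed_lines(lines):
    """Return (line_number, column_count) for lines that do not match the header.

    Assumes a space-delimited CDX file with a header such as " CDX a b m r".
    Lines before the first header and empty lines are skipped.
    """
    expected = None
    malformed = []
    for number, line in enumerate(lines, start=1):
        if line.startswith(" CDX "):
            # The header lists one letter per column after the "CDX" token.
            expected = len(line.split()) - 1
            continue
        if expected is None or not line.strip():
            continue
        count = len(line.split())
        if count != expected:
            malformed.append((number, count))
    return malformed
```

Running this over a crawl's CDX output reports the line numbers to inspect, which is roughly how the examples above can be found without reading the file by hand.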