problems in writing proper CDX files with RealCDXExtractorOutput #37

@rjoberon

Thanks to @gerhardgossen's pull request #36, the most important problems with the redir field are now fixed. While investigating one of our crawls in more depth, I found further redir values that break the CDX file format because they contain spaces (I anonymized the mail addresses):

mailto: john.doe@uni-oldenburg.de
mailto:john.doe @Informatik.Uni-Oldenburg.DE
mailto:john.doe@blicher Tbinger Anhang
mailto:john.doe@uni-oldenburg.de?subject=Antrag auf SAP Zugang
E:/SmartSource Data Collector/util/content/wt_dcs.gif
ttp://find.galegroup.com/bncn/infomark.do?serQuery=Locale%28en%2C%2C%29%3AFQE%3D%28JX%2CNone%2C16%29%22Dublin Gazette%22%24&queryType=PH&type=pubIssues&prodId=BBCN&version=1.0&source=library

So the main causes I found are

  1. spaces in e-mail addresses (in all parts),
  2. links to local files (without protocol), and
  3. broken protocol names.

In summary, broken URIs can lead to broken CDX files, which I think should not be the case.
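
One possible way to keep such values from breaking the file is to percent-encode any whitespace before the redir value is written. The following is only a sketch in Python, not the extractor's actual code; the function name `sanitize_cdx_field` is hypothetical, and using "-" for an empty value is an assumption about how the column should be filled.

```python
import re
from urllib.parse import quote


def sanitize_cdx_field(value):
    """Return `value` in a form that fits into one space-delimited column.

    Hypothetical helper, not part of the extractor.
    """
    # Assumption: an absent or empty value is written as "-".
    if value is None or value.strip() == "":
        return "-"
    # Percent-encode every whitespace character (space, tab, newline, ...)
    # so the value can no longer be split into several columns.
    return re.sub(r"\s", lambda m: quote(m.group(), safe=""), value.strip())


print(sanitize_cdx_field("mailto: john.doe@uni-oldenburg.de"))
print(sanitize_cdx_field("E:/SmartSource Data Collector/util/content/wt_dcs.gif"))
```

This leaves the broken URI recognizable (it is not repaired or dropped), but guarantees that it occupies exactly one column.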

Another issue I found was a CDX line that did not contain a MIME type column, which causes similar problems.
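
A missing column could be guarded against by always writing a fixed set of columns and filling absent values with a placeholder. The sketch below (Python, for illustration only) assumes "-" as the placeholder; the column names in `CDX_COLUMNS` and the function `format_cdx_line` are hypothetical and do not reflect the column layout the extractor actually writes.

```python
# Hypothetical column order, chosen only for this example.
CDX_COLUMNS = ("url_key", "timestamp", "original_url",
               "mime_type", "status", "digest", "redirect")


def format_cdx_line(record, columns=CDX_COLUMNS):
    """Build one space-delimited line with exactly len(columns) fields."""
    parts = []
    for name in columns:
        value = record.get(name)
        text = "" if value is None else str(value).strip()
        if text == "":
            # Assumption: a missing value is written as "-" so that the
            # following columns do not shift to the left.
            text = "-"
        if any(ch.isspace() for ch in text):
            # Refuse to write a value that would add extra columns.
            raise ValueError(f"whitespace in column {name!r}: {text!r}")
        parts.append(text)
    return " ".join(parts)


record = {
    "url_key": "de,example)/",
    "timestamp": "20150101000000",
    "original_url": "http://example.de/",
    "status": 200,
    "digest": "ABC",
    # "mime_type" and "redirect" are absent on purpose
}
print(format_cdx_line(record))
```

With this approach every line has the same number of columns, so a reader that splits on spaces never sees a shifted field.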
