WARC writer: improved verification or normalization of URIs used as WARC-Target-URI #53

Description

@sebastian-nagel

Nutch internally uses URLs, while WARC requires URIs as values in the WARC-Target-URI header. URLs are more permissive and may include characters such as \. The URL-to-URI conversion already filters out records whose URL cannot be converted to a URI.

However, even a valid URI does not guarantee that downstream processing is free from issues:

  1. For example, if the host name part of the URI is null, creating the index table fails; see "Fix URL decoding for robots.txt" (cc-index-table#58). The host name is mandatory and should always exist because only remote content is included in the WARC files.
  2. URLs stemming from robots.txt redirects are not normalized.
    • RFC 9309 only specifies that crawlers should follow at least five consecutive redirects; it does not require or even recommend normalization of redirect targets.
    • Robots.txt rules are handled inside HTTP protocol plugins. Calling other plugins (URL normalizers) from inside a plugin is not made easy because it would break the encapsulation of plugins.

To avoid these issues, URIs should be either
a. further verified (e.g., by checking whether they can be parsed into components), or
b. normalized.

Option b needs to be done with care because it may modify the URL in such a way that the HTTP request cannot be reproduced. Ideally, the URL used by the HTTP client for the HTTP request should be used. Cf. NUTCH-3173.
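
For illustration, option a might look roughly like the sketch below. It is not existing Nutch code: the class `WarcTargetUriCheck` and the method `isUsableTargetUri` are hypothetical names, and the assumption that a scheme and a non-empty host are the components downstream tools need is taken from point 1 above. The sketch only relies on `java.net.URI` from the JDK.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper (not part of Nutch): verifies a candidate WARC-Target-URI. */
final class WarcTargetUriCheck {

  private WarcTargetUriCheck() {}

  /**
   * Returns true only if the candidate parses as a URI and exposes
   * a scheme and a non-empty host name. Assumption: these are the
   * components downstream processing (e.g., the index table) requires.
   */
  static boolean isUsableTargetUri(String candidate) {
    if (candidate == null) {
      return false;
    }
    final URI uri;
    try {
      // Throws for illegal characters such as a backslash
      uri = new URI(candidate);
    } catch (URISyntaxException e) {
      return false;
    }
    if (uri.getScheme() == null) {
      return false;
    }
    // getHost() is null if there is no authority or if the authority
    // cannot be parsed as a server-based one
    String host = uri.getHost();
    return host != null && !host.isEmpty();
  }
}
```

Note that `java.net.URI` is fairly strict about host names (for instance, it reports a null host for names containing an underscore), so a check like this may reject some URLs that were fetched successfully. Whether such records should be dropped or written with a different target URI is a separate decision.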
