Nutch internally uses URLs while WARC requires URIs as values in the WARC-Target-URI header. URLs are more permissive and may include characters such as the backslash (\). The conversion from URL to URI already filters out records where the URL could not be converted to a URI.
However, even a valid URI does not guarantee that downstream processing is free from issues:
E.g., if the host name part of the URI is null, the index table cannot be created, see "Fix URL decoding for robots.txt" (cc-index-table#58). The host name is mandatory and should always exist because only remote content is included in the WARC files.
URLs stemming from robots.txt redirects are not normalized.
RFC 9309 only specifies that crawlers should follow at least five consecutive redirects; it does not require or even recommend normalization of redirect targets.
Robots.txt rules are handled inside HTTP protocol plugins. Calling other plugins (URL normalizers) from inside a plugin is not made easy because it would break the encapsulation of plugins.
In order to avoid issues, URIs should be
a. further verified (e.g., check whether they can be parsed into components)
b. or normalized
Option b needs to be done with care because it may modify the URL in such a way that the HTTP request cannot be reproduced. Ideally, the URL used by the HTTP client for the HTTP request should be used. Cf. NUTCH-3173.
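As a rough illustration of option a, the check could look like the sketch below. It is not taken from the Nutch code base: the class `WarcTargetUriCheck` and its method `verify` are hypothetical names, and the sketch assumes that parsing with `java.net.URI` and requiring an http(s) scheme plus a non-empty host is a sufficient verification. It does not modify the URL, so the original HTTP request stays reproducible.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

/** Hypothetical helper (not part of Nutch): verifies a candidate WARC-Target-URI. */
final class WarcTargetUriCheck {

    private WarcTargetUriCheck() {}

    /**
     * Parses the URL string into URI components without changing it.
     * Returns the parsed URI only if it has an http(s) scheme and a
     * non-empty host; otherwise returns an empty Optional so the caller
     * can skip the record or fall back to another strategy.
     */
    static Optional<URI> verify(String url) {
        final URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            // Not a valid URI at all, e.g. because of an unescaped backslash.
            return Optional.empty();
        }
        String scheme = uri.getScheme();
        if (scheme == null
                || !(scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))) {
            return Optional.empty();
        }
        // A syntactically valid URI may still lack a host component.
        String host = uri.getHost();
        if (host == null || host.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(uri);
    }
}
```

A caller would write the WARC record only if `WarcTargetUriCheck.verify(url).isPresent()`. A string such as `http:///robots.txt` is accepted by `java.net.URI` but has no host, so it would be rejected by this check.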