WARC writer: unit tests for conversion of URLs to URIs #21

@sebastian-nagel

Nutch uses instances of the class java.net.URL to represent the URLs being crawled. WARC files require URIs for the WARC-Target-URI header. While the conversion to a URI is unproblematic for most URLs, there are some issues:

  1. some instances of java.net.URL cannot be converted to java.net.URI, see URL.toURI(). Note: these URLs were fetched successfully!
  2. converting a java.net.URI to an ASCII-only URI is not free of pitfalls (see "WARC writer: use URI.toASCIIString() instead of URI.toString()", #20)
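
Both issues can be reproduced with the JDK alone. The following is a minimal sketch, not code from Nutch: the class name `UrlToUriIssues`, its method names, and the example.com URLs are made up for illustration. It only relies on `URL.toURI()`, `URI.toString()` and `URI.toASCIIString()`.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical demo class, not part of Nutch.
final class UrlToUriIssues {

    /** Issue 1: returns true if java.net.URL accepts the string but URL.toURI() rejects it. */
    static boolean toUriFails(String spec) {
        URL url;
        try {
            // java.net.URL validates very little (constructor is deprecated since Java 20)
            url = new URL(spec);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("not even a valid java.net.URL: " + spec, e);
        }
        try {
            url.toURI(); // strict parsing of url.toString()
            return false;
        } catch (URISyntaxException e) {
            return true;
        }
    }

    /** Issue 2: returns { URI.toString(), URI.toASCIIString() } for the same URI. */
    static String[] bothForms(String spec) {
        try {
            URI uri = new URI(spec);
            // toString() keeps non-ASCII characters, toASCIIString() percent-encodes them
            return new String[] { uri.toString(), uri.toASCIIString() };
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("not a valid java.net.URI: " + spec, e);
        }
    }

    public static void main(String[] args) {
        System.out.println("space in path fails: " + toUriFails("http://example.com/a b"));
        System.out.println("'|' in path fails:   " + toUriFails("http://example.com/a|b"));
        String[] forms = bothForms("http://example.com/caf\u00e9");
        System.out.println("toString():      " + forms[0]);
        System.out.println("toASCIIString(): " + forms[1]);
    }
}
```

Only the ASCII form is safe to write into a header that must be ASCII-only, which is the point made in #20.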

It would be good to have unit tests that reproduce and verify these issues, ideally together with "solutions" that make the conversion from URL to URI succeed. E.g.,

  • non-ASCII / Unicode components in URLs, including IDNs
  • encoding of white space in the URL path or query
  • encoding of characters invalid in URIs but valid in URLs
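
One possible shape for such tests, as a sketch under assumptions: the helper `toAsciiUri` below is hypothetical (it is not the WARC writer's actual code), and the test URLs are invented. It passes the URL components to the multi-argument `java.net.URI` constructor, which quotes illegal characters, and converts the host with `java.net.IDN.toASCII`. A real test would presumably use the project's JUnit setup; plain Java is used here to keep the example self-contained.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical candidate "solution" plus table-driven checks, not part of Nutch.
final class LenientUrlToUri {

    /** Converts a URL string to an ASCII-only URI string, quoting what URL.toURI() rejects. */
    static String toAsciiUri(String spec) {
        try {
            URL url = new URL(spec);
            // IDN host -> Punycode (ASCII-compatible encoding)
            String asciiHost = IDN.toASCII(url.getHost());
            // The multi-argument constructor quotes illegal characters in each component
            URI uri = new URI(url.getProtocol(), url.getUserInfo(), asciiHost,
                    url.getPort(), url.getPath(), url.getQuery(), url.getRef());
            // Percent-encode remaining non-ASCII characters as UTF-8
            return uri.toASCIIString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException("cannot convert: " + spec, e);
        }
    }

    public static void main(String[] args) {
        // { input URL, expected ASCII-only URI }
        String[][] cases = {
            { "http://b\u00fccher.example/", "http://xn--bcher-kva.example/" },        // IDN
            { "http://example.com/caf\u00e9", "http://example.com/caf%C3%A9" },        // non-ASCII path
            { "http://example.com/a b?q=x y", "http://example.com/a%20b?q=x%20y" },   // white space
            { "http://example.com/a|b", "http://example.com/a%7Cb" },                 // invalid in URIs
        };
        for (String[] c : cases) {
            String actual = toAsciiUri(c[0]);
            System.out.println((actual.equals(c[1]) ? "ok       " : "MISMATCH ") + actual);
        }
    }
}
```

A known limitation of this approach: the multi-argument `URI` constructors always quote the percent character, so a URL that is already percent-encoded (e.g. containing `%20`) would be encoded a second time. A real solution would need to handle that case, which is exactly the kind of pitfall the unit tests should pin down.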
