The WARC writer should use URI.toASCIIString() instead of URI.toString(). The java.net.URI class deviates from RFC 2396 in that it allows non-ASCII Unicode characters (other than control characters), so toString() can return a URI that still contains them. Many WARC tools require URIs that comply with RFC 2396. See commoncrawl/ia-web-commons#27 for how this bug was detected.
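A minimal sketch of the difference between the two methods, using only java.net.URI. The class name UriAsciiDemo, the helper toWarcTargetUri and the example URL are hypothetical and chosen for illustration; they are not taken from the WARC writer code.

```java
import java.net.URI;

public class UriAsciiDemo {

    // Hypothetical helper: returns the string form that would be written
    // to a WARC header. toASCIIString() percent-encodes non-ASCII
    // characters as UTF-8 octets, so the result is plain US-ASCII.
    static String toWarcTargetUri(URI uri) {
        return uri.toASCIIString();
    }

    public static void main(String[] args) {
        // java.net.URI accepts the non-ASCII character U+00E9 in the path.
        URI uri = URI.create("http://example.com/caf\u00e9");

        // toString() returns the URI with the non-ASCII character unchanged.
        System.out.println(uri.toString());

        // Prints http://example.com/caf%C3%A9
        System.out.println(toWarcTargetUri(uri));
    }
}
```

For URIs that contain only US-ASCII characters, both methods return the same string, so the change should only affect records whose URIs contain non-ASCII characters.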