Add URL normalisation when writing WARC Records #54
Conversation
sebastian-nagel
left a comment
There was a problem hiding this comment.
Hi @lfoppiano, thanks!
I see two major concerns:
-
Why is an issue of the WARC writer fixed in the robots.txt fetcher / parser? At least, for part of the problematic records, there was no issue following the robots.txt redirects. Eventually, the HTTP status 404 indicates an issue, e.g. for
https:////sites.google.com/site/lebercailgiteennormandie/robots.txt(redirected fromhttp://gitebercail.fr/robots.txt), see the URL index record. At least, right now the normalized URL also returns a 404.- Also: if it's fixed in the robots.txt fetcher, it should be an upstream fix. (The WARC writer is specific to Common Crawl's fork, at least, for now.)
-
The issue with the four slashes is just one of many potential issues. There is also the issue described in commoncrawl/cc-index-table#59, but there might be another one in the future. That's why a more general solution is preferred:
- Either call the configurable URL normalizers from the WARC writer. This should be done only if the URL
- Or use the normalized and "proven" URL from the HTTP protocol, see NUTCH-3173.
|
@sebastian-nagel sorry I made the wrong fix. So, I've reverted everything it and started from scratch. 😅 There is a truncated sentence in the review comment To be sure I understand properly:
|
|
As discussed I'm working on the upstream issue NUTCH-3173. I've been scouting the code and adding some unit tests around okhttp plugin. I also tested running nutch, indeed, HttpUrl would give a cleaner URLs for the invalid However, the URLs with punycodes are not parsed correctly. In fact HttpUrl just silently set them to null: Perhaps, to be further discussed. |
You mean with IDN host names in one of the standard Unicode representations (UTF-8, UTF-16, etc.), that is not in the ASCII representation defined by Punycode? Even then it seems to work: jshell> var u = HttpUrl.parse("https://🧠.s.country/")
u ==> https://xn--qv9h.s.country/
jshell> var u = HttpUrl.parse("https://xn--qv9h.s.country/");
u ==> https://xn--qv9h.s.country/
jshell> var u = HttpUrl.parse("https://%F0%9F%A7%A0.s.country/")
u ==> https://xn--qv9h.s.country/
jshell> okhttp3.OkHttp.VERSION
$11 ==> "5.3.2"In the Common Crawl Nutch setup, all URLs (again: except for robots.txt redirects) are passed through urlnormalizer-basic which is configured with |
|
OK. I see a potential issue. The version I see is |
b64a4b8 to
caa24b2
Compare
sebastian-nagel
left a comment
There was a problem hiding this comment.
Thanks, @lfoppiano! The code generally looks good, it does not compile though.
On a high level I have two major concerns - two because the PR addresses two issues:
-
A protocol implementation (a plugin) may change the URL really used for the request and response. That's not unique to protocol-okhttp. For example, protocol-httpclient allows to follow redirects (including setting cookies) within the request.
The issue is: How the real URL can be stored and passed back to the caller, in case it is needed. Whether this is for web archiving, debugging or anything else. -
The concrete problem causing downstream issues, when the responses of robots.txt redirects, which did not pass any URL normalization, are archived and the URL / URI has an unexpected form.
The question here: what is the purpose of the unit tests? Test the behavior of OkHttp? Only for these very specific examples? What if OkHttp changes it's behavior and decides to implement a new URL standard?
Any solution for point 1 should be done in upstream Nutch and it must be generalized, so any protocol plugin can implement it. A simpler solution, of course, is better.
The unit tests address point 2. It's good to know that using OkHttp's HttpUrl string would avoid the downstream issues.
In any case, I'd split the two points.
| public static final String RAW_URL = "_raw_url_"; | ||
|
|
||
| /** | ||
| * Key to hold the stringified URL after passing through @see HttpUrl |
There was a problem hiding this comment.
This breaks the plugin encapsulation. HttpUrl is not necessarily known to Nutch core classes.
There was a problem hiding this comment.
Any changes in the Response class or any of the core classes and interfaces, overridden or implemented in plugins need to generalize over all kinds of plugins.
This is also a strong argument to implement this in upstream Nutch.
| /** | ||
| * Key to hold the raw URL (it might not be fully compliant with encoding standards) | ||
| */ | ||
| public static final String RAW_URL = "_raw_url_"; |
There was a problem hiding this comment.
This would be the URL which could be retrieved per Response.getUrl() anyway. It's also known to the caller. So, no need to store it.
Another option would be to set the URL hold in the response to that one actually used in the request, that would be inline with the description of getUrl(). And would not require any new metadata.
There might be some undesired side effects, though. And should extend the documentation then, and add a note that the URL can change if required for the request.
| AtomicReference<String> seenHost = new AtomicReference<>(); | ||
| AtomicReference<String> seenUrl = new AtomicReference<>(); | ||
|
|
||
| Interceptor capture = chain -> { |
There was a problem hiding this comment.
Smart solution to emulate requests without relying on a network connection.
Yes that was not finished, in fact there were several things mixed up in the code, including tests that are not pertinent with this change anymore (we try to fix it upstream, instead of downstream).
Yes, I will split it, but lets' keep the discussion here which is easier
I've gave some though about it, the Response.java is an interface but we could have a second method
My idea is the following: we have specific needs, that are covered in the tests, indeed we cannot cover all issues, but if something changes in OkHttp the tests will fail and at least we know. I'd rather have more tests not fully related to the actual business than to have silent failing.
To clarify: with "upstream Nutch", do you mean nutch-core instead of nutch-plugins?
Yes I would keep them, do you think they could be pushed upstream too? |

This PR solves the problem reported in #53 when URLs are malformed (no hostname when passed through a URI) that we have experienced in commoncrawl/cc-index-table#58
The fix is applied to the
HttpRobotRulesParser.javaand uses the same code from the cc-index-table (See commoncrawl/cc-index-table#58). Not sure whether this can be moved to a shared library in a later time.The URL is changed only for redirects and when the
getHost()is empty.Testing by running Nutch directly on one of the frenzy URLs:
http://gitebercail.fr/robots.txt:and the WARC output is written as follow:
We should double check that the exit strategy is correct, case the sanitized URL fails to parse, we skip the cache
Finally, I added tests only on the normalization, while this PR is revised I can check how to create integration tests using a segment with the injected robot.txt URL.