Skip to content

Add URL normalisation when writing WARC Records #54

Open
lfoppiano wants to merge 16 commits into
ccfrom
bugfix/robots-txt-redirect-url-normalization
Open

Add URL normalisation when writing WARC Records #54
lfoppiano wants to merge 16 commits into
ccfrom
bugfix/robots-txt-redirect-url-normalization

Conversation

@lfoppiano

@lfoppiano lfoppiano commented May 4, 2026

Copy link
Copy Markdown

This PR solves the problem reported in #53 when URLs are malformed (no hostname when passed through a URI) that we have experienced in commoncrawl/cc-index-table#58

The fix is applied to the HttpRobotRulesParser.java and uses the same code from the cc-index-table (See commoncrawl/cc-index-table#58). Not sure whether this can be moved to a shared library in a later time.

The URL is changed only for redirects and when the getHost() is empty.

Testing by running Nutch directly on one of the frenzy URLs: http://gitebercail.fr/robots.txt:

2026-05-04 15:04:00,894 INFO o.a.n.p.h.a.HttpRobotRulesParser [FetcherThread] Robots.txt redirect resolved to malformed URL https:////sites.google.com/site/lebercailgiteennormandie/robots.txt (initial: http://gitebercail.fr/robots.txt); will normalize

and the WARC output is written as follow:

2026-05-04 15:05:56,752 INFO o.c.u.WarcRecordWriter [pool-133-thread-1] WARC response record http://gitebercail.fr/robots.txt (2026-05-04T14:04:01Z, status: 301, size: 268)
2026-05-04 15:05:56,766 INFO o.c.u.WarcRecordWriter [pool-133-thread-1] WARC response record http://gitebercail.fr/robots.txt (2026-05-04T14:04:00Z, status: 301, size: 268)
2026-05-04 15:05:56,767 INFO o.c.u.WarcRecordWriter [pool-133-thread-1] WARC response record https://sites.google.com/site/lebercailgiteennormandie/robots.txt (2026-05-04T14:04:01Z, status: 404, size: 109539)

We should double check that the exit strategy is correct, case the sanitized URL fails to parse, we skip the cache

Finally, I added tests only on the normalization, while this PR is revised I can check how to create integration tests using a segment with the injected robot.txt URL.

@lfoppiano lfoppiano marked this pull request as ready for review May 4, 2026 15:22
@lfoppiano lfoppiano requested a review from sebastian-nagel May 4, 2026 15:22

@sebastian-nagel sebastian-nagel left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @lfoppiano, thanks!

I see two major concerns:

  1. Why is an issue of the WARC writer fixed in the robots.txt fetcher / parser? At least, for part of the problematic records, there was no issue following the robots.txt redirects. Eventually, the HTTP status 404 indicates an issue, e.g. for https:////sites.google.com/site/lebercailgiteennormandie/robots.txt (redirected from http://gitebercail.fr/robots.txt), see the URL index record. At least, right now the normalized URL also returns a 404.

    • Also: if it's fixed in the robots.txt fetcher, it should be an upstream fix. (The WARC writer is specific to Common Crawl's fork, at least, for now.)
  2. The issue with the four slashes is just one of many potential issues. There is also the issue described in commoncrawl/cc-index-table#59, but there might be another one in the future. That's why a more general solution is preferred:

    • Either call the configurable URL normalizers from the WARC writer. This should be done only if the URL
    • Or use the normalized and "proven" URL from the HTTP protocol, see NUTCH-3173.

@lfoppiano lfoppiano changed the title Add URL normalisation for invalid URLs for robots.txt redirect Add URL normalisation when writing WARC Records May 5, 2026
@lfoppiano

Copy link
Copy Markdown
Author

@sebastian-nagel sorry I made the wrong fix. So, I've reverted everything it and started from scratch. 😅

There is a truncated sentence in the review comment This should be done only if the URL....

To be sure I understand properly:

  • I'm starting by adding unit tests first (they are failing at the moment). Could you check whether they are correctly framed?
  • We are going to target any URL, not just the redirects of the robot.txt?
  • And, any WARC records (Request, Response, Metadata)? Also Revisit?

@lfoppiano

Copy link
Copy Markdown
Author

As discussed I'm working on the upstream issue NUTCH-3173.

I've been scouting the code and adding some unit tests around okhttp plugin.

I also tested running nutch, indeed, HttpUrl would give a cleaner URLs for the invalid //// slashes:

2026-05-11 09:05:50,825 INFO o.a.n.p.o.OkHttpResponse [FetcherThread] Original URL https:////sites.google.com/site/lebercailgiteennormandie/robots.txt
2026-05-11 09:05:50,825 INFO o.a.n.p.o.OkHttpResponse [FetcherThread] Stringified URL https://sites.google.com/site/lebercailgiteennormandie/robots.txt

However, the URLs with punycodes are not parsed correctly. In fact HttpUrl just silently set them to null:

    @JvmStatic
    @JvmName("parse") fun String.toHttpUrlOrNull(): HttpUrl? {
      return try {
        toHttpUrl()
      } catch (_: IllegalArgumentException) {
        null
      }
    }

Perhaps, to be further discussed.

@sebastian-nagel

sebastian-nagel commented May 12, 2026

Copy link
Copy Markdown

URLs with punycodes

You mean with IDN host names in one of the standard Unicode representations (UTF-8, UTF-16, etc.), that is not in the ASCII representation defined by Punycode?

Even then it seems to work:

jshell> var u = HttpUrl.parse("https://🧠.s.country/")
u ==> https://xn--qv9h.s.country/
jshell> var u = HttpUrl.parse("https://xn--qv9h.s.country/");
u ==> https://xn--qv9h.s.country/
jshell> var u = HttpUrl.parse("https://%F0%9F%A7%A0.s.country/")
u ==> https://xn--qv9h.s.country/
jshell> okhttp3.OkHttp.VERSION
$11 ==> "5.3.2"

In the Common Crawl Nutch setup, all URLs (again: except for robots.txt redirects) are passed through urlnormalizer-basic which is configured with urlnormalizer.basic.host.idn=toAscii. So, normally the URL should be already in the form used by HttpUrl.

@lfoppiano

Copy link
Copy Markdown
Author

I don't understand... this test (and other similar, e.g. with different pynycode) are failing:

    private static final String BRAIN_UNICODE   = "🧠";
    private static final String PARENT = ".s.country";
    private static final String PATH   = "/p/human-protocol-aligning-hearts-bots";

@Test
    public void unicodeHostNormalizesToPunycode() {
        HttpUrl url = HttpUrl.parse("https://" + BRAIN_UNICODE + PARENT + PATH);
        assertNotNull(url, "HttpUrl.parse must accept Unicode host");
        assertEquals(BRAIN_PUNYCODE + PARENT, url.host());
    }
image

@lfoppiano

lfoppiano commented May 12, 2026

Copy link
Copy Markdown
Author

OK. I see a potential issue. The version I see is 4.9.3.. I'm running the test directly in the plugin. This might be the difference.

@lfoppiano lfoppiano force-pushed the bugfix/robots-txt-redirect-url-normalization branch from b64a4b8 to caa24b2 Compare May 12, 2026 08:46

@sebastian-nagel sebastian-nagel left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @lfoppiano! The code generally looks good, it does not compile though.

On a high level I have two major concerns - two because the PR addresses two issues:

  1. A protocol implementation (a plugin) may change the URL really used for the request and response. That's not unique to protocol-okhttp. For example, protocol-httpclient allows to follow redirects (including setting cookies) within the request.
    The issue is: How the real URL can be stored and passed back to the caller, in case it is needed. Whether this is for web archiving, debugging or anything else.

  2. The concrete problem causing downstream issues, when the responses of robots.txt redirects, which did not pass any URL normalization, are archived and the URL / URI has an unexpected form.
    The question here: what is the purpose of the unit tests? Test the behavior of OkHttp? Only for these very specific examples? What if OkHttp changes it's behavior and decides to implement a new URL standard?

Any solution for point 1 should be done in upstream Nutch and it must be generalized, so any protocol plugin can implement it. A simpler solution, of course, is better.

The unit tests address point 2. It's good to know that using OkHttp's HttpUrl string would avoid the downstream issues.

In any case, I'd split the two points.

public static final String RAW_URL = "_raw_url_";

/**
* Key to hold the stringified URL after passing through @see HttpUrl

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks the plugin encapsulation. HttpUrl is not necessarily known to Nutch core classes.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any changes in the Response class or any of the core classes and interfaces, overridden or implemented in plugins need to generalize over all kinds of plugins.

This is also a strong argument to implement this in upstream Nutch.

/**
* Key to hold the raw URL (it might not be fully compliant with encoding standards)
*/
public static final String RAW_URL = "_raw_url_";

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be the URL which could be retrieved per Response.getUrl() anyway. It's also known to the caller. So, no need to store it.

Another option would be to set the URL hold in the response to that one actually used in the request, that would be inline with the description of getUrl(). And would not require any new metadata.
There might be some undesired side effects, though. And should extend the documentation then, and add a note that the URL can change if required for the request.

AtomicReference<String> seenHost = new AtomicReference<>();
AtomicReference<String> seenUrl = new AtomicReference<>();

Interceptor capture = chain -> {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Smart solution to emulate requests without relying on a network connection.

@lfoppiano

Copy link
Copy Markdown
Author

Thanks, @lfoppiano! The code generally looks good, it does not compile though.

Yes that was not finished, in fact there were several things mixed up in the code, including tests that are not pertinent with this change anymore (we try to fix it upstream, instead of downstream).

On a high level I have two major concerns - two because the PR addresses two issues:

Yes, I will split it, but lets' keep the discussion here which is easier

1. A protocol implementation (a plugin) may change the URL really used for the request and response. That's not unique to protocol-okhttp. For example, protocol-httpclient allows to follow redirects (including setting cookies) within the request.
   The issue is: How the real URL can be stored and passed back to the caller, in case it is needed. Whether this is for web archiving, debugging or anything else.

I've gave some though about it, the Response.java is an interface but we could have a second method getRequestURL() that needs to be implemented, that fetches the modified / sanitized URL that was used for the request. Would this work?

2. The concrete problem causing downstream issues, when the responses of robots.txt redirects, which did not pass any URL normalization, are archived and the URL / URI has an unexpected form.
   The question here: what is the purpose of the unit tests? Test the behavior of OkHttp? Only for these very specific examples? What if OkHttp changes it's behavior and decides to implement a new URL standard?

My idea is the following: we have specific needs, that are covered in the tests, indeed we cannot cover all issues, but if something changes in OkHttp the tests will fail and at least we know. I'd rather have more tests not fully related to the actual business than to have silent failing.

Any solution for point 1 should be done in upstream Nutch and it must be generalized, so any protocol plugin can implement it. A simpler solution, of course, is better.

To clarify: with "upstream Nutch", do you mean nutch-core instead of nutch-plugins?

The unit tests address point 2. It's good to know that using OkHttp's HttpUrl string would avoid the downstream issues.

Yes I would keep them, do you think they could be pushed upstream too?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants