Add URL normalisation when writing WARC Records by lfoppiano · Pull Request #54 · commoncrawl/nutch

lfoppiano · 2026-05-04T14:28:36Z

This PR solves the problem reported in #53 when URLs are malformed (no hostname when passed through a URI) that we have experienced in commoncrawl/cc-index-table#58

The fix is applied to the HttpRobotRulesParser.java and uses the same code from the cc-index-table (See commoncrawl/cc-index-table#58). Not sure whether this can be moved to a shared library in a later time.

The URL is changed only for redirects and when the getHost() is empty.

Testing by running Nutch directly on one of the frenzy URLs: http://gitebercail.fr/robots.txt:

2026-05-04 15:04:00,894 INFO o.a.n.p.h.a.HttpRobotRulesParser [FetcherThread] Robots.txt redirect resolved to malformed URL https:////sites.google.com/site/lebercailgiteennormandie/robots.txt (initial: http://gitebercail.fr/robots.txt); will normalize

and the WARC output is written as follow:

2026-05-04 15:05:56,752 INFO o.c.u.WarcRecordWriter [pool-133-thread-1] WARC response record http://gitebercail.fr/robots.txt (2026-05-04T14:04:01Z, status: 301, size: 268)
2026-05-04 15:05:56,766 INFO o.c.u.WarcRecordWriter [pool-133-thread-1] WARC response record http://gitebercail.fr/robots.txt (2026-05-04T14:04:00Z, status: 301, size: 268)
2026-05-04 15:05:56,767 INFO o.c.u.WarcRecordWriter [pool-133-thread-1] WARC response record https://sites.google.com/site/lebercailgiteennormandie/robots.txt (2026-05-04T14:04:01Z, status: 404, size: 109539)

We should double check that the exit strategy is correct, case the sanitized URL fails to parse, we skip the cache

Finally, I added tests only on the normalization, while this PR is revised I can check how to create integration tests using a segment with the injected robot.txt URL.

sebastian-nagel

Hi @lfoppiano, thanks!

I see two major concerns:

Why is an issue of the WARC writer fixed in the robots.txt fetcher / parser? At least, for part of the problematic records, there was no issue following the robots.txt redirects. Eventually, the HTTP status 404 indicates an issue, e.g. for https:////sites.google.com/site/lebercailgiteennormandie/robots.txt (redirected from http://gitebercail.fr/robots.txt), see the URL index record. At least, right now the normalized URL also returns a 404.
- Also: if it's fixed in the robots.txt fetcher, it should be an upstream fix. (The WARC writer is specific to Common Crawl's fork, at least, for now.)
The issue with the four slashes is just one of many potential issues. There is also the issue described in commoncrawl/cc-index-table#59, but there might be another one in the future. That's why a more general solution is preferred:
- Either call the configurable URL normalizers from the WARC writer. This should be done only if the URL
- Or use the normalized and "proven" URL from the HTTP protocol, see NUTCH-3173.

lfoppiano · 2026-05-05T09:22:04Z

@sebastian-nagel sorry I made the wrong fix. So, I've reverted everything it and started from scratch. 😅

There is a truncated sentence in the review comment This should be done only if the URL....

To be sure I understand properly:

I'm starting by adding unit tests first (they are failing at the moment). Could you check whether they are correctly framed?
We are going to target any URL, not just the redirects of the robot.txt?
And, any WARC records (Request, Response, Metadata)? Also Revisit?

lfoppiano · 2026-05-11T10:11:25Z

As discussed I'm working on the upstream issue NUTCH-3173.

I've been scouting the code and adding some unit tests around okhttp plugin.

I also tested running nutch, indeed, HttpUrl would give a cleaner URLs for the invalid //// slashes:

2026-05-11 09:05:50,825 INFO o.a.n.p.o.OkHttpResponse [FetcherThread] Original URL https:////sites.google.com/site/lebercailgiteennormandie/robots.txt
2026-05-11 09:05:50,825 INFO o.a.n.p.o.OkHttpResponse [FetcherThread] Stringified URL https://sites.google.com/site/lebercailgiteennormandie/robots.txt

However, the URLs with punycodes are not parsed correctly. In fact HttpUrl just silently set them to null:

    @JvmStatic
    @JvmName("parse") fun String.toHttpUrlOrNull(): HttpUrl? {
      return try {
        toHttpUrl()
      } catch (_: IllegalArgumentException) {
        null
      }
    }

Perhaps, to be further discussed.

sebastian-nagel · 2026-05-12T07:27:51Z

URLs with punycodes

You mean with IDN host names in one of the standard Unicode representations (UTF-8, UTF-16, etc.), that is not in the ASCII representation defined by Punycode?

Even then it seems to work:

jshell> var u = HttpUrl.parse("https://🧠.s.country/")
u ==> https://xn--qv9h.s.country/
jshell> var u = HttpUrl.parse("https://xn--qv9h.s.country/");
u ==> https://xn--qv9h.s.country/
jshell> var u = HttpUrl.parse("https://%F0%9F%A7%A0.s.country/")
u ==> https://xn--qv9h.s.country/
jshell> okhttp3.OkHttp.VERSION
$11 ==> "5.3.2"

In the Common Crawl Nutch setup, all URLs (again: except for robots.txt redirects) are passed through urlnormalizer-basic which is configured with urlnormalizer.basic.host.idn=toAscii. So, normally the URL should be already in the form used by HttpUrl.

lfoppiano · 2026-05-12T07:37:10Z

I don't understand... this test (and other similar, e.g. with different pynycode) are failing:

    private static final String BRAIN_UNICODE   = "🧠";
    private static final String PARENT = ".s.country";
    private static final String PATH   = "/p/human-protocol-aligning-hearts-bots";

@Test
    public void unicodeHostNormalizesToPunycode() {
        HttpUrl url = HttpUrl.parse("https://" + BRAIN_UNICODE + PARENT + PATH);
        assertNotNull(url, "HttpUrl.parse must accept Unicode host");
        assertEquals(BRAIN_PUNYCODE + PARENT, url.host());
    }

lfoppiano · 2026-05-12T07:42:39Z

OK. I see a potential issue. The version I see is 4.9.3.. I'm running the test directly in the plugin. This might be the difference.

This reverts commit 02173bc.

This reverts commit 7afb835.

sebastian-nagel

Thanks, @lfoppiano! The code generally looks good, it does not compile though.

On a high level I have two major concerns - two because the PR addresses two issues:

A protocol implementation (a plugin) may change the URL really used for the request and response. That's not unique to protocol-okhttp. For example, protocol-httpclient allows to follow redirects (including setting cookies) within the request.
The issue is: How the real URL can be stored and passed back to the caller, in case it is needed. Whether this is for web archiving, debugging or anything else.
The concrete problem causing downstream issues, when the responses of robots.txt redirects, which did not pass any URL normalization, are archived and the URL / URI has an unexpected form.
The question here: what is the purpose of the unit tests? Test the behavior of OkHttp? Only for these very specific examples? What if OkHttp changes it's behavior and decides to implement a new URL standard?

Any solution for point 1 should be done in upstream Nutch and it must be generalized, so any protocol plugin can implement it. A simpler solution, of course, is better.

The unit tests address point 2. It's good to know that using OkHttp's HttpUrl string would avoid the downstream issues.

In any case, I'd split the two points.

sebastian-nagel · 2026-05-14T12:43:16Z

+  public static final String RAW_URL = "_raw_url_";
+
+  /**
+   * Key to hold the stringified URL after passing through @see HttpUrl


This breaks the plugin encapsulation. HttpUrl is not necessarily known to Nutch core classes.

Any changes in the Response class or any of the core classes and interfaces, overridden or implemented in plugins need to generalize over all kinds of plugins.

This is also a strong argument to implement this in upstream Nutch.

sebastian-nagel · 2026-05-14T12:58:45Z

+  /**
+   * Key to hold the raw URL (it might not be fully compliant with encoding standards)
+   */
+  public static final String RAW_URL = "_raw_url_";


This would be the URL which could be retrieved per Response.getUrl() anyway. It's also known to the caller. So, no need to store it.

Another option would be to set the URL hold in the response to that one actually used in the request, that would be inline with the description of getUrl(). And would not require any new metadata.
There might be some undesired side effects, though. And should extend the documentation then, and add a note that the URL can change if required for the request.

sebastian-nagel · 2026-05-14T13:23:40Z

+        AtomicReference<String> seenHost   = new AtomicReference<>();
+        AtomicReference<String> seenUrl    = new AtomicReference<>();
+
+        Interceptor capture = chain -> {


Smart solution to emulate requests without relying on a network connection.

lfoppiano · 2026-05-18T13:54:30Z

Thanks, @lfoppiano! The code generally looks good, it does not compile though.

Yes that was not finished, in fact there were several things mixed up in the code, including tests that are not pertinent with this change anymore (we try to fix it upstream, instead of downstream).

On a high level I have two major concerns - two because the PR addresses two issues:

Yes, I will split it, but lets' keep the discussion here which is easier

1. A protocol implementation (a plugin) may change the URL really used for the request and response. That's not unique to protocol-okhttp. For example, protocol-httpclient allows to follow redirects (including setting cookies) within the request.
   The issue is: How the real URL can be stored and passed back to the caller, in case it is needed. Whether this is for web archiving, debugging or anything else.

I've gave some though about it, the Response.java is an interface but we could have a second method getRequestURL() that needs to be implemented, that fetches the modified / sanitized URL that was used for the request. Would this work?

2. The concrete problem causing downstream issues, when the responses of robots.txt redirects, which did not pass any URL normalization, are archived and the URL / URI has an unexpected form.
   The question here: what is the purpose of the unit tests? Test the behavior of OkHttp? Only for these very specific examples? What if OkHttp changes it's behavior and decides to implement a new URL standard?

My idea is the following: we have specific needs, that are covered in the tests, indeed we cannot cover all issues, but if something changes in OkHttp the tests will fail and at least we know. I'd rather have more tests not fully related to the actual business than to have silent failing.

Any solution for point 1 should be done in upstream Nutch and it must be generalized, so any protocol plugin can implement it. A simpler solution, of course, is better.

To clarify: with "upstream Nutch", do you mean nutch-core instead of nutch-plugins?

The unit tests address point 2. It's good to know that using OkHttp's HttpUrl string would avoid the downstream issues.

Yes I would keep them, do you think they could be pushed upstream too?

This reverts commit bbc1bb5.

This reverts commit f9fa3f1.

lfoppiano marked this pull request as ready for review May 4, 2026 15:22

lfoppiano requested a review from sebastian-nagel May 4, 2026 15:22

sebastian-nagel reviewed May 4, 2026

View reviewed changes

Comment thread src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestHttpRobotRulesParser.java Outdated

lfoppiano changed the title ~~Add URL normalisation for invalid URLs for robots.txt redirect~~ Add URL normalisation when writing WARC Records May 5, 2026

lfoppiano added 10 commits May 12, 2026 09:02

fix: apply URL normalisation for robots.txt redirect

fc28d7f

fix: exit strategy

5ef314c

feat: add unit tests on the normalization method

d18e8f5

Revert "fix: exit strategy"

f0227a2

This reverts commit 02173bc.

Revert "fix: apply URL normalisation for robots.txt redirect"

26e3a19

This reverts commit 7afb835.

fix: cleanup

2a297bf

fix: remove assertion on the utility class for reading segments

85e67a7

feat: add a test to simulate the required behaviour

c08944d

tests: cover also request and metadata

85698d4

fix: target the right WARC field

caa24b2

lfoppiano force-pushed the bugfix/robots-txt-redirect-url-normalization branch from b64a4b8 to caa24b2 Compare May 12, 2026 08:46

lfoppiano added 2 commits May 12, 2026 14:09

feat: add tests with normalization using HttpUrl

b41a29c

feat: store stringified URL

bbc1bb5

sebastian-nagel requested changes May 14, 2026

View reviewed changes

lfoppiano added 2 commits May 18, 2026 12:33

fix: missing import

f9fa3f1

fix: disable tests related to a part that may be out of scope in this PR

a5a4394

lfoppiano mentioned this pull request May 20, 2026

Allow caller to retrieve the URL used for the request #56

Open

lfoppiano added 2 commits May 22, 2026 12:31

Revert "feat: store stringified URL"

d706ed6

This reverts commit bbc1bb5.

Revert "fix: missing import"

31a2324

This reverts commit f9fa3f1.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add URL normalisation when writing WARC Records #54

Add URL normalisation when writing WARC Records #54
lfoppiano wants to merge 16 commits into
ccfrom
bugfix/robots-txt-redirect-url-normalization

lfoppiano commented May 4, 2026 •

edited

Loading

Uh oh!

sebastian-nagel left a comment

Uh oh!

Uh oh!

lfoppiano commented May 5, 2026

Uh oh!

lfoppiano commented May 11, 2026

Uh oh!

sebastian-nagel commented May 12, 2026 •

edited

Loading

Uh oh!

lfoppiano commented May 12, 2026

Uh oh!

lfoppiano commented May 12, 2026 •

edited

Loading

Uh oh!

sebastian-nagel left a comment

Uh oh!

sebastian-nagel May 14, 2026

Uh oh!

sebastian-nagel May 14, 2026

Uh oh!

sebastian-nagel May 14, 2026

Uh oh!

sebastian-nagel May 14, 2026

Uh oh!

lfoppiano commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

lfoppiano commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sebastian-nagel left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

lfoppiano commented May 5, 2026

Uh oh!

lfoppiano commented May 11, 2026

Uh oh!

sebastian-nagel commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lfoppiano commented May 12, 2026

Uh oh!

lfoppiano commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sebastian-nagel left a comment

Choose a reason for hiding this comment

Uh oh!

sebastian-nagel May 14, 2026

Choose a reason for hiding this comment

Uh oh!

sebastian-nagel May 14, 2026

Choose a reason for hiding this comment

Uh oh!

sebastian-nagel May 14, 2026

Choose a reason for hiding this comment

Uh oh!

sebastian-nagel May 14, 2026

Choose a reason for hiding this comment

Uh oh!

lfoppiano commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

lfoppiano commented May 4, 2026 •

edited

Loading

sebastian-nagel commented May 12, 2026 •

edited

Loading

lfoppiano commented May 12, 2026 •

edited

Loading