Fix URL decoding for robots.txt by lfoppiano · Pull Request #58 · commoncrawl/cc-index-table

lfoppiano · 2026-04-29T19:24:28Z

This PR addresses the problem found in the April Crawl for processing robots.txt URLs that resulted in an empty/null hostname.

Background information: The target URLs of robots.txt redirects are not passed to URL filters before the redirects are followed. RFC 9309 only asks crawlers to follow at least five consecutive redirects and does not require or even recommend normalization of redirect targets. As a consequence, the robots.txt URLs stemming from redirects may have an unexpected form.
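To illustrate why such redirect targets can end up without a hostname, here is a minimal sketch using `java.net.URI` from the JDK. The class name `RedirectTargetHostDemo` and the sample URLs are assumptions made for illustration; whether `WarcUri` parses URLs this way is not confirmed by this PR text.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class, not part of cc-index-table.
final class RedirectTargetHostDemo {

    // Parse the URL and report which parts java.net.URI sees as host and path.
    static String describe(String url) {
        try {
            URI uri = new URI(url);
            return "host=" + uri.getHost() + " path=" + uri.getPath();
        } catch (URISyntaxException e) {
            return "unparsable: " + e.getReason();
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.google.com/robots.txt",   // well-formed
            "http:////www.google.com/robots.txt", // extra slashes after the scheme
            "http:/www.google.com/robots.txt"     // single slash after the scheme
        };
        for (String url : samples) {
            System.out.println(url + " -> " + describe(url));
        }
    }
}
```

For the two malformed samples the parser sees either an empty authority or none at all, so `getHost()` returns `null` and the intended hostname ends up in the path.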

Copilot

Pull request overview

This PR aims to prevent empty/null hostnames when parsing malformed robots.txt-style URLs in WARC target URIs by normalizing malformed http(s) scheme slashes and adding regression tests.

Changes:

- Normalize malformed `http(s):////...` inputs before parsing in `WarcUri`.
- Add JUnit tests covering malformed scheme slashes and extra path slashes.
- Ignore IntelliJ project metadata via `.gitignore`.
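The first change in the list above could look roughly like the following sketch. It is not the code from `WarcUri`: the class name `SchemeSlashSketch`, the helper names, and the regular expression are assumptions chosen to show the idea of collapsing any run of slashes after an `http` or `https` scheme into exactly two.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Pattern;

// Hypothetical sketch, not the actual WarcUri implementation.
final class SchemeSlashSketch {

    // "http:" or "https:" at the start of the string, followed by one or more slashes.
    private static final Pattern SCHEME_SLASHES =
            Pattern.compile("^(https?):/+", Pattern.CASE_INSENSITIVE);

    // Rewrite the scheme prefix to "<scheme>://"; slashes inside the path are left alone.
    static String normalizeSchemeSlashes(String url) {
        return SCHEME_SLASHES.matcher(url).replaceFirst("$1://");
    }

    // Return the host as seen by java.net.URI, or null if parsing fails.
    static String host(String url) {
        try {
            return new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String malformed = "http:////www.google.com/robots.txt";
        System.out.println(normalizeSchemeSlashes(malformed));
        System.out.println(host(normalizeSchemeSlashes(malformed)));
    }
}
```

Because the pattern is anchored at the start, a URL such as `http://www.google.com//robots.txt` keeps its doubled path slash; only the scheme separator is repaired.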

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| `src/main/java/org/commoncrawl/net/WarcUri.java` | Normalizes malformed `http(s)` URLs before parsing; adds warning log on parse failure. |
| `src/test/java/org/commoncrawl/net/WarcUriTest.java` | Adds regression tests for hostname extraction from malformed/oddly-slashed URLs. |
| `.gitignore` | Adds IntelliJ IDEA project directory ignore entry. |

sebastian-nagel

Thanks, @lfoppiano. See inline comments.

lfoppiano · 2026-05-02T21:21:43Z

@sebastian-nagel I've tightened the normalization so that it is applied only if the hostname is blank. There is a comment on the test getHostNameMalformedHttpShouldNotBeEmpty: "This is obsolete: if it's expected to be "www.google.com" it cannot be empty." I'm not sure I understand it. I might have forgotten how it was working before, though.
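A minimal sketch of the "only if the hostname is blank" behaviour described in this comment, assuming a parse-first, repair-second order. The class `BlankHostFallbackSketch`, its method names, and the regular expression are hypothetical; the merged code in `WarcUri` may be structured differently.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Pattern;

// Hypothetical sketch of a blank-host fallback, not the merged WarcUri code.
final class BlankHostFallbackSketch {

    // "http:" or "https:" at the start, followed by one or more slashes.
    private static final Pattern SCHEME_SLASHES =
            Pattern.compile("^(https?):/+", Pattern.CASE_INSENSITIVE);

    // Return the host as seen by java.net.URI, or null if parsing fails.
    static String parseHost(String url) {
        try {
            return new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    static String hostName(String url) {
        String host = parseHost(url);
        if (host != null && !host.trim().isEmpty()) {
            return host; // the URL parsed fine, so it is never rewritten
        }
        // Only now repair the scheme slashes and parse a second time.
        String repaired = SCHEME_SLASHES.matcher(url).replaceFirst("$1://");
        return parseHost(repaired);
    }

    public static void main(String[] args) {
        System.out.println(hostName("http://example.com//a//b"));
        System.out.println(hostName("https:/example.com/robots.txt"));
    }
}
```

The design point is that well-formed URLs take the fast path and are returned exactly as parsed; the rewrite only runs for inputs that would otherwise yield no hostname.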

sebastian-nagel · 2026-05-04T19:56:34Z

> comment in test getHostNameMalformedHttpShouldNotBeEmpty: "This is obsolete: if it's expected to be "www.google.com" it cannot be empty." I'm not sure I understand it. I might have forgotten how it was working before, though.

I meant: if there's an assertion that a string equals "xyz", it cannot be empty. One assertion is enough, which means less code but not fewer tests.
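The point about redundant assertions can be shown with a dependency-free sketch. The `assertEquals` helper below is a stand-in written for this example so that it runs without JUnit; it is an assumption, not the project's test code.

```java
import java.util.Objects;

// Hypothetical sketch; assertEquals here is a local stand-in for JUnit's method.
final class SingleAssertionSketch {

    static void assertEquals(String expected, String actual) {
        if (!Objects.equals(expected, actual)) {
            throw new AssertionError("expected <" + expected + "> but was <" + actual + ">");
        }
    }

    // True if the single equality assertion accepts the given host.
    static boolean passes(String actualHost) {
        try {
            assertEquals("www.google.com", actualHost);
            return true;
        } catch (AssertionError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(passes("www.google.com")); // true
        System.out.println(passes(""));               // false
        System.out.println(passes(null));             // false
    }
}
```

An empty or null host already fails the equality check, so a separate "not empty" assertion would not catch any additional case.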

sebastian-nagel

Thanks. Looks good.

lfoppiano added 5 commits April 29, 2026 19:45

fix: patch the HostName URL to avoid empty Host

9515fb4

feat: add more tests

98fa960

fix: add logging for WARC URI parsing errors

083383f

fix: messages

56e0406

chore: spotless

9d3bf11

lfoppiano requested review from Copilot and sebastian-nagel April 29, 2026 19:38

Copilot started reviewing on behalf of lfoppiano April 29, 2026 19:38

Copilot AI reviewed Apr 29, 2026

lfoppiano added 3 commits April 29, 2026 21:08

fix: robots.txt

fb0605b

feat: use more efficient approach

c8dd879

fix: forgotten Apache 2 header

309bb5b

sebastian-nagel requested changes Apr 30, 2026

sebastian-nagel mentioned this pull request Apr 30, 2026

WARC writer: improved verification or normalization of URIs used as WARC-Target-URI commoncrawl/nutch#53

Open

fix: normalize slashed only when the hostname is blank

6d64095

lfoppiano requested a review from sebastian-nagel May 2, 2026 21:21

lfoppiano added 5 commits May 2, 2026 22:36

fix: removed non-direct dependencies

2cc0c67

test: fix broken test

81f5815

fix: also URLs with a single /

22deb33

fix: also URLs with a single / and add tests

000b3a3

test: add test with a different schema

52252f2

lfoppiano mentioned this pull request May 4, 2026

Add URL normalisation when writing WARC Records commoncrawl/nutch#54

Open

sebastian-nagel approved these changes May 4, 2026

test: simplify

8447fcd

lfoppiano merged commit cfa1554 into main May 5, 2026
7 checks passed

lfoppiano deleted the bugfix/robot-txt-url branch May 5, 2026 06:50

lfoppiano merged 15 commits into main from bugfix/robot-txt-url

lfoppiano commented Apr 29, 2026 · edited by sebastian-nagel

3 participants