forked from apache/nutch
-
Notifications
You must be signed in to change notification settings - Fork 3
Add URL normalisation when writing WARC Records #54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
lfoppiano
wants to merge
16
commits into
cc
Choose a base branch
from
bugfix/robots-txt-redirect-url-normalization
base: cc
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+335
−2
Open
Changes from all commits
Commits
Show all changes
16 commits
Select commit
Hold shift + click to select a range
fc28d7f
fix: apply URL normalisation for robots.txt redirect
lfoppiano 5ef314c
fix: exit strategy
lfoppiano d18e8f5
feat: add unit tests on the normalization method
lfoppiano f0227a2
Revert "fix: exit strategy"
lfoppiano 26e3a19
Revert "fix: apply URL normalisation for robots.txt redirect"
lfoppiano 2a297bf
fix: cleanup
lfoppiano 85e67a7
fix: remove assertion on the utility class for reading segments
lfoppiano c08944d
feat: add a test to simulate the required behaviour
lfoppiano 85698d4
tests: cover also request and metadata
lfoppiano caa24b2
fix: target the right WARC field
lfoppiano b41a29c
feat: add tests with normalization using HttpUrl
lfoppiano bbc1bb5
feat: store stringified URL
lfoppiano f9fa3f1
fix: missing import
lfoppiano a5a4394
fix: disable tests related to a part that may be out of scope in this PR
lfoppiano d706ed6
Revert "feat: store stringified URL"
lfoppiano 31a2324
Revert "fix: missing import"
lfoppiano File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
171 changes: 171 additions & 0 deletions
171
...col-okhttp/src/test/org/apache/nutch/protocol/okhttp/TestOkHttpPunyCodeNormalization.java
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,171 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one or more | ||
| * contributor license agreements. See the NOTICE file distributed with | ||
| * this work for additional information regarding copyright ownership. | ||
| * The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| * (the "License"); you may not use this file except in compliance with | ||
| * the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
| package org.apache.nutch.protocol.okhttp; | ||
|
|
||
| import okhttp3.HttpUrl; | ||
| import okhttp3.Interceptor; | ||
| import okhttp3.OkHttpClient; | ||
| import okhttp3.Request; | ||
| import okhttp3.Response; | ||
| import org.junit.jupiter.api.Test; | ||
|
|
||
| import java.io.IOException; | ||
| import java.net.IDN; | ||
| import java.util.concurrent.TimeUnit; | ||
| import java.util.concurrent.atomic.AtomicReference; | ||
|
|
||
| import static org.junit.jupiter.api.Assertions.*; | ||
|
|
||
| /** | ||
| * Tests for how OkHttp parses and normalizes hosts in three forms: | ||
| * - Unicode (e.g. "https://🧠.s.country/...") | ||
| * - Percent-encoded UTF-8 (e.g. "https://%F0%9F%A7%A0.s.country/...") | ||
| * - Punycode / ACE (e.g. "https://xn--nv8h.s.country/...") | ||
| */ | ||
| public class TestOkHttpPunyCodeNormalization { | ||
|
|
||
| // U+1F9E0 BRAIN | ||
| private static final String BRAIN_UNICODE = "🧠"; | ||
| private static final String BRAIN_PCT_UTF8 = "%F0%9F%A7%A0"; | ||
| private static final String BRAIN_PUNYCODE = "xn--qv9h"; | ||
|
|
||
| private static final String PARENT = ".s.country"; | ||
| private static final String PATH = "/p/human-protocol-aligning-hearts-bots"; | ||
|
|
||
|
|
||
| @Test | ||
| public void testOkHttpVersion() { | ||
| // Just for mental sanity, will be removed | ||
| assertEquals("5.3.2", okhttp3.OkHttp.VERSION); | ||
| } | ||
|
|
||
| @Test | ||
| public void unicodeHostNormalizesToPunycode() { | ||
| HttpUrl url = HttpUrl.parse("https://" + BRAIN_UNICODE + PARENT + PATH); | ||
| assertNotNull(url, "HttpUrl.parse must accept Unicode host"); | ||
| assertEquals(BRAIN_PUNYCODE + PARENT, url.host()); | ||
| } | ||
|
|
||
| @Test | ||
| public void percentEncodedHostNormalizesToPunycode() { | ||
| // This is the CC WARC-Target-URI form. The question: does OkHttp | ||
| // decode the percent-escapes in the host and IDN-normalize, or | ||
| // does it leave them as literal characters / mis-normalize? | ||
| HttpUrl url = HttpUrl.parse("https://" + BRAIN_PCT_UTF8 + PARENT + PATH); | ||
| assertNotNull(url, "HttpUrl.parse must accept percent-encoded host"); | ||
| assertEquals( | ||
| BRAIN_PUNYCODE + PARENT, url.host(), "Percent-encoded UTF-8 host must normalize to Punycode for the SAME emoji"); | ||
| } | ||
|
|
||
| @Test | ||
| public void punycodeHostPassesThrough() { | ||
| HttpUrl url = HttpUrl.parse("https://" + BRAIN_PUNYCODE + PARENT + PATH); | ||
| assertNotNull(url); | ||
| assertEquals(BRAIN_PUNYCODE + PARENT, url.host()); | ||
| } | ||
|
|
||
| @Test | ||
| public void allThreeFormsProduceEquivalentHost() { | ||
| HttpUrl uni = HttpUrl.parse("https://" + BRAIN_UNICODE + PARENT + PATH); | ||
| HttpUrl pct = HttpUrl.parse("https://" + BRAIN_PCT_UTF8 + PARENT + PATH); | ||
| HttpUrl ace = HttpUrl.parse("https://" + BRAIN_PUNYCODE + PARENT + PATH); | ||
| assertNotNull(uni); | ||
| assertNotNull(pct); | ||
| assertNotNull(ace); | ||
| assertEquals(uni.host(), pct.host()); | ||
| assertEquals(pct.host(), ace.host()); | ||
| } | ||
|
|
||
| @Test | ||
| public void pathIsNotMangledByHostNormalization() { | ||
| // Sanity: percent-decoding the host must not bleed into the path. | ||
| HttpUrl url = HttpUrl.parse("https://" + BRAIN_PCT_UTF8 + PARENT + PATH); | ||
| assertNotNull(url); | ||
| assertEquals(PATH, url.encodedPath()); | ||
| } | ||
|
|
||
| @Test | ||
| public void javaIdnAgreesWithOkHttp() { | ||
| // Cross-check OkHttp's host() output against the JDK's IDN.toASCII() | ||
| // so we know which spec OkHttp is following. | ||
| String jdk = IDN.toASCII(BRAIN_UNICODE + PARENT, IDN.ALLOW_UNASSIGNED); | ||
| HttpUrl url = HttpUrl.parse("https://" + BRAIN_UNICODE + PARENT + PATH); | ||
| assertNotNull(url); | ||
| assertEquals(jdk, url.host()); | ||
| } | ||
|
|
||
|
|
||
| @Test | ||
| public void hostHeaderMatchesNormalizedHost() throws IOException { | ||
| // Build a request and intercept it BEFORE it hits the network, so | ||
| // we can read the exact Host header OkHttp would send. We use an | ||
| // application interceptor that short-circuits with a synthetic | ||
| // response — no actual DNS / TCP needed. | ||
| AtomicReference<String> seenHost = new AtomicReference<>(); | ||
| AtomicReference<String> seenUrl = new AtomicReference<>(); | ||
|
|
||
| Interceptor capture = chain -> { | ||
| Request req = chain.request(); | ||
| seenHost.set(req.header("Host") != null | ||
| ? req.header("Host") | ||
| : req.url().host()); // OkHttp adds Host at the network layer | ||
| seenUrl.set(req.url().toString()); | ||
| return new Response.Builder() | ||
| .request(req) | ||
| .protocol(okhttp3.Protocol.HTTP_1_1) | ||
| .code(204) | ||
| .message("No Content (synthetic)") | ||
| .build(); | ||
| }; | ||
|
|
||
| OkHttpClient client = new OkHttpClient.Builder() | ||
| .addInterceptor(capture) | ||
| .callTimeout(2, TimeUnit.SECONDS) | ||
| .build(); | ||
|
|
||
| String input = "https://" + BRAIN_PCT_UTF8 + PARENT + PATH; | ||
| Request req = new Request.Builder().url(input).head().build(); | ||
| try (Response r = client.newCall(req).execute()) { | ||
| assertEquals(204, r.code()); | ||
| } | ||
|
|
||
| assertEquals( | ||
| BRAIN_PUNYCODE + PARENT, seenHost.get(), | ||
| "Effective host derived from a percent-encoded UTF-8 input must be the matching Punycode"); | ||
| } | ||
|
|
||
| // -- Mismatch detector (the CC bug, reproduced if it triggers) ----------- | ||
|
|
||
| @Test | ||
| public void parsedHostMustMatchOriginalEmoji() { | ||
| // If this ever fails, OkHttp itself is producing a host that | ||
| // disagrees with the input — which would be the CC WARC bug | ||
| // happening inside OkHttp. Currently expected to pass. | ||
| String[] inputs = { | ||
| "https://" + BRAIN_UNICODE + PARENT + PATH, | ||
| "https://" + BRAIN_PCT_UTF8 + PARENT + PATH, | ||
| "https://" + BRAIN_PUNYCODE + PARENT + PATH, | ||
| }; | ||
| for (String s : inputs) { | ||
| HttpUrl u = HttpUrl.parse(s); | ||
| assertNotNull(u, "parse failed for " + s); | ||
| assertTrue( | ||
| u.host().startsWith(BRAIN_PUNYCODE + "."), | ||
| "Host for " + s + " was " + u.host() + ", expected to contain " + BRAIN_PUNYCODE); | ||
| } | ||
| } | ||
| } | ||
42 changes: 42 additions & 0 deletions
42
...test/org/apache/nutch/protocol/okhttp/TestOkHttpRobotsTxtInvalidSlashesNormalization.java
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,42 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one or more | ||
| * contributor license agreements. See the NOTICE file distributed with | ||
| * this work for additional information regarding copyright ownership. | ||
| * The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| * (the "License"); you may not use this file except in compliance with | ||
| * the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
| package org.apache.nutch.protocol.okhttp; | ||
|
|
||
| import okhttp3.*; | ||
| import org.junit.jupiter.api.Test; | ||
|
|
||
| import java.io.IOException; | ||
| import java.net.IDN; | ||
| import java.util.concurrent.TimeUnit; | ||
| import java.util.concurrent.atomic.AtomicReference; | ||
|
|
||
| import static org.junit.jupiter.api.Assertions.*; | ||
|
|
||
| /** | ||
| * Tests for how OkHttp parses and normalizes hosts in three forms: | ||
| */ | ||
| public class TestOkHttpRobotsTxtInvalidSlashesNormalization { | ||
|
|
||
| @Test | ||
| public void unicodeHostNormalizesToPunycode() { | ||
| HttpUrl url = HttpUrl.parse("https:////sites.google.com/bao"); | ||
| assertNotNull(url, "HttpUrl.parse must accept Unicode host"); | ||
| assertEquals("sites.google.com", url.host()); | ||
| } | ||
|
|
||
|
|
||
| } |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Binary file added
BIN
+288 Bytes
src/testresources/test-segments/20260505091103-malformed-urls/content/part-r-00000/.data.crc
Binary file not shown.
Binary file added
BIN
+12 Bytes
...testresources/test-segments/20260505091103-malformed-urls/content/part-r-00000/.index.crc
Binary file not shown.
Binary file added
BIN
+34.9 KB
src/testresources/test-segments/20260505091103-malformed-urls/content/part-r-00000/data
Binary file not shown.
Binary file added
BIN
+225 Bytes
src/testresources/test-segments/20260505091103-malformed-urls/content/part-r-00000/index
Binary file not shown.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Smart solution to emulate requests without relying on a network connection.