
BasicURLCanonicalizer: more efficient normalization of dots in host name #129

Merged: ato merged 2 commits into iipc:master from sebastian-nagel:host-normalize-dots-speedup on Nov 14, 2025

@sebastian-nagel

Collaborator

Replaces a chain of String.replaceAll(...) calls with a dedicated method. This is more code, but it runs faster because it avoids unnecessary work.
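To illustrate the idea, here is a minimal sketch contrasting a regex chain with a single-pass method. The class name `HostDotsSketch`, both method names, and the exact regular expressions are assumptions made for this example; they are not taken from the BasicURLCanonicalizer source, and the sketch assumes the normalization strips leading and trailing dots and collapses runs of dots.

```java
public class HostDotsSketch {

    // Hypothetical regex chain (assumed for illustration): each call
    // compiles a pattern and scans the whole string again.
    static String normalizeWithRegex(String host) {
        return host.replaceAll("^\\.+", "")     // strip leading dots
                   .replaceAll("\\.+$", "")     // strip trailing dots
                   .replaceAll("\\.{2,}", "."); // collapse runs of dots
    }

    // Hypothetical dedicated method: bounded trimming plus one pass
    // over the remaining characters, with no regex involved.
    static String normalizeDots(String host) {
        int start = 0, end = host.length();
        while (start < end && host.charAt(start) == '.') {
            start++;
        }
        while (end > start && host.charAt(end - 1) == '.') {
            end--;
        }
        StringBuilder sb = new StringBuilder(end - start);
        boolean prevDot = false;
        for (int i = start; i < end; i++) {
            char c = host.charAt(i);
            if (c == '.' && prevDot) {
                continue; // skip the second and later dots of a run
            }
            sb.append(c);
            prevDot = (c == '.');
        }
        // Nothing was dropped: hand back the original instance.
        return sb.length() == host.length() ? host : sb.toString();
    }

    public static void main(String[] args) {
        String[] hosts = {"example.com", "..www..example.com.", "..."};
        for (String h : hosts) {
            System.out.println(normalizeWithRegex(h).equals(normalizeDots(h)));
        }
    }
}
```

For ordinary host names without stray dots, the single-pass version does no pattern compilation and returns the input string unchanged, which is presumably where most of the saving comes from.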

}
int start = 0, end = host.length();
boolean changed = false;
while (host.charAt(start) == '.') {

Member


We should probably guard the loops to prevent a StringIndexOutOfBoundsException on input like "." or "...". Obviously a proper URL isn't going to have an all-dots host, but someone could be using the canonicalizer on bogus extracted links.

Suggested change
while (host.charAt(start) == '.') {
while (start < end && host.charAt(start) == '.') {

Collaborator Author


Thanks, good catch!

start++;
changed = true;
}
while (host.charAt(end - 1) == '.') {

Member


Here too:

Suggested change
while (host.charAt(end - 1) == '.') {
while (end > start && host.charAt(end - 1) == '.') {
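The effect of the two guards can be checked with a small standalone sketch. The class `TrimDotsSketch` and its method names are hypothetical, and the body of the second loop (`end--; changed = true;`) is an assumption, since only the first loop body is visible in the diff.

```java
public class TrimDotsSketch {

    // Unguarded loops as first proposed: on an all-dots host, start runs
    // past the end of the string and charAt throws.
    static String trimDotsUnguarded(String host) {
        int start = 0, end = host.length();
        while (host.charAt(start) == '.') {
            start++;
        }
        while (host.charAt(end - 1) == '.') {
            end--;
        }
        return host.substring(start, end);
    }

    // Guarded loops following the two suggested changes.
    static String trimDots(String host) {
        int start = 0, end = host.length();
        boolean changed = false;
        while (start < end && host.charAt(start) == '.') {
            start++;
            changed = true;
        }
        while (end > start && host.charAt(end - 1) == '.') {
            end--; // assumed loop body, not shown in the diff
            changed = true;
        }
        // Only allocate a new string if something was trimmed.
        return changed ? host.substring(start, end) : host;
    }

    public static void main(String[] args) {
        for (String h : new String[] {".", "...", ".example.com.", "example.com"}) {
            System.out.println("[" + h + "] -> [" + trimDots(h) + "]");
        }
        try {
            trimDotsUnguarded("...");
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("unguarded version throws on all-dots input");
        }
    }
}
```

With the guards in place, an all-dots host yields an empty string instead of an exception; whether the caller should treat an empty host as invalid is a separate decision that this sketch does not address.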

Add unit test and prevent from StringIndexOutOfBoundsException.
@sebastian-nagel

Collaborator Author

Thanks, @ato. The StringIndexOutOfBoundsException is fixed. Unit test added.

@ato merged commit 3881951 into iipc:master on Nov 14, 2025
5 checks passed