Commit 116da86

Author: Markus Jelsma
NUTCH-1011 Remove duplicate slashes from URLs
git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1143468 13f79535-47bb-0310-9956-ffa450edef68
1 parent 5663efb commit 116da86

3 files changed

Lines changed: 17 additions & 0 deletions

CHANGES.txt

Lines changed: 2 additions & 0 deletions
@@ -2,6 +2,8 @@ Nutch Change Log
 
 Release 2.0 - Current Development
 
+* NUTCH-1011 Normalize duplicate slashes in URL's (markus)
+
 * NUTCH-1013 Migrate RegexURLNormalizer from Apache ORO to java.util.regex (markus)
 
 * NUTCH-1016 Strip UTF-8 non-character codepoints and add logging for SolrWriter (markus)

conf/regex-normalize.xml.template

Lines changed: 6 additions & 0 deletions
@@ -63,4 +63,10 @@
   <substitution></substitution>
 </regex>
 
+<!-- removes duplicate slashes -->
+<regex>
+  <pattern>(?&lt;!:)/{2,}</pattern>
+  <substitution>/</substitution>
+</regex>
+
 </regex-normalize>
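
The new rule can be tried outside Nutch. The following is a minimal sketch, not Nutch's own code: it applies the same pattern directly with `java.util.regex`, with `&lt;` unescaped to `<` as an XML parser would do. The class name `DuplicateSlashDemo` and the method `collapseSlashes` are hypothetical names chosen for illustration. The negative lookbehind `(?<!:)` is what keeps the `//` after the scheme from being collapsed.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; it does not use any Nutch API.
public class DuplicateSlashDemo {

    // Same pattern as in regex-normalize.xml.template, XML-unescaped:
    // two or more slashes that are not directly preceded by a colon.
    private static final Pattern DUPLICATE_SLASHES = Pattern.compile("(?<!:)/{2,}");

    // Replaces every matched run of slashes with a single slash.
    static String collapseSlashes(String url) {
        return DUPLICATE_SLASHES.matcher(url).replaceAll("/");
    }

    public static void main(String[] args) {
        // Input taken from the unit test added in this commit.
        String input = "http://www.example.org//path/to//somewhere.html";
        System.out.println(collapseSlashes(input));
        // prints: http://www.example.org/path/to/somewhere.html
    }
}
```

The pattern is purely textual, so it would also collapse repeated slashes that appear in a query string or fragment; whether that is desirable depends on the crawl.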

src/test/org/apache/nutch/net/TestURLNormalizers.java

Lines changed: 9 additions & 0 deletions
@@ -39,6 +39,15 @@ public void testURLNormalizers() {
     } catch (MalformedURLException mue) {
       fail(mue.toString());
     }
+
+    // NUTCH-1011 - Get rid of superfluous slashes
+    try {
+      String normalizedSlashes = normalizers.normalize("http://www.example.org//path/to//somewhere.html", URLNormalizers.SCOPE_DEFAULT);
+      assertEquals(normalizedSlashes, "http://www.example.org/path/to/somewhere.html");
+    } catch (MalformedURLException mue) {
+      fail(mue.toString());
+    }
+
     // check the order
     int pos1 = -1, pos2 = -1;
     URLNormalizer[] impls = normalizers.getURLNormalizers(URLNormalizers.SCOPE_DEFAULT);
