Skip to content

Make regular expression to extract URLs from CSS more restrictive#63

Merged
MohammedElsayyed merged 3 commits into
iipc:masterfrom
sebastian-nagel:CssExtractLongRunner
Feb 1, 2017
Merged

Make regular expression to extract URLs from CSS more restrictive#63
MohammedElsayyed merged 3 commits into
iipc:masterfrom
sebastian-nagel:CssExtractLongRunner

Conversation

@sebastian-nagel

Copy link
Copy Markdown
Collaborator

(allow only ", ', \" or \' in front of or after the URL).
Avoid long-runners when matching the regex due to heavy back-tracking.

See commoncrawl#2 for more details.

(allow only `"`, `'`, `\"` or `\'` in front of or after the URL).
Avoid long-runners when matching the regex due to heavy back-tracking.
@kris-sigur kris-sigur added this to the 1.2.0 Release milestone Aug 8, 2016
@MohammedElsayyed

Copy link
Copy Markdown
Member

I added this test case

"url(''i261.photobucket.com'')","i261.photobucket.com"

inside ExtractingParseObserverTest.testHandleStyleNode, but it didn't work as expected:

expected:<i261.photobucket.com> but was:<'i261.photobucket.com'>
junit.framework.ComparisonFailure
	at junit.framework.Assert.assertEquals(Assert.java:81)
	at junit.framework.Assert.assertEquals(Assert.java:87)
	at org.archive.resource.html.ExtractingParseObserverTest.checkExtract(ExtractingParseObserverTest.java:91)
	at org.archive.resource.html.ExtractingParseObserverTest.testHandleStyleNode(ExtractingParseObserverTest.java:52)

Do you think you could double-check it and let me know your findings? I might be mistaken.

@sebastian-nagel

Copy link
Copy Markdown
Collaborator Author

Hi @MohammedElsayyed,
good point! The CSS spec does only allow one quotation mark (single or double). However, the parser should be able to extract also multiply quoted CSS links. That there are still quotes around the URL is not a problem of the extraction pattern but of the cleansing code which removes only one pair of quotes. I'll try to improve the code. Thanks!

- clip multiple quotation marks
Fix StringIndexOutOfBoundsException in patternCSSExtract
- correct check for min. required URL lenght when stripping 4 characters (2 at each end)
- simplified code, use non-capturing groups in regular expression
@sebastian-nagel

Copy link
Copy Markdown
Collaborator Author

Hi @MohammedElsayyed, your test case "url(''i261.photobucket.com'')","i261.photobucket.com" should now be covered, I've added a similar one. The pr includes #57 which also affects the stripping of quote characters.

@MohammedElsayyed

Copy link
Copy Markdown
Member

It worked perfectly normal. But I am not sure about the following test case:

"url('foo.gif'')","foo.gif"

If it could happen, then it should be handled. I included it because I noticed that # of apostrophes before and after URL is not the same as stated in the main issue commoncrawl/ia-web-commons#2

body {
     background-image: url('''... (524288 apostrophes in total!) ...'''http://i261.photobucket.com/albums/ii67/iglup/l10/1280/foto_74967ac2.jpg'''... (515996 apostrophes in total!) ...'''

@sebastian-nagel

Copy link
Copy Markdown
Collaborator Author

Stripping of quotes was done pairwise (one leading, one trailing quote) before. It would be even easier to strip leading and trailing quotes independent whether they are paired. I'll update the pull request. Thanks!

@sebastian-nagel

Copy link
Copy Markdown
Collaborator Author

Also unpaired quotation marks are now stripped. Thanks!

sebastian-nagel added a commit to commoncrawl/ia-web-commons that referenced this pull request Jan 24, 2017
@ldko ldko added the accepted label Jan 27, 2017
@MohammedElsayyed

Copy link
Copy Markdown
Member

Thank you, Mr. @sebastian-nagel. It worked as expected.
Well done!

@MohammedElsayyed MohammedElsayyed merged commit 08c1329 into iipc:master Feb 1, 2017
@sebastian-nagel sebastian-nagel deleted the CssExtractLongRunner branch October 18, 2024 14:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants