Make regular expression to extract URLs from CSS more restrictive #63
(allow only `"`, `'`, `\"` or `\'` in front of or after the URL). Avoid long-runners when matching the regex due to heavy back-tracking.
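
A minimal sketch of what such a restrictive pattern might look like. This is not the pattern actually used in the pull request; the class name `CssUrlSketch`, the method `extractUrls` and the exact character classes are assumptions made for illustration. It accepts at most one `"`, `'`, `\"` or `\'` on each side of the URL and uses possessive quantifiers so that a failed match cannot trigger heavy back-tracking.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the project's code base.
final class CssUrlSketch {
    // Optional quote in front of or after the URL: ", ', \" or \'
    // (non-capturing group, at most one occurrence on each side).
    private static final String QUOTE = "(?:\\\\?[\"'])?";

    // The URL itself must not contain quotes, backslashes, parentheses or
    // whitespace. The possessive quantifiers (*+ and ++) never give back
    // characters, which keeps the matching time linear on malformed input.
    private static final Pattern CSS_URL = Pattern.compile(
            "url\\(\\s*+" + QUOTE + "([^\"'\\\\()\\s]++)" + QUOTE + "\\s*+\\)",
            Pattern.CASE_INSENSITIVE);

    /** Returns every URL found in a url(...) expression of the given CSS text. */
    static List<String> extractUrls(String css) {
        List<String> urls = new ArrayList<>();
        Matcher m = CSS_URL.matcher(css);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }
}
```

With this sketch, input such as `url('''x.png''')` is rejected by the pattern itself rather than matched, so any further clean-up of repeated quotes would have to happen in a separate step.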
|
I added this test case
inside `ExtractingParseObserverTest.testHandleStyleNode`, but it didn't work as expected. Could you double-check it and let me know your findings? I might be mistaken. |
|
Hi @MohammedElsayyed, |
- clip multiple quotation marks
Fix StringIndexOutOfBoundsException in patternCSSExtract
- correct check for min. required URL length when stripping 4 characters (2 at each end)
- simplified code, use non-capturing groups in regular expression
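
The length check mentioned in the commit message can be illustrated roughly as follows. This is only a sketch under assumptions: the class `EscapedQuoteClipSketch` and the method `clipEscapedQuotes` are hypothetical names, and the real code in `patternCSSExtract` handling may differ. The point is that removing 2 characters at each end is only safe when the string has at least 4 characters.

```java
// Hypothetical helper, not part of the project's code base.
final class EscapedQuoteClipSketch {
    /** Strips one escaped quote (\" or \') from each end if both are present. */
    static String clipEscapedQuotes(String s) {
        boolean leading = s.startsWith("\\\"") || s.startsWith("\\'");
        boolean trailing = s.endsWith("\\\"") || s.endsWith("\\'");
        // A 2-character string such as \" both starts and ends with an escaped
        // quote. Without the length check, substring(2, 0) would throw a
        // StringIndexOutOfBoundsException.
        if (s.length() >= 4 && leading && trailing) {
            return s.substring(2, s.length() - 2);
        }
        return s;
    }
}
```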
|
Hi @MohammedElsayyed, your test case |
|
It worked perfectly well. But I am not sure about the following test case:
If it could happen, then it should be handled. I included it because I noticed that the number of apostrophes before and after the URL is not the same, as stated in the main issue commoncrawl/ia-web-commons#2 |
|
Stripping of quotes was done pairwise (one leading, one trailing quote) before. It would be even easier to strip leading and trailing quotes independently of whether they are paired. I'll update the pull request. Thanks! |
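
Stripping both ends independently could look roughly like the sketch below. The names `QuoteStripSketch` and `stripQuotes` are hypothetical, and treating a backslash as part of the quote run (to cover `\"` and `\'`) is an assumption, not something confirmed by the pull request.

```java
// Hypothetical helper, not part of the project's code base.
final class QuoteStripSketch {
    /** Removes any run of quote characters from each end, paired or not. */
    static String stripQuotes(String s) {
        int start = 0;
        int end = s.length();
        // Skip leading quote characters.
        while (start < end && isQuoteChar(s.charAt(start))) {
            start++;
        }
        // Skip trailing quote characters, never moving past 'start'.
        while (end > start && isQuoteChar(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(start, end);
    }

    // Assumption: a backslash counts as part of an escaped quote (\" or \').
    private static boolean isQuoteChar(char c) {
        return c == '"' || c == '\'' || c == '\\';
    }
}
```

Because the two loops run independently and the second one stops at `start`, an input with two leading apostrophes and one trailing apostrophe is handled the same way as a balanced one, and an input consisting only of quotes yields an empty string instead of an exception.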
|
Unpaired quotation marks are now stripped as well. Thanks! |
merged improvements from iipc#63
|
Thank you, Mr. @sebastian-nagel. It worked as expected. |
(allow only `"`, `'`, `\"` or `\'` in front of or after the URL). Avoid long-runners when matching the regex due to heavy back-tracking.
See commoncrawl#2 for more details.