One WARC file (s3://commoncrawl/crawl-data/CC-MAIN-2016-30/segments/1469258944256.88/warc/CC-MAIN-20160723072904-00218-ip-10-185-27-174.ec2.internal.warc.gz) of the July crawl causes the WEATGenerator to hang for hours. The hang occurs while processing record 91422 (according to the job counters, 91421 records had already been processed):
...
2016-08-04 16:47:09,085 INFO [main] org.archive.hadoop.jobs.WEATGenerator: Start: s3a://commoncrawl/crawl-data/CC-MAIN-2016-30/segments/1469258944256.88/warc/CC-MAIN-20160723072904-00218-ip-10-185-27-174.ec2.internal.warc.gz
2016-08-04 16:47:09,086 INFO [main] org.archive.hadoop.jobs.WEATGenerator: About to write out to s3a://commoncrawl/crawl-data/CC-MAIN-2016-30/segments/1469258944256.88/wat/CC-MAIN-20160723072904-00218-ip-10-185-27-174.ec2.internal.warc.wat.gz and s3a://commoncrawl/crawl-data/CC-MAIN-2016-30/segments/1469258944256.88/wet/CC-MAIN-20160723072904-00218-ip-10-185-27-174.ec2.internal.warc.wet.gz
...
2016-08-04 16:55:10,741 INFO [main] org.archive.hadoop.jobs.WEATGenerator: Outputting new record 91000
Attaching to the task JVM several times over a period of 2 hours shows the following stack each time (only the calls inside java.util.regex vary):
at java.util.regex.Matcher.find(Matcher.java:592)
at org.archive.resource.html.ExtractingParseObserver.patternCSSExtract(ExtractingParseObserver.java:417)
at org.archive.resource.html.ExtractingParseObserver.handleStyleNode(ExtractingParseObserver.java:201)
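
The stack above suggests the time is spent inside a single Matcher.find() call, which is what a regex with heavy backtracking on a large or unusual style block would look like. One possible mitigation, shown here only as a minimal sketch, is to wrap the matcher input in a CharSequence that checks a deadline on every character access, so a runaway match is aborted instead of blocking the task. The class names BoundedRegex, DeadlineCharSequence and RegexTimeoutException, the timeout value, and the url(...) pattern are all hypothetical; the actual pattern used in patternCSSExtract is not shown in this report.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of the WEATGenerator code base. */
public class BoundedRegex {

    /** Thrown when a match runs past its deadline. */
    public static class RegexTimeoutException extends RuntimeException {
        public RegexTimeoutException(String message) {
            super(message);
        }
    }

    /** CharSequence wrapper that checks a deadline on every character access. */
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            // The regex engine reads its input through charAt, so a
            // backtracking match keeps passing through this check.
            if (System.nanoTime() - deadlineNanos >= 0) {
                throw new RegexTimeoutException("regex matching exceeded the time limit");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Returns a matcher whose operations fail once timeoutMillis has elapsed. */
    public static Matcher matcher(Pattern pattern, CharSequence input, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        return pattern.matcher(new DeadlineCharSequence(input, deadline));
    }

    public static void main(String[] args) {
        // Illustrative pattern only; an assumption, not the one from ExtractingParseObserver.
        Pattern cssUrl = Pattern.compile("url\\(([^)]*)\\)");
        Matcher m = matcher(cssUrl, "body { background: url(a.png) }", 10_000);
        try {
            while (m.find()) {
                System.out.println(m.group(1));
            }
        } catch (RegexTimeoutException e) {
            // Skip the style block instead of hanging on it.
            System.err.println("skipping CSS extraction: " + e.getMessage());
        }
    }
}
```

With a guard like this the caller could skip the offending style node and continue with the next record. Whether that is acceptable, or whether the pattern itself should be rewritten to avoid the backtracking, would still need to be decided against the real code.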