If a WEATGenerator task fails while uploading the resulting WAT and WET files (e.g., due to a task timeout), an unpaired WAT or WET file may be left behind in the output location. This causes restarted tasks to fail:
17/06/30 19:20:51 INFO mapreduce.Job: Task Id : attempt_1497855985973_0182_m_000097_1001, Status : FAILED
Error: java.io.IOException: org.apache.hadoop.fs.FileAlreadyExistsException: s3a://commoncrawl/crawl-data/CC-MAIN-2017-26/segments/1498128320063.74/wat/CC-MAIN-20170623133357-20170623153357-00390.warc.wat.gz already exists
	at org.archive.hadoop.jobs.WEATGenerator$WEATGeneratorMapper.map(WEATGenerator.java:126)
	at org.archive.hadoop.jobs.WEATGenerator$WEATGeneratorMapper.map(WEATGenerator.java:48)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:459)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: s3a://commoncrawl/crawl-data/CC-MAIN-2017-26/segments/1498128320063.74/wat/CC-MAIN-20170623133357-20170623153357-00390.warc.wat.gz already exists
	at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:633)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:925)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:803)
	at org.archive.hadoop.jobs.WEATGenerator$WEATGeneratorMapper.map(WEATGenerator.java:97)
	... 9 more
As a result, the unpaired file has to be removed manually and the job restarted with a new list of WARC files to be converted to WAT/WET. This manual intervention is slow and error-prone. Ideally, WEATGenerator should log a warning about the unpaired file and overwrite it.
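A minimal sketch of the proposed "log and overwrite" behavior, not the actual WEATGenerator code. To stay self-contained it uses `java.nio.file` on a local temporary file as a stand-in for the S3A output; the class name `OverwriteUnpaired` and the helper `createOrOverwrite` are hypothetical. In the mapper itself the analogous Hadoop calls would presumably be `FileSystem.exists(Path)` followed by `FileSystem.create(Path, true)`, where the boolean enables overwriting.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.logging.Logger;

// Hypothetical illustration: local files stand in for WAT/WET outputs on S3.
class OverwriteUnpaired {
    private static final Logger LOG =
            Logger.getLogger(OverwriteUnpaired.class.getName());

    // Opens the target for writing. If a leftover file from a failed
    // attempt exists, logs a warning and truncates it instead of failing.
    static OutputStream createOrOverwrite(Path target) throws IOException {
        if (Files.exists(target)) {
            LOG.warning("Overwriting leftover output file: " + target);
        }
        return Files.newOutputStream(target,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("weat");
        Path wat = dir.resolve("example.warc.wat.gz");

        // Simulate the unpaired file left behind by a failed task attempt.
        Files.write(wat, "partial".getBytes(StandardCharsets.UTF_8));

        // The restarted attempt replaces the leftover instead of throwing.
        try (OutputStream out = createOrOverwrite(wat)) {
            out.write("complete".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println(
                new String(Files.readAllBytes(wat), StandardCharsets.UTF_8));
    }
}
```

Logging before overwriting keeps a record of which outputs were left behind by failed attempts, so they can still be audited after the job completes.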