WEATGenerator to recover from failed tasks with partial uploads #2

@sebastian-nagel

If a task of WEATGenerator fails while uploading the resulting WAT and WET files (e.g., due to a task timeout), an unpaired WAT or WET file may remain. This causes restarted tasks to fail:

17/06/30 19:20:51 INFO mapreduce.Job: Task Id : attempt_1497855985973_0182_m_000097_1001, Status : FAILED
Error: java.io.IOException: org.apache.hadoop.fs.FileAlreadyExistsException: s3a://commoncrawl/crawl-data/CC-MAIN-2017-26/segments/1498128320063.74/wat/CC-MAIN-20170623133357-20170623153357-00390.warc.wat.gz already exists
        at org.archive.hadoop.jobs.WEATGenerator$WEATGeneratorMapper.map(WEATGenerator.java:126)
        at org.archive.hadoop.jobs.WEATGenerator$WEATGeneratorMapper.map(WEATGenerator.java:48)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:459)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: s3a://commoncrawl/crawl-data/CC-MAIN-2017-26/segments/1498128320063.74/wat/CC-MAIN-20170623133357-20170623153357-00390.warc.wat.gz already exists
        at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:633)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:925)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:803)
        at org.archive.hadoop.jobs.WEATGenerator$WEATGeneratorMapper.map(WEATGenerator.java:97)
        ... 9 more

Recovery then requires manually removing the unpaired file and restarting the job with a new list of WARC files to be converted to WAT/WET. Manual intervention is slow and error-prone. Ideally, WEATGenerator should log the unpaired file and overwrite it.
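To make the proposed "log and overwrite" behaviour concrete, here is a minimal sketch. It is not the WEATGenerator code: it uses `java.nio.file` on the local file system as a stand-in for Hadoop's `FileSystem`, and the class and method names (`PartialUploadRecovery`, `writeReplacing`) are hypothetical. With Hadoop, the analogous call would presumably be `FileSystem.create(Path, boolean overwrite)` with `overwrite` set to `true`, preceded by an `exists` check so that the leftover file gets logged.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.logging.Logger;

// Hypothetical helper, local-file stand-in for the Hadoop output path handling.
class PartialUploadRecovery {
    private static final Logger LOG =
            Logger.getLogger(PartialUploadRecovery.class.getName());

    /**
     * Writes content to target. If a file left over from an earlier
     * (failed) attempt is already there, logs it and replaces it instead
     * of failing. Returns true if a leftover file was replaced.
     */
    static boolean writeReplacing(Path target, byte[] content) throws IOException {
        boolean leftover = Files.exists(target);
        if (leftover) {
            LOG.warning("Overwriting output left by an earlier attempt: " + target);
        }
        // CREATE + TRUNCATE_EXISTING replaces any old content rather than
        // throwing FileAlreadyExistsException (which CREATE_NEW would do).
        try (OutputStream out = Files.newOutputStream(target,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE)) {
            out.write(content);
        }
        return leftover;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("weat-demo");
        Path wat = dir.resolve("example.warc.wat.gz");
        // Simulate an unpaired file left behind by a failed task attempt.
        Files.write(wat, "partial".getBytes(StandardCharsets.UTF_8));
        // The restarted attempt logs the leftover and overwrites it.
        writeReplacing(wat, "complete".getBytes(StandardCharsets.UTF_8));
    }
}
```

The existence check is only there for logging; the overwrite itself does not depend on it, so a file appearing between the check and the write is still replaced.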
