integrity integration by jt55401 · Pull Request #10 · commoncrawl/ia-hadoop-tools

jt55401 · 2024-07-02T21:08:32Z

The intention of this PR is to enable integrity file output.
This will involve:

Modifying WEATGenerator.java to output CDX files.

I've started by laying out TODOs in the places where I think we will need to make changes.

This will need to coordinate with a PR in crawl-tools: https://github.com/commoncrawl/crawl-tools/pull/37

sebastian-nagel · 2024-07-03T08:22:07Z

A few remarks:

The actual work is done by code in commoncrawl/ia-web-commons. Likely, the bulk needs to be implemented there, building upon existing class implementations.
This project has quite a long list of dependencies, most of them in very old versions. Setting up a dev environment to test the job might be painful. It could be easier to move the job definition to a new project with a short dependency list (ia-web-commons, hadoop-client, utilities), eventually also upgrading the job to MapReduce v2.

coordinate with a PR in crawl-tools: https://github.com/commoncrawl/crawl-tools/pull/37

While ia-hadoop-tools is a public repository, crawl-tools isn't. Please keep in mind that it may be annoying for anybody reading about this issue if they cannot read the information given in the linked issue. So all information related to this issue should be shared here or in other public repositories. And, of course, it's possible to link a public repository from a private one, for example to discuss the integration of the new job into internal tools and workflows.


sebastian-nagel

Hi @jt55401, thanks! Just a few comments, didn't try to run it.

sebastian-nagel · 2024-07-28T16:02:37Z

        if(path.endsWith(".gz")) {
          watOutputBasename = inputBasename.substring(0,inputBasename.length()-3) + ".wat.gz";
          wetOutputBasename = inputBasename.substring(0,inputBasename.length()-3) + ".wet.gz";
+          cdxWatOutputBasename = inputBasename.substring(0,inputBasename.length()-3) + ".cdxwat.gz";


This results in:

name.warc.gz name.cdx.gz name.warc.wat.gz name.warc.wat.cdxwat.gz name.warc.wet.gz name.warc.wet.cdxwet.gz

"wat" is given twice

".warc" is removed for CDX files derived from WARC files; it should be the same for WAT/WET files

a CDX file does not follow the WARC format (a WAT or WET file does)

Maybe the following looks better?

name.warc.gz name.cdx.gz name.warc.wat.gz name.wat.cdx.gz name.warc.wet.gz name.wet.cdx.gz
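The naming scheme proposed above could be derived roughly as follows. This is only a sketch: the class `OutputNames` and its methods are hypothetical and not part of ia-hadoop-tools. It keeps ".warc" in the WAT/WET names (as the existing code does) and drops it for the CDX names.

```java
/** Hypothetical helper illustrating the proposed naming; not code from this PR. */
final class OutputNames {
    private OutputNames() {}

    /** Removes a trailing ".gz", if present. */
    static String stripGz(String name) {
        return name.endsWith(".gz") ? name.substring(0, name.length() - 3) : name;
    }

    /** Removes a trailing ".warc", if present. */
    static String stripWarc(String name) {
        return name.endsWith(".warc") ? name.substring(0, name.length() - 5) : name;
    }

    // WAT/WET files follow the WARC format, so ".warc" stays in the name.
    static String wat(String inputBasename) {
        return stripGz(inputBasename) + ".wat.gz";
    }

    static String wet(String inputBasename) {
        return stripGz(inputBasename) + ".wet.gz";
    }

    // CDX files are not WARC files, so ".warc" is dropped,
    // analogous to name.warc.gz -> name.cdx.gz.
    static String watCdx(String inputBasename) {
        return stripWarc(stripGz(inputBasename)) + ".wat.cdx.gz";
    }

    static String wetCdx(String inputBasename) {
        return stripWarc(stripGz(inputBasename)) + ".wet.cdx.gz";
    }
}
```

With this sketch, `OutputNames.watCdx("name.warc.gz")` yields `name.wat.cdx.gz`, while `OutputNames.wat("name.warc.gz")` still yields `name.warc.wat.gz`.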

sebastian-nagel · 2024-07-28T16:03:06Z

        } else {
          watOutputBasename = inputBasename + ".wat.gz";
          wetOutputBasename = inputBasename + ".wet.gz";
+          cdxWatOutputBasename = inputBasename + ".cdxwat.gz";


sebastian-nagel · 2024-07-28T16:10:10Z


        String watOutputFileString = basePath.toString() + "/wat/" + watOutputBasename;
        String wetOutputFileString = basePath.toString() + "/wet/" + wetOutputBasename;
+        String cdxWetOutputFileString = basePath.toString() + "/cdxwet/" + cdxWetOutputBasename;


This is a fixed output path. Do we want to have the CDX files for WAT/WET there?

The configuration for the CDX indexing uses different output paths, cf. https://github.com/commoncrawl/webarchive-indexing/blob/main/run_index_hadoop.sh
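One way to avoid the fixed path would be an optional, configurable CDX base path that falls back to the current layout when unset. The sketch below is an assumption about how that could look: `CdxOutputPaths`, `resolve`, and the parameter names are hypothetical, and the fallback simply mirrors the "/cdxwet/" pattern from the diff.

```java
/** Hypothetical sketch of a configurable CDX output location; not code from this PR. */
final class CdxOutputPaths {
    private CdxOutputPaths() {}

    /**
     * @param basePath     job output base path (as used for "/wat/" and "/wet/")
     * @param cdxBasePath  optional separate base path for CDX files; may be null or empty
     * @param kind         "wat" or "wet"
     * @param cdxBasename  file name of the CDX output
     */
    static String resolve(String basePath, String cdxBasePath, String kind, String cdxBasename) {
        if (cdxBasePath == null || cdxBasePath.isEmpty()) {
            // Fallback: fixed sub-directory below the job output, e.g. <base>/cdxwet/
            return join(join(basePath, "cdx" + kind), cdxBasename);
        }
        // Assumption: with an explicit CDX base path, files go directly below it.
        return join(cdxBasePath, cdxBasename);
    }

    /** Joins a directory and a name with exactly one slash in between. */
    private static String join(String dir, String name) {
        return dir.endsWith("/") ? dir + name : dir + "/" + name;
    }
}
```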

jt55401 · 2024-08-04T19:10:37Z

@sebastian-nagel - I've made a few commits and this code should be more to your liking.

filenames should be more consistent now
new config param for cdx base path
fixed a few bugs preventing compile

This project has quite a long list of dependencies, most of them in very old versions. Setting up a dev environment to test the job might be painful. It could be easier to move the job definition to a new project with a short dependency list (ia-web-commons, hadoop-client, utilities), eventually also upgrading the job to MapReduce v2.

I compiled and tested this with a pretty vanilla Java 11 environment, and everything seemed to work fine. The only issue I ran into is that the internetarchive Maven repo has numerous http (as opposed to https) dependencies, which Maven blocks by default. I overrode this behavior in my local Maven settings, and everything worked fine.

in ~/.m2/settings.xml:

<settings xmlns="http://maven.apache.org/SETTINGS/1.2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.2.0 http://maven.apache.org/xsd/settings-1.2.0.xsd">
  <mirrors>
    <mirror>
      <id>maven-default-http-blocker</id>
      <mirrorOf>dummy</mirrorOf>
      <name>Dummy mirror to override default blocking mirror that blocks http</name>
      <url>http://0.0.0.0/</url>
    </mirror>
  </mirrors>
</settings>

sebastian-nagel · 2024-09-03T11:00:56Z

Hi @jt55401, I've run a test on a Hadoop single-node cluster.

The job run by

yarn jar target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -outputCDX -cdxBasePath cdx t03 warc/CC-MAIN-20240412194614-20240412224614-00370.warc.gz

finished with status success. However, the generated CDX files index the WARC file, not the WAT or WET file:

 CDX N b a m s k r M S V g
warcinfo:/CC-MAIN-20240412194614-20240412224614-00370.warc.gz/ia-web-commons.1.1.10-SNAPSHOT-20240903092205 20240412194614 warcinfo:/CC-MAIN-20240412194614-20240412224614-00370.warc.gz/ia-web-commons.1.1.10-SNAPSHOT-20240903092205 warc-info - - - - 471 0 CC-MAIN-20240412194614-20240412224614-00370.warc.gz
com,kristinroksphotography,0f)/list-959.html 20240412221412 http://0f.kristinroksphotography.com/list-959.html warc/request - - - - 441 471 CC-MAIN-20240412194614-20240412224614-00370.warc.gz
com,kristinroksphotography,0f)/list-959.html 20240412221412 http://0f.kristinroksphotography.com/list-959.html text/html 200 GWZOIQR42OCBZEQAOHVP423CK3NTWBZB - - 25173 912 CC-MAIN-20240412194614-20240412224614-00370.warc.gz

This needs to be fixed. I'd start by trying to implement this in ia-web-commons, extending the class org.archive.extract.WATExtractorOutput (or WETExtractorOutput, respectively) to, say, WatCdxExtractorOutput. I'm not 100% sure whether this approach works; it needs a try.

I've observed three more points which can be ignored for now:

the CDX is not compressed
the CDX is not in CDXJ format
both CDX files are the same, except that the *.wet.cdx.gz is missing the last CDX record - I'm unable to explain why
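The main problem (the CDX pointing at the WARC file instead of the WAT or WET file) can be checked mechanically by looking at the last field of each CDX line. The following is a minimal sketch, assuming the space-separated 11-field layout announced by the header " CDX N b a m s k r M S V g", where in the usual CDX legend "S" is the compressed record size, "V" the offset, and "g" the file name. The class `CdxLine` and its methods are hypothetical, not part of ia-web-commons.

```java
import java.util.List;

/** Hypothetical helper for inspecting CDX lines; not part of ia-web-commons. */
final class CdxLine {
    private CdxLine() {}

    // Field keys in the order given by the header line " CDX N b a m s k r M S V g".
    static final List<String> LEGEND =
            List.of("N", "b", "a", "m", "s", "k", "r", "M", "S", "V", "g");

    /** Returns the value of the field with the given legend key. */
    static String field(String cdxLine, String key) {
        String[] parts = cdxLine.trim().split("\\s+");
        int index = LEGEND.indexOf(key);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown CDX field key: " + key);
        }
        if (parts.length != LEGEND.size()) {
            throw new IllegalArgumentException(
                    "Expected " + LEGEND.size() + " fields, got " + parts.length);
        }
        return parts[index];
    }

    /** True if the file name field ("g") ends with the expected suffix, e.g. ".warc.wat.gz". */
    static boolean indexesFile(String cdxLine, String expectedSuffix) {
        return field(cdxLine, "g").endsWith(expectedSuffix);
    }
}
```

Applied to the response record shown above, `CdxLine.indexesFile(line, ".warc.wat.gz")` returns false, because the file name field still names the `.warc.gz` input.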

jt55401 · 2024-09-03T14:37:59Z

OK, I will take a look @sebastian-nagel , thank you.

Did you just run single node hadoop with config in our nutch project? (and feed it some small seed list of 1 site or something?) or did you do something more to test this? (I will try to get some time set aside to set this up for myself as well)

sebastian-nagel · 2024-09-05T19:56:04Z

A plain single-node setup with minimal configuration, see nutch-test-single-node-cluster but without Nutch installed. For testing I took one WARC file from April 2024 and copied it from local disk to HDFS via:

hadoop fs -mkdir -p /user/$USER/warc
hadoop fs -copyFromLocal CC-MAIN-20240412194614-20240412224614-00370.warc.gz warc/

See above for the command to launch the job. Output is then in hdfs:/user/$USER/{cdx,wat,wet}/

jt55401 · 2024-09-26T13:31:58Z

Move into ia-web-commons.
This really will need to be its own step in the crawl, since the files need to exist so we get offsets, length, etc.

laying down TODO's for integrity integration

cb6188f

jt55401 changed the title from "laying down TODO's for integrity integration" to "integrity integration" on Jul 3, 2024

Added CDX generation (-outputCDX) to the WEATGenerator - may not compile yet, local env not available today.

42f6479

sebastian-nagel reviewed Jul 28, 2024


jt55401 added 2 commits August 4, 2024 18:48

Fixing a few java bugs

10f91c0

fix up paths/naming per Sebastian's comments

e734d5b

sebastian-nagel mentioned this pull request Oct 7, 2024

Remove unneeded classes and their dependencies #5

Merged

jt55401 wants to merge 4 commits into master from enable-integrity
