
integrity integration #10

Draft
jt55401 wants to merge 4 commits into master from enable-integrity

jt55401 commented Jul 2, 2024

The intention of this PR is to enable integrity file output.
This will involve:

  1. modifying WEATGenerator.java to output cdx files.

I've started by laying out TODOs in the places where I think we will need to make changes.

This will need to coordinate with a PR in crawl-tools: https://github.com/commoncrawl/crawl-tools/pull/37

jt55401 changed the title from "laying down TODO's for integrity integration" to "integrity integration" Jul 3, 2024
@sebastian-nagel

A few remarks:

  • the actual work is done by code in commoncrawl/ia-web-commons. Most likely, the bulk of the change needs to be implemented there, building upon the existing class implementations.
  • this project has quite a long list of dependencies, most of them at very old versions. Setting up a dev environment to test the job might be painful. It could be easier to move the job definition to a new project with a short dependency list (ia-web-commons, hadoop-client, utilities), eventually also upgrading the job to MapReduce v2.

coordinate with a PR in crawl-tools: https://github.com/commoncrawl/crawl-tools/pull/37

While ia-hadoop-tools is a public repository, crawl-tools isn't. Please keep in mind that it may be annoying for anybody reading this PR if they cannot access the information in the linked PR. So all information related to this change should be shared here or in other public repositories. And, of course, it's possible to link to a public repository from a private one, for example to discuss the integration of the new job into internal tools and workflows.

sebastian-nagel left a comment

Hi @jt55401, thanks! Just a few comments, didn't try to run it.

if(path.endsWith(".gz")) {
watOutputBasename = inputBasename.substring(0,inputBasename.length()-3) + ".wat.gz";
wetOutputBasename = inputBasename.substring(0,inputBasename.length()-3) + ".wet.gz";
cdxWatOutputBasename = inputBasename.substring(0,inputBasename.length()-3) + ".cdxwat.gz";

This results in:

name.warc.gz      name.cdx.gz
name.warc.wat.gz  name.warc.wat.cdxwat.gz
name.warc.wet.gz  name.warc.wet.cdxwet.gz
  • "wat" is given twice
  • ".warc" is removed for CDX files derived from WARC files
    • should be the same for WAT/WET files
    • a CDX file does not follow the WARC format
    • (a WAT or WET file does)

Maybe the following looks better?

name.warc.gz      name.cdx.gz
name.warc.wat.gz  name.wat.cdx.gz
name.warc.wet.gz  name.wet.cdx.gz
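
The proposed naming scheme can be sketched as below. This is only an illustration, not the code in this PR: the class `OutputNames` and its methods are hypothetical names. WAT/WET names keep the ".warc" part as in the existing code, while CDX names drop it.

```java
public class OutputNames {

    /** Removes the suffix if present, otherwise returns the name unchanged. */
    public static String stripSuffix(String name, String suffix) {
        return name.endsWith(suffix)
                ? name.substring(0, name.length() - suffix.length())
                : name;
    }

    /** WAT/WET file name, e.g. name.warc.gz -> name.warc.wat.gz (kind is "wat" or "wet"). */
    public static String extractName(String inputBasename, String kind) {
        return stripSuffix(inputBasename, ".gz") + "." + kind + ".gz";
    }

    /** CDX file name for a WAT/WET file, e.g. name.warc.gz -> name.wat.cdx.gz. */
    public static String cdxName(String inputBasename, String kind) {
        // Drop ".gz" first, then ".warc", so that "wat"/"wet" appears only once
        // and the CDX name does not claim to be a WARC file.
        String stem = stripSuffix(stripSuffix(inputBasename, ".gz"), ".warc");
        return stem + "." + kind + ".cdx.gz";
    }

    public static void main(String[] args) {
        String input = "name.warc.gz";
        for (String kind : new String[] {"wat", "wet"}) {
            System.out.println(extractName(input, kind) + "  " + cdxName(input, kind));
        }
    }
}
```

Handling the ".gz" and ".warc" suffixes in one helper would also cover the non-gzipped branch below without a separate code path.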

} else {
watOutputBasename = inputBasename + ".wat.gz";
wetOutputBasename = inputBasename + ".wet.gz";
cdxWatOutputBasename = inputBasename + ".cdxwat.gz";

See above.


String watOutputFileString = basePath.toString() + "/wat/" + watOutputBasename;
String wetOutputFileString = basePath.toString() + "/wet/" + wetOutputBasename;
String cdxWetOutputFileString = basePath.toString() + "/cdxwet/" + cdxWetOutputBasename;

This is a fixed output path. Do we want to have the CDX files for WAT/WET there?

The configuration for the CDX indexing uses different output paths, cf. https://github.com/commoncrawl/webarchive-indexing/blob/main/run_index_hadoop.sh

jt55401 (Author) commented Aug 4, 2024

@sebastian-nagel - I've made a few commits and this code should be more to your liking.

  1. filenames should be more consistent now
  2. new config param for cdx base path
  3. fixed a few bugs preventing compile

this project has a quite long list of dependencies, most of them in very old versions. Setting up a dev environment to test the job might be painful. It could be easier to move the job definition to a new project with a short dependency list ( ia-web-commons, hadoop-client, utilities), eventually also upgrading the job to MapReduce v2.

I compiled and tested this with a pretty vanilla Java 11 environment, and everything seemed to work fine. The only issue I ran into is that the internetarchive Maven repo has numerous http (as opposed to https) dependencies, which Maven blocks by default. I overrode this behavior in my local Maven settings, and everything worked fine.

in ~/.m2/settings.xml:

<settings xmlns="http://maven.apache.org/SETTINGS/1.2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.2.0 http://maven.apache.org/xsd/settings-1.2.0.xsd">
  <mirrors>
    <mirror>
      <id>maven-default-http-blocker</id>
      <mirrorOf>dummy</mirrorOf>
      <name>Dummy mirror to override default blocking mirror that blocks http</name>
      <url>http://0.0.0.0/</url>
    </mirror>
  </mirrors>
</settings>

@sebastian-nagel

Hi @jt55401, I've run a test on a Hadoop single-node cluster.

The job run by

yarn jar target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -outputCDX -cdxBasePath cdx t03 warc/CC-MAIN-20240412194614-20240412224614-00370.warc.gz

finished with status success. However, the generated CDX files index the WARC file, not the WAT or WET file, respectively:

 CDX N b a m s k r M S V g
warcinfo:/CC-MAIN-20240412194614-20240412224614-00370.warc.gz/ia-web-commons.1.1.10-SNAPSHOT-20240903092205 20240412194614 warcinfo:/CC-MAIN-20240412194614-20240412224614-00370.warc.gz/ia-web-commons.1.1.10-SNAPSHOT-20240903092205 warc-info - - - - 471 0 CC-MAIN-20240412194614-20240412224614-00370.warc.gz
com,kristinroksphotography,0f)/list-959.html 20240412221412 http://0f.kristinroksphotography.com/list-959.html warc/request - - - - 441 471 CC-MAIN-20240412194614-20240412224614-00370.warc.gz
com,kristinroksphotography,0f)/list-959.html 20240412221412 http://0f.kristinroksphotography.com/list-959.html text/html 200 GWZOIQR42OCBZEQAOHVP423CK3NTWBZB - - 25173 912 CC-MAIN-20240412194614-20240412224614-00370.warc.gz

This needs to be fixed. I'd start by trying to implement this in ia-web-commons, extending the class org.archive.extract.WATExtractorOutput (or WETExtractorOutput) to, say, WatCdxExtractorOutput. I'm not 100% sure whether this approach works; it needs a try.
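
For checking a fixed version, a small helper that reads the CDX lines and looks at the file name field may be handy. The following is a minimal sketch, assuming the space-separated legacy CDX format with the header shown above, where (per the CDX legend) S is the compressed record size, V the compressed file offset and g the file name. The class `CdxLine` and its methods are hypothetical and not part of ia-web-commons.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CdxLine {

    /** Maps the field letters of a " CDX ..." header line to the values of one data line. */
    public static Map<String, String> parse(String header, String line) {
        String[] keys = header.trim().split(" +");   // keys[0] is the literal "CDX"
        String[] values = line.trim().split(" +");
        if (values.length != keys.length - 1) {
            throw new IllegalArgumentException(
                    "expected " + (keys.length - 1) + " fields, got " + values.length);
        }
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < values.length; i++) {
            record.put(keys[i + 1], values[i]);
        }
        return record;
    }

    /** True if the file name field (g) of the record ends with the given suffix. */
    public static boolean refersTo(Map<String, String> record, String suffix) {
        return record.get("g").endsWith(suffix);
    }

    public static void main(String[] args) {
        String header = " CDX N b a m s k r M S V g";
        // Third data line of the CDX output shown above.
        String line = "com,kristinroksphotography,0f)/list-959.html 20240412221412 "
                + "http://0f.kristinroksphotography.com/list-959.html text/html 200 "
                + "GWZOIQR42OCBZEQAOHVP423CK3NTWBZB - - 25173 912 "
                + "CC-MAIN-20240412194614-20240412224614-00370.warc.gz";
        Map<String, String> record = parse(header, line);
        System.out.println("file=" + record.get("g")
                + " offset=" + record.get("V") + " length=" + record.get("S"));
        // For a CDX that indexes the WAT file, this should become true.
        System.out.println("indexes WAT: " + refersTo(record, ".warc.wat.gz"));
    }
}
```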

I've observed three more points which can be ignored for now:

  • the CDX is not compressed
  • it is not in CDXJ format
  • both CDX files are the same, except that the *.wet.cdx.gz is missing the last CDX record - I'm unable to explain why

jt55401 (Author) commented Sep 3, 2024

OK, I will take a look, @sebastian-nagel, thank you.

Did you just run a single-node Hadoop with the config in our Nutch project (and feed it a small seed list of one site or something)? Or did you do something more to test this? (I will try to set aside some time to set this up for myself as well.)

sebastian-nagel commented Sep 5, 2024

A plain single-node setup with minimal configuration, see nutch-test-single-node-cluster but without Nutch installed. For testing I took one WARC file from April 2024 and copied it from local disk to HDFS via:

hadoop fs -mkdir -p /user/$USER/warc
hadoop fs -copyFromLocal CC-MAIN-20240412194614-20240412224614-00370.warc.gz warc/

See above for the command to launch the job. Output is then in hdfs:/user/$USER/{cdx,wat,wet}/

jt55401 (Author) commented Sep 26, 2024

  1. move into ia-web-commons
  2. this really will need to be its own step in the crawl, since the files need to exist so we get offsets, lengths, etc.
