commoncrawl/cc-warcinfo-index-builder

cc-warcinfo-index-builder

For each crawl, generate a Parquet file with the following fields:

  • warcinfo_id
  • warc_filename

The make all-warcinfo step runs one extractor per crawl. On the first run, the first crawl extraction finished in 1h 35m and the last in 6h 56m.

A copy of the actual index can be found on rf:/home/cc-pds/warcinfo-id.parquet

Why?

When WARC records are repackaged into different WARC files, you sometimes need to find out which WARC file a record originally came from. This index answers that question.

How to query

See the test code in test_pandas.py and test_duck.py.
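As a rough illustration of a lookup against the two fields listed above, here is a minimal pandas sketch. It is not taken from test_pandas.py. The helper names `load_index` and `find_warc_filename` are hypothetical, the path passed to `load_index` is whatever local copy of the index you have, and the sketch assumes each warcinfo_id appears at most once. The exact string format of the stored IDs is not specified here, so check a few rows first.

```python
import pandas as pd


def load_index(path):
    """Load only the two index columns from a Parquet file (path is your local copy)."""
    return pd.read_parquet(path, columns=["warcinfo_id", "warc_filename"])


def find_warc_filename(index, warcinfo_id):
    """Return the warc_filename recorded for warcinfo_id, or None if it is absent."""
    matches = index.loc[index["warcinfo_id"] == warcinfo_id, "warc_filename"]
    # Assumption: IDs are unique, so the first match is the only match.
    return matches.iloc[0] if not matches.empty else None
```

For many lookups it is probably cheaper to build a dict or set the DataFrame index once than to scan the column on every call.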

Updating the index

The code uses smart_open() to read the initial part of every WARC file and extract the first record, which should be the warcinfo record.
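The idea of reading only the head of a WARC file can be sketched with the standard library alone. This is not the repository's extractor: the function names are hypothetical, the stream is any binary file-like object holding gzip-compressed WARC data, and the sketch assumes the warcinfo record carries `WARC-Record-ID` and `WARC-Filename` headers.

```python
import zlib


def read_first_warc_headers(stream, chunk_size=4096):
    """Decompress just enough of a gzipped WARC stream to parse the first record's headers."""
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)  # 16 selects gzip framing
    buf = b""
    while b"\r\n\r\n" not in buf:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += decomp.decompress(chunk)
        if decomp.eof:  # end of the first gzip member, i.e. the first record
            break
    head = buf.partition(b"\r\n\r\n")[0]
    lines = head.decode("utf-8", errors="replace").split("\r\n")
    if not lines[0].startswith("WARC/"):
        raise ValueError("stream does not start with a WARC record")
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def extract_warcinfo(stream):
    """Return (record id, filename) from the first record, which must be a warcinfo record."""
    headers = read_first_warc_headers(stream)
    if headers.get("warc-type") != "warcinfo":
        raise ValueError("first record is not a warcinfo record")
    return headers.get("warc-record-id"), headers.get("warc-filename")
```

Because the loop stops as soon as the header block is complete, only the first few kilobytes of each file need to cross the network. A remote file object opened in binary mode with any automatic decompression turned off should work the same way, though that is untested here.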

The code does not re-download anything it already has, and it runs one extractor per crawl in parallel. Each extractor needs only about 3% of a core, but network latency can stretch a single crawl to as long as 7 hours. When many crawls run in parallel, the slowest can take much longer than the fastest.

make collinfo
make all-crawls
make all-warcinfo
make parquet
make test
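The repository drives these steps through make. Purely as an illustration of the "skip what is already done, run the rest in parallel" behavior described above, here is a Python sketch of the same idea. Everything in it is hypothetical: the `.done` marker files, the function names, and the per-crawl granularity (the real code may track completed work at a finer level).

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def pending_crawls(crawls, out_dir):
    """Return the crawls that have no completion marker in out_dir yet."""
    out_dir = Path(out_dir)
    return [c for c in crawls if not (out_dir / f"{c}.done").exists()]


def run_all(crawls, out_dir, extract_one, max_workers=None):
    """Run extract_one(crawl) concurrently for every pending crawl; return those crawls."""
    out_dir = Path(out_dir)
    todo = pending_crawls(crawls, out_dir)
    if not todo:
        return []

    def job(crawl):
        extract_one(crawl)
        # Write the marker only after success, so a failed crawl is retried next run.
        (out_dir / f"{crawl}.done").touch()
        return crawl

    # Threads are enough here because the work is network-bound, not CPU-bound.
    with ThreadPoolExecutor(max_workers=max_workers or len(todo)) as pool:
        return list(pool.map(job, todo))
```

Running one worker per crawl matches the low CPU cost noted above; total wall time is then set by the slowest crawl.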

To add a single new crawl, edit the Makefile to change the CRAWL variable, then run:

make one-paths
make one-warcinfo
make parquet
make test

Install

Once you are happy with the result, copy the index into place:

make install

About

Code to build an index that maps warcinfo-id to crawl / warc
