cc-warcinfo-index-builder

For each crawl, generate a Parquet file with the following fields:

warcinfo_id
warc_filename

The make all-warcinfo step runs one extractor per crawl. On the first run, the fastest crawl extraction finished in 1h 35m and the slowest in 6h 56m.

A copy of the actual index is available at rf:/home/cc-pds/warcinfo-id.parquet

Why?

When WARC records are repackaged into different WARC files, you sometimes need to find out which WARC file a record originally came from. This index answers that question.

How to query

See the test code in test_pandas.py and test_duck.py for examples.
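As a rough illustration of the kind of lookup the index supports, here is a minimal pandas sketch. It is not taken from the test files. The function name lookup_warc_filename, the sample IDs, and the sample filenames are all invented for the example; only the two column names come from this README.

```python
import pandas as pd


def lookup_warc_filename(index: pd.DataFrame, warcinfo_id: str) -> list:
    """Return every warc_filename whose warcinfo_id matches (hypothetical helper)."""
    matches = index.loc[index["warcinfo_id"] == warcinfo_id, "warc_filename"]
    return matches.tolist()


# Tiny stand-in for the real index; these IDs and filenames are made up.
sample_index = pd.DataFrame(
    {
        "warcinfo_id": ["<urn:uuid:aaaa>", "<urn:uuid:bbbb>"],
        "warc_filename": ["example-00000.warc.gz", "example-00001.warc.gz"],
    }
)

print(lookup_warc_filename(sample_index, "<urn:uuid:bbbb>"))
```

Against the real index you would presumably load the frame with pd.read_parquet("warcinfo-id.parquet"), which needs a Parquet engine such as pyarrow. Whether the stored IDs include the surrounding angle brackets is an assumption here; check the test code for the exact form.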

Updating the index

The code uses smart_open() to read only the initial part of every WARC file and extracts the first record, which should be the warcinfo record.
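The idea of parsing the first record from only the leading bytes of a file can be sketched with the standard library alone. This is not the repository's code: it skips smart_open() entirely and assumes the leading bytes have already been fetched, that the file is a .warc.gz with one gzip member per record, and that the prefix is long enough to cover the whole header block. The function name first_record_headers and the sample record are invented.

```python
import gzip
import zlib


def first_record_headers(prefix: bytes) -> dict:
    """Parse the WARC headers of the first record from the leading bytes of a .warc.gz."""
    # 16 + MAX_WBITS tells zlib to expect a gzip wrapper. A decompressobj
    # tolerates input that is cut off, which is what a partial read looks like,
    # and it stops at the end of the first gzip member.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    data = decompressor.decompress(prefix)
    header_block, _, _ = data.partition(b"\r\n\r\n")
    lines = header_block.decode("utf-8", errors="replace").split("\r\n")
    if not lines[0].startswith("WARC/"):
        raise ValueError("not a WARC record")
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only; values such as URNs contain colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


# Invented sample record, compressed the way a .warc.gz member would be.
sample_record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: warcinfo\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
    b"WARC-Filename: example-00000.warc.gz\r\n"
    b"Content-Length: 0\r\n"
    b"\r\n"
    b"\r\n\r\n"
)
sample_prefix = gzip.compress(sample_record)

headers = first_record_headers(sample_prefix)
print(headers["WARC-Type"], headers["WARC-Record-ID"])
```

Checking that WARC-Type is warcinfo before trusting the record ID guards against files whose first record is something else.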

The code skips anything it has already downloaded and runs one extractor per crawl in parallel. Each extractor needs only about 3% of a core, but network latency can stretch a single crawl to as long as 7 hours. When many crawls run in parallel, the slowest can take much longer than the fastest.

To update all crawls, run:

make collinfo
make all-crawls
make all-warcinfo
make parquet
make test
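The skip-what-is-done and one-extractor-per-crawl behavior described above can be sketched as follows. This only illustrates the pattern; how the Makefile and make-warcinfo-index.py actually implement it is not shown in this README. The names extract_crawl and run_all, the ".done" marker files, and the crawl names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import tempfile


def extract_crawl(crawl: str, out_dir: Path) -> str:
    """Hypothetical per-crawl extractor: skip if the output exists, else create it."""
    out_path = out_dir / f"{crawl}.done"
    if out_path.exists():
        return "skipped"
    # Placeholder for the real work (reading WARC prefixes over the network).
    out_path.write_text(crawl)
    return "extracted"


def run_all(crawls, out_dir: Path) -> dict:
    """Run one extractor per crawl concurrently and report what each one did."""
    # Threads suit this workload because each extractor mostly waits on the network.
    with ThreadPoolExecutor(max_workers=max(1, len(crawls))) as pool:
        results = pool.map(lambda c: extract_crawl(c, out_dir), crawls)
        return dict(zip(crawls, results))


with tempfile.TemporaryDirectory() as tmp:
    crawls = ["CC-MAIN-2024-10", "CC-MAIN-2024-18"]  # example crawl names
    first = run_all(crawls, Path(tmp))
    second = run_all(crawls, Path(tmp))  # nothing left to do on the rerun
    print(first)
    print(second)
```

Because finished work is detected from existing output, an interrupted run can simply be restarted.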

To add a single new crawl, set the CRAWL variable in the Makefile, then run:

make one-paths
make one-warcinfo
make parquet
make test

Install

Once you are happy with the result, copy the index into place:

make install
