Name	Name	Last commit message	Last commit date
Latest commit History 11 Commits
conf	conf
LICENSE	LICENSE
README.md	README.md
crawl.sh	crawl.sh

Name

Last commit message

Last commit date

Example Usage of Common Crawl's Fork of Apache Nutch to Crawl and Write WARC files

A short description how to set up Common Crawl's Fork of Apache Nutch for crawling and to store the crawled content in WARC files.

Requirements and installation

Linux (tested on Ubuntu 24.04)
Java 11 (higher Java versions should also work)
ant and maven
Compact Language Detector 2

sudo apt install libcld2-0 libcld2-dev ant maven

Compile Nutch and required projects

git clone git@github.com:crawler-commons/crawler-commons.git
cd crawler-commons/
mvn install
cd ..

git clone git@github.com:commoncrawl/language-detection-cld2.git
cd language-detection-cld2/
mvn install
cd ..

git clone https://github.com/commoncrawl/nutch.git nutch-cc
cd nutch-cc/
ant runtime
cd ..

Configuration

Edit the configuration file in the conf/nutch-site.xml:

It's required to configure at least the property http.agent.name in the file conf/nutch-site.xml.

Now there are two options:

If it's ensured that the script crawl.sh is kept in this project's root folder, continue with Run crawl. The configuration directory is added in front of the Java classpath, and the nutch-site.xml is picked from the first occurrence on the classpath.
If you plan to move scripts around or need more configurations, e.g., adapt the URL filter configuration files to your use case, then copy the file into nutch-cc/conf/:
```
cp -p conf/nutch-site.xml nutch-cc/conf/
```
After having done all configuration changes, Nutch needs to be recompiled because configuration files are contained in the job file (runtime/local/apache-nutch-*.job):
```
cd nutch-cc/
ant runtime
cd ..
```

Run crawl

echo -e "https://nutch.apache.org/\tnutch.score=1.0" >urls.txt

./crawl.sh crawl 3 urls.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Example Usage of Common Crawl's Fork of Apache Nutch to Crawl and Write WARC files

Requirements and installation

Compile Nutch and required projects

Configuration

Run crawl

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Example Usage of Common Crawl's Fork of Apache Nutch to Crawl and Write WARC files

Requirements and installation

Compile Nutch and required projects

Configuration

Run crawl

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages