Example Usage of Common Crawl's Fork of Apache Nutch to Crawl and Write WARC files

A short description of how to set up Common Crawl's fork of Apache Nutch for crawling and how to store the crawled content in WARC files.

Requirements and installation

sudo apt install libcld2-0 libcld2-dev ant maven
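Before compiling, it can help to confirm that the command-line tools used in the following steps are on the PATH. The snippet below is a minimal sketch: the helper name missing_tools is hypothetical, and the list of tools (git, ant, mvn) is taken from the commands in this README.

```shell
# Print the name of every given command that is not found on PATH.
# (missing_tools is a hypothetical helper, not part of Nutch.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# git, ant and mvn are the commands used in the build steps of this README.
missing="$(missing_tools git ant mvn)"
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All build tools found."
fi
```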

Compile Nutch and required projects

git clone git@github.com:crawler-commons/crawler-commons.git
cd crawler-commons/
mvn install
cd ..

git clone git@github.com:commoncrawl/language-detection-cld2.git
cd language-detection-cld2/
mvn install
cd ..

git clone https://github.com/commoncrawl/nutch.git nutch-cc
cd nutch-cc/
ant runtime
cd ..

Configuration

Edit the configuration file conf/nutch-site.xml:

  • At a minimum, the property http.agent.name must be set in conf/nutch-site.xml.
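As an illustration of the property layout, the sketch below writes a minimal configuration containing only http.agent.name, using the Hadoop-style name/value XML format that Nutch configuration files follow. The agent name my-test-crawler and the output file name nutch-site.example.xml are assumptions chosen for this example; the file is written under a separate name so that an existing conf/nutch-site.xml is not overwritten.

```shell
# Hypothetical agent name; replace it with a name that identifies your crawler.
AGENT_NAME="my-test-crawler"
# Scratch file name (an assumption) so conf/nutch-site.xml is left untouched.
OUT="nutch-site.example.xml"

# Write a minimal configuration with a single property.
cat > "$OUT" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>${AGENT_NAME}</value>
  </property>
</configuration>
EOF

# Show the property name and the value line that follows it.
grep -A1 '<name>http.agent.name</name>' "$OUT"
```

The property block can then be merged by hand into conf/nutch-site.xml alongside any properties already defined there.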

Now there are two options:

  1. If the script crawl.sh stays in this project's root folder, continue with Run crawl. The configuration directory is placed at the front of the Java classpath, so its nutch-site.xml is the first occurrence on the classpath and is the one picked up.
  2. If you plan to move scripts around or need further configuration changes, e.g., adapting the URL filter configuration files to your use case, copy the file into nutch-cc/conf/:
    cp -p conf/nutch-site.xml nutch-cc/conf/
    
    Once all configuration changes are made, Nutch needs to be recompiled because the configuration files are contained in the job file (runtime/local/apache-nutch-*.job):
    cd nutch-cc/
    ant runtime
    cd ..
    

Run crawl

echo -e "https://nutch.apache.org/\tnutch.score=1.0" >urls.txt

./crawl.sh crawl 3 urls.txt
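
The seed file created above holds one URL per line, optionally followed by tab-separated metadata such as nutch.score. A seed list with several entries can be sketched with printf, which behaves the same across shells where echo -e may not. The second URL, its score, and the file name urls.example.txt are assumptions for illustration only.

```shell
# Each pair of arguments fills one "URL<TAB>nutch.score=<score>" line;
# printf repeats the format string until all arguments are consumed.
# The second URL, its score, and the output file name are hypothetical.
printf '%s\tnutch.score=%s\n' \
  "https://nutch.apache.org/" "1.0" \
  "https://commoncrawl.org/" "0.5" > urls.example.txt

# Print the resulting seed list.
cat urls.example.txt
```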