Skip to content

commoncrawl/cc-nutch-example

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Example Usage of Common Crawl's Fork of Apache Nutch to Crawl and Write WARC files

A short description how to set up Common Crawl's Fork of Apache Nutch for crawling and to store the crawled content in WARC files.

Requirements and installation

sudo apt install libcld2-0 libcld2-dev ant maven

Compile Nutch and required projects

git clone git@github.com:crawler-commons/crawler-commons.git
cd crawler-commons/
mvn install
cd ..

git clone git@github.com:commoncrawl/language-detection-cld2.git
cd language-detection-cld2/
mvn install
cd ..

git clone https://github.com/commoncrawl/nutch.git nutch-cc
cd nutch-cc/
ant runtime
cd ..

Configuration

Edit the configuration file in the conf/nutch-site.xml:

  • It's required to configure at least the property http.agent.name in the file conf/nutch-site.xml.

Now there are two options:

  1. If it's ensured that the script crawl.sh is kept in this project's root folder, continue with Run crawl. The configuration directory is added in front of the Java classpath, and the nutch-site.xml is picked from the first occurrence on the classpath.
  2. If you plan to move scripts around or need more configurations, e.g., adapt the URL filter configuration files to your use case, then copy the file into nutch-cc/conf/:
    cp -p conf/nutch-site.xml nutch-cc/conf/
    
    After having done all configuration changes, Nutch needs to be recompiled because configuration files are contained in the job file (runtime/local/apache-nutch-*.job):
    cd nutch-cc/
    ant runtime
    cd ..
    

Run crawl

echo -e "https://nutch.apache.org/\tnutch.score=1.0" >urls.txt

./crawl.sh crawl 3 urls.txt

About

Apache Nutch example project to archive content in WARC files

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages