diff --git a/README.md b/README.md
index 902d87ceee..07af23b4d1 100644
--- a/README.md
+++ b/README.md
@@ -1,16 +1,17 @@
 Common Crawl Fork of Apache Nutch
 =================================
 
-Please also have a look at the [Apache Nutch](/apache/nutch) repository and all information about Apache Nutch given below.
+Please also have a look at the [Apache Nutch](https://github.com/apache/nutch) repository and all information about Apache Nutch given below.
 
 Notable additions in Common Crawl's fork of Nutch (not yet pushed to upstream Nutch although this is planned):
 - WARC and CDX writer integrated into Fetcher and able to detect the language of HTML pages using the CLD2 language detector
 - [Generator2](src/java/org/apache/nutch/crawl/Generator2.java): alternative implementation of Generator
   - allowing to combine per-domain and per-host limits and
   - optimized to create many (eg. 100) segments in a single job
+- Unused plugins disabled in `build.xml`, to achieve a considerably more lightweight installation for our massively parallel setup.
 
 How to install additional requirements to build this fork of Nutch:
-- [crawler-commons](/crawler-commons/crawler-commons) development snapshot package:
+- [crawler-commons](https://github.com/crawler-commons/crawler-commons) development snapshot package:
   ```
   git clone https://github.com/crawler-commons/crawler-commons.git
   cd crawler-commons/
@@ -20,7 +21,7 @@ How to install additional requirements to build this fork of Nutch:
   ```
   wget https://publicsuffix.org/list/public_suffix_list.dat -O conf/effective_tld_names.dat
   ```
-- [Java wrapper for CLD2 language detection](/commoncrawl/language-detection-cld2)
+- [Java wrapper for CLD2 language detection](https://github.com/commoncrawl/language-detection-cld2)
   ```
   git clone https://github.com/commoncrawl/language-detection-cld2.git
   cd language-detection-cld2/
@@ -31,6 +32,8 @@ How to install additional requirements to build this fork of Nutch:
   sudo apt install libcld2-0 libcld2-dev
   ```
 
+- An example for running this version can be found [here](https://github.com/commoncrawl/cc-nutch-example).
+
 Apache Nutch
 ============