From b98c1abff71d3c2926181db5f40c13994522df6e Mon Sep 17 00:00:00 2001
From: Hande Celikkanat <7702228+handecelikkanat@users.noreply.github.com>
Date: Wed, 30 Jul 2025 15:35:58 +0300
Subject: [PATCH 1/3] docs: Add comment about disabled plugins. Fix broken Apache Nutch link.

---
 README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 902d87ceee..43fd8685fc 100644
--- a/README.md
+++ b/README.md
@@ -1,13 +1,14 @@
 Common Crawl Fork of Apache Nutch
 =================================
 
-Please also have a look at the [Apache Nutch](/apache/nutch) repository and all information about Apache Nutch given below.
+Please also have a look at the [Apache Nutch](https://github.com/apache/nutch) repository and all information about Apache Nutch given below.
 
 Notable additions in Common Crawl's fork of Nutch (not yet pushed to upstream Nutch although this is planned):
 - WARC and CDX writer integrated into Fetcher and able to detect the language of HTML pages using the CLD2 language detector
 - [Generator2](src/java/org/apache/nutch/crawl/Generator2.java): alternative implementation of Generator
   - allowing to combine per-domain and per-host limits and
   - optimized to create many (eg. 100) segments in a single job
+- Unused plugins disabled in `build.xml`, to achieve a considerably more lightweight installation for our massively parallel setup.
 
 How to install additional requirements to build this fork of Nutch:
 - [crawler-commons](/crawler-commons/crawler-commons) development snapshot package:
From c9534b08f60a08ec42a8127a30f9a620035893af Mon Sep 17 00:00:00 2001
From: Hande Celikkanat <7702228+handecelikkanat@users.noreply.github.com>
Date: Wed, 30 Jul 2025 16:02:35 +0300
Subject: [PATCH 2/3] docs: Add link to https://github.com/commoncrawl/cc-nutch-example

---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index 43fd8685fc..2960c37804 100644
--- a/README.md
+++ b/README.md
@@ -32,6 +32,8 @@ How to install additional requirements to build this fork of Nutch:
   sudo apt install libcld2-0 libcld2-dev
   ```
 
+- An example for running this version can be found [here](https://github.com/commoncrawl/cc-nutch-example).
+
 Apache Nutch
 ============
 
From c24bd383cf4f996dfafec99964e2fb7597a03210 Mon Sep 17 00:00:00 2001
From: Hande Celikkanat <7702228+handecelikkanat@users.noreply.github.com>
Date: Wed, 30 Jul 2025 17:20:52 +0300
Subject: [PATCH 3/3] docs(README): Fix two more broken links

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 2960c37804..07af23b4d1 100644
--- a/README.md
+++ b/README.md
@@ -11,7 +11,7 @@ Notable additions in Common Crawl's fork of Nutch (not yet pushed to upstream Nu
 - Unused plugins disabled in `build.xml`, to achieve a considerably more lightweight installation for our massively parallel setup.
 
 How to install additional requirements to build this fork of Nutch:
-- [crawler-commons](/crawler-commons/crawler-commons) development snapshot package:
+- [crawler-commons](https://github.com/crawler-commons/crawler-commons) development snapshot package:
   ```
   git clone https://github.com/crawler-commons/crawler-commons.git
   cd crawler-commons/
@@ -21,7 +21,7 @@ How to install additional requirements to build this fork of Nutch:
   ```
   wget https://publicsuffix.org/list/public_suffix_list.dat -O conf/effective_tld_names.dat
   ```
-- [Java wrapper for CLD2 language detection](/commoncrawl/language-detection-cld2)
+- [Java wrapper for CLD2 language detection](https://github.com/commoncrawl/language-detection-cld2)
   ```
   git clone https://github.com/commoncrawl/language-detection-cld2.git
   cd language-detection-cld2/