4 changes: 2 additions & 2 deletions Dockerfile
@@ -1,4 +1,4 @@
FROM storm:2.8.4
FROM storm:2.8.8

RUN apt-get update -qq && \
apt-get install -yq --no-install-recommends \
@@ -10,7 +10,7 @@ RUN apt-get update -qq && \
#
# news-crawler
#
ENV CRAWLER_VERSION=3.5.1
ENV CRAWLER_VERSION=3.6.0

RUN mkdir /news-crawler/ && \
mkdir /news-crawler/conf/ && \
14 changes: 7 additions & 7 deletions README.md
@@ -5,8 +5,8 @@ Crawler for news based on [StormCrawler](https://stormcrawler.apache.org/). Prod

## Prerequisites

* Install OpenSearch 2.19.4
* Install Apache Storm 2.8.4
* Install OpenSearch 2.19.5
* Install Apache Storm 2.8.8
* Start OpenSearch and Storm
* Create the OpenSearch indices by running [bin/OS_IndexInit.sh](bin/OS_IndexInit.sh) and the dashboards by [OS_ImportDashboards.sh](bin/OS_ImportDashboards.sh)

@@ -32,14 +32,14 @@ mvn clean package

And run ...
``` sh
storm local target/crawler-3.5.1.jar --local-ttl 60 -- org.commoncrawl.stormcrawler.news.CrawlTopology -conf $PWD/conf/opensearch-conf.yaml -conf $PWD/conf/crawler-conf.yaml $PWD/seeds/ feeds.txt
storm local target/crawler-3.6.0.jar --local-ttl 60 -- org.commoncrawl.stormcrawler.news.CrawlTopology -conf $PWD/conf/opensearch-conf.yaml -conf $PWD/conf/crawler-conf.yaml $PWD/seeds/ feeds.txt
```

This will launch the crawl topology in local mode for 60 seconds. It will also "inject" all URLs found in the file `./seeds/feeds.txt` in the status index. The URLs point to news feeds and sitemaps from which links to news articles are extracted and fetched. The topology will create WARC files in the directory specified in the configuration under the key `warc.dir`. This directory must be created beforehand.
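
Since the directory under `warc.dir` has to exist before the topology starts, a small helper along these lines could create it from the configuration. This is only a sketch and not part of the change: the function names `warc_dir_from_conf` and `ensure_warc_dir` are hypothetical, and it assumes `warc.dir` is written on a single line as `warc.dir: <path>` (optionally quoted), which a real YAML parser would handle more robustly.

```sh
#!/usr/bin/env bash
# Sketch only: create the WARC output directory named in the crawler config.
# Assumption: the key appears on one line as `warc.dir: <path>`, optionally quoted.

# Print the value of warc.dir found in the given config file (first match only).
warc_dir_from_conf() {
    sed -n -e 's/[[:space:]]*$//' \
           -e 's/^[[:space:]]*warc\.dir:[[:space:]]*//p' "$1" \
        | tr -d "\"'" | head -n 1
}

# Create the directory, or fail if the key is missing or the file is unreadable.
ensure_warc_dir() {
    local dir
    dir="$(warc_dir_from_conf "$1")"
    if [ -z "$dir" ]; then
        echo "warc.dir not found in $1" >&2
        return 1
    fi
    mkdir -p "$dir"
}
```

Calling `ensure_warc_dir conf/crawler-conf.yaml` before `storm local ...` would then cover the "created beforehand" requirement, provided the assumption about the config layout holds.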

Of course, it's also possible to add (or remove) the seeds (feeds and sitemaps) using the Elasticsearch API. In this case, the topology can be run without the last two arguments.
Of course, it's also possible to add (or remove) the seeds (feeds and sitemaps) using the OpenSearch API. In this case, the topology can be run without the last two arguments.

Alternatively, the topology can be run from the [crawler.flux](./conf/crawler.flux), please see the [Storm Flux documentation](https://storm.apache.org/releases/2.8.4/flux.html). Make sure to adapt the Flux definition to your needs!
Alternatively, the topology can be run from the [crawler.flux](./conf/crawler.flux), please see the [Storm Flux documentation](https://storm.apache.org/releases/2.8.8/flux.html). Make sure to adapt the Flux definition to your needs!

In production, you should use `storm jar ...` to run the topology in distributed mode and continuously (no time limit) including the Storm UI and logging.

@@ -88,7 +88,7 @@ NOTE:
- Make sure that the OpenSearch port 9200 is not already in use or mapped by a running OpenSearch instance. Otherwise OpenSearch commands may affect the running instance!


To launch the topology using [Storm Flux](https://storm.apache.org/releases/2.8.4/flux.html):
To launch the topology using [Storm Flux](https://storm.apache.org/releases/2.8.8/flux.html):
```
docker compose run --rm news-crawler \
storm jar lib/crawler.jar org.apache.storm.flux.Flux --remote /news-crawler/conf/crawler.flux
@@ -104,7 +104,7 @@ After 1-2 minutes if everything is up, connect to OpenSearch on port [9200](http

For inspecting the worker log files:
```
docker exec storm-supervisor /bin/bash -c 'cat /logs/workers-artifacts/*/*/worker.log'
docker exec storm-supervisor-news-crawl /bin/bash -c 'cat /logs/workers-artifacts/*/*/worker.log'
```

To stop the topology:
8 changes: 4 additions & 4 deletions bin/status
@@ -5,14 +5,14 @@ __ES_STATUS_URL_DEFAULT='http://localhost:9200/status'
function ____show_help() {
echo "$0 [-v|-V] [-C] <command> [<args>]"
echo
echo "Query StormCrawler's Elasticsearch status index"
echo "Query StormCrawler's Elasticsearch or OpenSearch status index"
echo " with help of curl, jq and bash"
echo
echo "Global options"
echo " -h show detailed help"
echo " -v verbose, print commands before execution"
echo " -V very verbose"
echo " -D dry run, do not execute request to ES (use in combination with -v)"
echo " -D dry run, do not execute request (use in combination with -v)"
echo " -C colorize JSON output"
echo
echo "Commands"
@@ -134,12 +134,12 @@ ES_STATUS_URL=${ES_STATUS_URL:-$__ES_STATUS_URL_DEFAULT}
set -e


# current time in Elasticsearch date format
# current time in Elasticsearch/OpenSearch date format
function ____now () {
date -u '+%Y-%m-%dT%H:%M:%S.000Z'
}

# given date in Elasticsearch date format
# given date in Elasticsearch/OpenSearch date format
function ____date () {
date -d"$1" -u '+%Y-%m-%dT%H:%M:%S.000Z'
}
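
For readers unfamiliar with the timestamp format these two helpers emit, the following sketch (not part of the diff) shows what they produce. The names `now_utc` and `to_utc` are hypothetical stand-ins for `____now` and `____date`, and the sketch assumes GNU `date`, since the `-d` option behaves differently on BSD/macOS.

```sh
#!/usr/bin/env bash
# Sketch only: stand-ins for ____now and ____date from bin/status.
# Assumption: GNU date is installed (the -d option is GNU-specific).

# Current time in UTC; the millisecond part is fixed to .000
now_utc() {
    date -u '+%Y-%m-%dT%H:%M:%S.000Z'
}

# Any date string GNU date understands, converted to UTC in the same format
to_utc() {
    date -d"$1" -u '+%Y-%m-%dT%H:%M:%S.000Z'
}

to_utc '2024-01-02 03:04:05 UTC'     # prints 2024-01-02T03:04:05.000Z
to_utc '2024-01-02 05:04:05 +0200'   # prints 2024-01-02T03:04:05.000Z
```

Both calls print the same instant because the input is normalized to UTC before formatting, which is what lets the script compare its timestamps with date values stored in the status index.
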
7 changes: 7 additions & 0 deletions conf/crawler-conf.yaml
@@ -77,6 +77,13 @@ config:
http.protocol.implementation: org.apache.stormcrawler.protocol.okhttp.HttpProtocol
https.protocol.implementation: org.apache.stormcrawler.protocol.okhttp.HttpProtocol

# the http/https protocol versions to use, in order of preference
# - the WARC writer handles HTTP/1.1 and HTTP/2 (cf. storm-crawler#1010)
# - okhttp does not support HTTP/1.0 requests (it supports responses however)
# http.protocol.versions:
# - "h2"
# - "http/1.1"

# do not fail on unknown SSL certificates
http.trust.everything: true

2 changes: 1 addition & 1 deletion conf/crawler.flux
@@ -45,7 +45,7 @@ components:
- name: "put"
args:
- "software"
- "StormCrawler 2.10 https://stormcrawler.net/"
- "StormCrawler 3.6.0 https://stormcrawler.apache.org/"
- name: "put"
args:
- "description"
26 changes: 13 additions & 13 deletions docker-compose.yaml
@@ -17,14 +17,14 @@ services:
# Apache Storm components
# - Zookeeper coordinates the communication between Nimbus and the Supervisors
zookeeper:
image: zookeeper:${ZOOKEEPER_VERSION:-3.9.3}
container_name: zookeeper
image: zookeeper:${ZOOKEEPER_VERSION:-3.9.4}
container_name: zookeeper-news-crawl
restart: always

# - the daemon Nimbus runs on the master node
storm-nimbus:
image: storm:${STORM_VERSION:-2.8.4}
container_name: storm-nimbus
image: storm:${STORM_VERSION:-2.8.8}
container_name: storm-nimbus-news-crawl
hostname: nimbus
command: storm nimbus
depends_on:
@@ -37,8 +37,8 @@ services:

# - the Supervisors run on the worker nodes
storm-supervisor:
image: storm:${STORM_VERSION:-2.8.4}
container_name: storm-supervisor
image: storm:${STORM_VERSION:-2.8.8}
container_name: storm-supervisor-news-crawl
command: storm supervisor -c worker.childopts=-Xmx%HEAP-MEM%m
depends_on:
- zookeeper
@@ -50,7 +50,7 @@ services:
# which need to be able to access
# - (in case an indexing topology is run) the
# OpenSearch (http://opensearch:9200/) and
- opensearch-news-crawl
- opensearch
# - the WARC output folder
# - and the seed folder
volumes:
@@ -60,8 +60,8 @@ services:

# - the Storm UI provides diagnostics about the Storm cluster
storm-ui:
image: storm:${STORM_VERSION:-2.8.4}
container_name: storm-ui
image: storm:${STORM_VERSION:-2.8.8}
container_name: storm-ui-news-crawl
command: storm ui
depends_on:
- storm-nimbus
@@ -71,8 +71,8 @@ services:
- "127.0.0.1:8080:8080"
restart: always

opensearch-news-crawl:
image: opensearchproject/opensearch:${OPENSEARCH_VERSION:-2.19.4}
opensearch:
image: opensearchproject/opensearch:${OPENSEARCH_VERSION:-2.19.5}
container_name: opensearch-news-crawl
environment:
- cluster.name=opensearch-news-crawl-cluster
@@ -94,8 +94,8 @@ services:
ports:
- "127.0.0.1:9200:9200" # REST API

opensearch-dashboard-news-crawl:
image: opensearchproject/opensearch-dashboards:${OPENSEARCH_VERSION:-2.19.4}
opensearch-dashboard:
image: opensearchproject/opensearch-dashboards:${OPENSEARCH_VERSION:-2.19.5}
container_name: opensearch-dashboard-news-crawl
ports:
- "127.0.0.1:5601:5601"
19 changes: 4 additions & 15 deletions pom.xml
@@ -26,7 +26,7 @@ under the License.
<modelVersion>4.0.0</modelVersion>
<groupId>org.commoncrawl.stormcrawler.news</groupId>
<artifactId>crawler</artifactId>
<version>3.5.1</version>
<version>3.6.0</version>
<packaging>jar</packaging>
<licenses>
<license>
@@ -39,10 +39,10 @@ under the License.

<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<stormcrawler.version>3.5.1</stormcrawler.version>
<storm.version>2.8.4</storm.version>
<stormcrawler.version>3.6.0</stormcrawler.version>
<storm.version>2.8.8</storm.version>
<aws.version>1.12.797</aws.version>
<jackson.version>2.18.1</jackson.version>
<jackson.version>2.21.3</jackson.version>
<crawler-commons.version>1.6</crawler-commons.version>
<mockito.version>5.23.0</mockito.version>
<wiremock.version>3.0.1</wiremock.version>
@@ -171,17 +171,6 @@ under the License.
<version>${aws.version}</version>
</dependency>

<!-- set version explicitly -->
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-annotations</artifactId>
<version>${jackson.version}</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-core</artifactId>
<version>${jackson.version}</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
@@ -44,7 +44,7 @@
import org.slf4j.LoggerFactory;

/**
* Dummy topology to play with the spouts and bolts on ElasticSearch
* Dummy topology to play with the spouts and bolts on OpenSearch
*/
public class CrawlTopology extends ConfigurableTopology {
