Allow to follow news sites not providing RSS/Atom feed or news sitemap #41

The news crawler (as of now) relies exclusively on [RSS](https://en.wikipedia.org/wiki/RSS)/[Atom](https://en.wikipedia.org/wiki/Atom_(Web_standard)) feeds and [news sitemaps](https://en.wikipedia.org/wiki/Sitemaps#Google_News_Sitemaps) to find links to news articles. However, some news sites provide neither feeds nor sitemaps. To follow these sites, the crawler should be able to monitor HTML pages manually marked as seeds and extract links from them:
- add a parser class to the topology which
  - exclusively parses URLs marked as verified HTML seeds (e.g. by a metadata key `isHtmlSeed`)
  - extracts links from the HTML and sends them to the status index as DISCOVERED
  - (optionally) filters outlinks, e.g. by same host or domain, or by configurable URL patterns stored in the status index for the HTML seed
- the (adaptive) scheduler must be configured to schedule the refetch of HTML seeds
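
The parsing steps above could be sketched roughly as follows. This is a language-neutral illustration in plain Python using only the standard library, not code for the crawler's actual topology. The function name `extract_outlinks`, the representation of metadata as a dict, the value `"true"` for `isHtmlSeed`, and the `url_patterns` argument are all assumptions made for the example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class _AnchorCollector(HTMLParser):
    """Collects the raw href values of all <a> elements."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_outlinks(seed_url, html, metadata, url_patterns=()):
    """Return (url, status) pairs for links found on an HTML seed page.

    Hypothetical helper: `metadata` is a plain dict and `url_patterns`
    a list of regular expressions assumed to be stored with the seed.
    """
    # Only pages explicitly marked as HTML seeds are parsed.
    if metadata.get("isHtmlSeed") != "true":
        return []

    collector = _AnchorCollector()
    collector.feed(html)

    seed_host = urlsplit(seed_url).hostname
    compiled = [re.compile(p) for p in url_patterns]
    seen = set()
    outlinks = []
    for href in collector.hrefs:
        # Resolve relative links against the seed and drop fragments.
        url, _fragment = urldefrag(urljoin(seed_url, href))
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            continue
        # Optional filter 1: keep only links on the seed's host.
        if parts.hostname != seed_host:
            continue
        # Optional filter 2: keep only links matching a configured pattern.
        if compiled and not any(p.search(url) for p in compiled):
            continue
        if url not in seen:
            seen.add(url)
            outlinks.append((url, "DISCOVERED"))
    return outlinks
```

In this sketch the same-host check is always applied and the pattern filter only when patterns are given; a real implementation would likely make both configurable per seed and compare registered domains rather than exact host names.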
