WARC writer: review deduplication #47

@sebastian-nagel

Although every URL is guaranteed to appear only once in the fetch list, following HTTP redirects may still cause duplicates:

  • Deduplication of redirect targets in the Fetcher is done only lazily, using a deduplication cache of limited size.
  • The cache is bound to a single Fetcher map task, so it does not cover an entire job, let alone all 100 segments.
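
To illustrate why a size-limited, per-task cache lets duplicates through, here is a minimal sketch. It assumes least-recently-used eviction, which may not match the Fetcher's actual cache; the class name `BoundedSeenCache` and its method are hypothetical.

```python
from collections import OrderedDict


class BoundedSeenCache:
    """Remembers at most `max_size` recently seen URLs (LRU eviction).

    Hypothetical stand-in for a per-task deduplication cache.
    """

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self._seen: "OrderedDict[str, None]" = OrderedDict()

    def check_and_add(self, url: str) -> bool:
        """Return True if `url` is still in the cache, else remember it."""
        if url in self._seen:
            self._seen.move_to_end(url)  # mark as most recently used
            return True
        self._seen[url] = None
        if len(self._seen) > self.max_size:
            self._seen.popitem(last=False)  # evict the least recently used URL
        return False


# One cache per map task: the last "a" is not detected as a duplicate
# because its earlier entry has already been evicted.
cache = BoundedSeenCache(max_size=2)
results = [cache.check_and_add(u) for u in ["a", "b", "a", "c", "d", "a"]]
print(results)
```

A second task would start with its own empty cache, so a redirect target seen by both tasks would not be detected by either.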

To further reduce the number of duplicates, the WARC writer also performs deduplication by "recording" only one capture per unique URL (see ccc558a). This is done purely based on the URL. In most cases, the duplicates are two records with the same HTTP status, which are presumably true duplicates. However, some duplicates have different HTTP statuses. This could be, for example, a redirect from A to B to "grab" a cookie, followed by a redirect back to A. In such situations, we might want to record all three captures (two redirects, one true response).
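
A rough sketch of first-wins deduplication keyed on the URL alone, applied to the cookie redirect chain described above. This is not the WARC writer's code; the function name, the `(URL, status)` tuple shape, the example URLs, and the 302 statuses are assumptions for illustration.

```python
from typing import Iterable, List, Set, Tuple

Capture = Tuple[str, int]  # (URL, HTTP status); hypothetical record shape


def dedup_first_wins(captures: Iterable[Capture]) -> List[Capture]:
    """Keep only the first capture seen for each URL."""
    seen: Set[str] = set()
    kept: List[Capture] = []
    for url, status in captures:
        if url in seen:
            continue  # later captures of the same URL are dropped
        seen.add(url)
        kept.append((url, status))
    return kept


# A redirects to B (cookie), B redirects back to A, A finally answers 200.
cookie_chain = [
    ("https://example.com/a", 302),
    ("https://example.com/b", 302),
    ("https://example.com/a", 200),
]
kept = dedup_first_wins(cookie_chain)
print(kept)
```

In this sketch the successful response for A is the capture that gets dropped, because the redirect from A was seen first.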

Of course, changes over time could also be the reason. In this case, it might be a better strategy to select the "best" record, i.e. the successful capture. Currently, the first record is selected.
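
One possible reading of "best" can be sketched as follows, assuming a simple preference of 2xx over 3xx over everything else. The ranking, the function names, and the `(URL, status)` tuples are illustrative assumptions, not a proposal for the exact rule.

```python
from typing import Dict, Iterable, List, Tuple

Capture = Tuple[str, int]  # (URL, HTTP status); hypothetical record shape


def status_rank(status: int) -> int:
    """Lower is better. This ordering is an assumption for illustration."""
    if 200 <= status < 300:
        return 0
    if 300 <= status < 400:
        return 1
    return 2


def dedup_best(captures: Iterable[Capture]) -> List[Capture]:
    """Keep one capture per URL, preferring the best-ranked status."""
    best: Dict[str, Capture] = {}
    for url, status in captures:
        current = best.get(url)
        # Strict "<" keeps the earlier capture when ranks are equal.
        if current is None or status_rank(status) < status_rank(current[1]):
            best[url] = (url, status)
    return list(best.values())


selected = dedup_best([("a", 429), ("b", 404), ("a", 200), ("b", 404)])
print(selected)
```

Unlike first-wins selection, this sketch has to see all captures of a URL before it can decide which one to keep.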

Below, the HTTP status combinations observed for the same URL, extracted from WARC writer logs and counted:

frequency   status counts
68461       2 x 200
6686        2 x 301
4253        2 x 404
2535        2 x 302
1751        3 x 200
1160        2 x 403
676         4 x 200
579         1 x 200
485         1 x 200, 1 x 429
474         2 x 307
424         1 x 200, 1 x 304
302         3 x 301
236         2 x 308
227         3 x 403
192         1 x 200, 1 x 302
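
A count like the one above could be produced roughly as follows. The sketch assumes the log lines have already been parsed into `(URL, status)` pairs; the actual log format and the way this table was generated are not shown here, and the function name is hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def status_combinations(captures: Iterable[Tuple[str, int]]) -> Counter:
    """Count how often each per-URL combination of HTTP statuses occurs."""
    per_url: Dict[str, Counter] = defaultdict(Counter)
    for url, status in captures:
        per_url[url][status] += 1

    combos: Counter = Counter()
    for statuses in per_url.values():
        # e.g. {200: 1, 429: 1} becomes "1 x 200, 1 x 429"
        label = ", ".join(f"{n} x {s}" for s, n in sorted(statuses.items()))
        combos[label] += 1
    return combos


# Made-up sample input, not data from the logs.
sample = [
    ("a", 200), ("a", 200),
    ("b", 200), ("b", 429),
    ("c", 301), ("c", 301),
    ("d", 200), ("d", 200),
]
combos = status_combinations(sample)
for label, freq in combos.most_common():
    print(f"{freq}\t{label}")
```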
