panicked when CC returns a non-200 #4

Description

@senecaso

I have been trying to run this, but after a few minutes Common Crawl appears to trigger some sort of rate limit and begins returning 403s for all requests. The code then panics in a tight loop while seemingly continuing to make requests, which exacerbates the issue by spamming the service with even more requests.

thread '<unnamed>' panicked at src/lib.rs:329:31:
called `Result::unwrap()` on an `Err` value: Custom { kind: InvalidInput, error: "invalid gzip header" }

This occurred with 128, 64, 32, 10, 5, and ultimately even 2 threads. It seems that only a single thread is acceptable.

Ideally, when the CC endpoint returns a non-200, it should be handled gracefully. In this case a 403 is being returned, so the code should probably acknowledge that status and either back off or abort. Similarly, if the endpoint returns a 429 (not sure if it does), a backoff should be employed to respect the service.
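
To make the suggestion concrete, here is a rough sketch of the kind of handling I have in mind. It is not based on the actual code in `src/lib.rs`: the names `Action`, `classify`, and `backoff_delay` are made up for illustration, and the 1 second base delay, 60 second cap, and the choice to treat 403 and 503 as retryable alongside 429 are assumptions drawn from the behaviour described above, not documented Common Crawl policy. It uses only the standard library, so the HTTP client would have to supply the status code.

```rust
use std::time::Duration;

/// What the caller should do after seeing an HTTP status code (hypothetical type).
#[derive(Debug, PartialEq)]
enum Action {
    /// 2xx: the body can be handed to the gzip decoder.
    Proceed,
    /// Likely rate limited or temporarily unavailable: sleep, then retry.
    RetryAfter(Duration),
    /// Any other status, or retries exhausted: return an error instead of panicking.
    Abort,
}

/// Exponential backoff: `base * 2^attempt`, capped at `max`.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // Saturate rather than overflow for very large attempt counts.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(max).min(max)
}

/// Map a status code and the zero-based attempt number to an action.
fn classify(status: u16, attempt: u32, max_attempts: u32) -> Action {
    match status {
        200..=299 => Action::Proceed,
        // Assumption: 403 is how the rate limit shows up here; 429 and 503
        // are the conventional "slow down" responses.
        403 | 429 | 503 if attempt < max_attempts => Action::RetryAfter(backoff_delay(
            attempt,
            Duration::from_secs(1),
            Duration::from_secs(60),
        )),
        _ => Action::Abort,
    }
}
```

Each worker thread would call `classify` before touching the response body, sleep for the returned duration on `RetryAfter`, and propagate an error on `Abort`. That way a 403 body never reaches the gzip decoder, and the `unwrap()` that currently panics is never hit.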
