I have been trying to run this, but after a few minutes Common Crawl appears to trigger some sort of rate limit and begins returning 403s for all requests. The code then panics in a tight loop while seemingly continuing to make requests, which exacerbates the issue by spamming the service with even more requests.
thread '<unnamed>' panicked at src/lib.rs:329:31:
called `Result::unwrap()` on an `Err` value: Custom { kind: InvalidInput, error: "invalid gzip header" }
This occurred with 128, 64, 32, 10, 5, and ultimately even 2 threads. It seems that only a single thread is acceptable.
Ideally, when the CC endpoint returns a non-200 status, it should be handled gracefully. In this case a 403 is being returned, so the code should probably acknowledge that status and either back off or abort. Similarly, if the endpoint returns a 429 (not sure if it does), a backoff should be employed to respect the service.
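To make the suggestion concrete, here is a minimal sketch of the decision logic I have in mind. It is not based on the crate's actual code: `Action`, `classify`, `backoff_delay`, and the three constants are hypothetical names and values, and treating 403 as retryable is an assumption drawn only from the rate-limit-like behaviour described above. It uses only the standard library so it does not depend on whichever HTTP client the crate uses.

```rust
use std::time::Duration;

/// What the caller should do after receiving an HTTP response (hypothetical type).
#[derive(Debug, PartialEq)]
enum Action {
    /// 2xx: the body can be handed to the gzip decoder.
    Proceed,
    /// Throttled or transient server error: sleep for this long, then retry.
    RetryAfter(Duration),
    /// Non-retryable status or retries exhausted: stop making requests.
    Abort,
}

// Illustrative values only; not taken from the crate or from Common Crawl.
const BASE_DELAY_MS: u64 = 1_000;
const MAX_DELAY_MS: u64 = 60_000;
const MAX_ATTEMPTS: u32 = 6;

/// Exponential backoff: BASE_DELAY_MS * 2^attempt, capped at MAX_DELAY_MS.
fn backoff_delay(attempt: u32) -> Duration {
    // checked_shl returns None once the shift would exceed the width of u64.
    let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    Duration::from_millis(BASE_DELAY_MS.saturating_mul(factor).min(MAX_DELAY_MS))
}

/// Map a status code and a zero-based attempt counter to an action.
fn classify(status: u16, attempt: u32) -> Action {
    match status {
        200..=299 => Action::Proceed,
        // Assumption: 403 here signals throttling, so it is retried like 429 and 5xx.
        403 | 429 | 500..=599 if attempt < MAX_ATTEMPTS => {
            Action::RetryAfter(backoff_delay(attempt))
        }
        _ => Action::Abort,
    }
}
```

The request loop would call `classify` with the response status before touching the body, sleep on `RetryAfter`, and return an error on `Abort` instead of calling `unwrap()` on the gzip decoder. Adding random jitter to the delay would likely help when several threads are throttled at the same time.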