Host Index Testing

The data in this location is related to the testing of our Host Index.

The Host Index is a dataset with one row per web host per crawl, combining crawl stats, status codes, languages, and bot defence data. It can be queried with AWS tools or downloaded.

It contains summary information from the crawl, the indexes, the Web Graph, and our raw crawler logs. You can use it directly from AWS using SQL tools such as Amazon Athena or DuckDB, or you can download it to your own disk (24 crawls x 7 gigabytes each, about 168 gigabytes in total).
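
To illustrate the "one row per web host per crawl" layout, here is a minimal sketch that uses Python's built-in sqlite3 module as a local stand-in for Athena or DuckDB. The table name, column names (crawl, host, fetch_total, fetch_200, fetch_4xx, primary_language), crawl labels, and sample values are all hypothetical and chosen for illustration only; the real Host Index schema may differ.

```python
import sqlite3

# Hypothetical sample rows: (crawl, host, fetch_total, fetch_200,
# fetch_4xx, primary_language). These are made-up values, not real data.
ROWS = [
    ("CRAWL-A", "example.com", 120, 100, 5, "eng"),
    ("CRAWL-B", "example.com", 150, 140, 2, "eng"),
    ("CRAWL-A", "example.org", 40, 10, 25, "deu"),
]


def build_demo_index():
    """Create an in-memory table shaped like one row per host per crawl."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE host_index (
            crawl TEXT,
            host TEXT,
            fetch_total INTEGER,
            fetch_200 INTEGER,
            fetch_4xx INTEGER,
            primary_language TEXT,
            PRIMARY KEY (crawl, host)  -- one row per host per crawl
        )
        """
    )
    conn.executemany(
        "INSERT INTO host_index VALUES (?, ?, ?, ?, ?, ?)", ROWS
    )
    return conn


def host_history(conn, host):
    """Return (crawl, fetch_200, fetch_total) for one host, ordered by crawl."""
    cur = conn.execute(
        "SELECT crawl, fetch_200, fetch_total "
        "FROM host_index WHERE host = ? ORDER BY crawl",
        (host,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = build_demo_index()
    for crawl, ok, total in host_history(conn, "example.com"):
        print(crawl, ok, total)
```

Because each row is keyed by crawl and host, following a single host over time is a simple filter on the host column. A similar SELECT should carry over to Athena or DuckDB once the real table location and column names are substituted.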

The code can be accessed in its GitHub repository.

For more information please see this related blog post.