On 27–28 November 2025, a Web Archives for Social Sciences Datathon was held at the University of Bristol. The event was a collaboration between the Common Crawl Foundation and the UK Web Archive at The British Library, and took place at the BDFI Neutral Lab.
This two-day event aimed to build capacity in the social science research community to use large-scale Web Archive data for policy-relevant, socio-economic research. Participants worked in teams with curated data extracts from Common Crawl to address real-world research challenges, supported by expert facilitators.
This Datathon was part of the Atlas of Economic Activities project and was funded by SDR UK.
Facilitators: Emmanouil Tranos, Leonardo Castro Gonzalez, Jon Reades, Laurie Burchell and Thom Vaughan.
| File | Size | MD5 | SHA-1 |
|---|---|---|---|
| problem1_2021_finance.csv | 77 MiB | a3da6824dd12be8ea5e3869c2395c465 | 053a9dc490d785761fd1c91708749df38466ce30 |
| problem2_2021_creative.csv | 331 MiB | 16245a7181d53534060a5df43899b2ab | 32ab95d43d6a1986c0035c4e883306dee8e40f85 |
| problem3_2021_manchester.csv | 91 MiB | 3c14a900d2da78d19cb768f10ccb1c1b | 9a2461770784d072f2e46cde9b60d72a69e1f5fb |
| problem3_2024_manchester.csv | 134 MiB | 2653f8d66f6fbc8a7c2e09c769e8c834 | 52ebb928a911e8786a468d9c66652da78cb89379 |
| problem3_2021_birmingham.csv | 201 MiB | 07620c9a9a8c25482d6378dcfbc1d024 | 9417b54c37f45ff182103fe6f04ef0d2376bb034 |
| problem3_2024_birmingham.csv | 124 MiB | bf789a192328d610740f2e42479354ec | 0a85f8b7220fef7e5f1f999bd717023cf3f3092c |
| problem4_govuk_split1.zip | 1.0 GiB | c1af269f1fca80218bf47d57cd4e8516 | f152de705dbab874c449f789e7c4dbd3a7773f4f |
| problem4_govuk_split2.zip | 1.0 GiB | 3c6d44de8273d2a8af15e179a285797a | 8d016a4a8aeaf2bf2eb7e4d9cdca206ac9ccf986 |
| problem5_govuk.zip | 1.5 GiB | 767d19825d4533017f80a5d7fe6da133 | 74750f8dd7f0be4ce4f65743a2bcbb414e47213e |
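The checksums listed for each file can be used to confirm that a download is complete and uncorrupted. The following is a minimal sketch using only the Python standard library. The function names `file_digests` and `verify_download` are illustrative, not part of any datathon tooling, and the sketch assumes the file has already been saved locally.

```python
import hashlib
from pathlib import Path


def file_digests(path, chunk_size=1024 * 1024):
    """Return (md5_hex, sha1_hex) for the file at `path`."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with Path(path).open("rb") as fh:
        # Read in fixed-size chunks so multi-GiB archives need not fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()


def verify_download(path, expected_md5, expected_sha1):
    """Return True only if both digests match the published values."""
    md5_hex, sha1_hex = file_digests(path)
    # Compare case-insensitively; hexdigest() always returns lowercase.
    return md5_hex == expected_md5.lower() and sha1_hex == expected_sha1.lower()
```

For example, assuming `problem1_2021_finance.csv` sits in the working directory, `verify_download("problem1_2021_finance.csv", "a3da6824dd12be8ea5e3869c2395c465", "053a9dc490d785761fd1c91708749df38466ce30")` should return `True` for an intact copy.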
You will find the following columns in the different data packages. Please be aware that not all columns are present in all data packages.
id
: an ID, not linked to any external data
content
: the web text from the landing page of each website
summary
: an LLM-generated summary of the web text
explanation
: an LLM-generated explanation of how the summary was produced
clean_content
: LLM-cleaned web text from the landing page of each website
url
: the website URL
parent_url
: the website domain
postcodes
: UK postcodes found in the web text
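Because not every column appears in every data package, it can help to inspect a file's header before loading it in full. The sketch below reads only the first row of a CSV and reports which of the columns listed above are present. It assumes the files are comma-delimited, UTF-8 encoded and start with a header row, none of which is stated on this page. The names `DOCUMENTED_COLUMNS` and `split_columns` are illustrative.

```python
import csv

# Columns described above; a given file may carry only some of them.
DOCUMENTED_COLUMNS = [
    "id", "content", "summary", "explanation",
    "clean_content", "url", "parent_url", "postcodes",
]


def split_columns(csv_path, expected=DOCUMENTED_COLUMNS):
    """Return (present, missing) lists of expected columns, based on the header row."""
    # "utf-8-sig" also reads plain UTF-8 and drops a byte-order mark if one exists.
    with open(csv_path, newline="", encoding="utf-8-sig") as fh:
        header = next(csv.reader(fh), [])
    present = [c for c in expected if c in header]
    missing = [c for c in expected if c not in header]
    return present, missing
```

Calling `split_columns("problem1_2021_finance.csv")` returns two lists, so an analysis script can skip steps that depend on a column the file does not have.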
The remaining fields were produced by our TNT-LLM-inspired pipeline, which uses LLMs to classify websites into a typology of economic activities. For now the typology has two levels: high-level clusters and low-level clusters. These fields are:
partition_id
: high-level cluster ID (imagine something like sector)
label_id
: low-level cluster ID (imagine something like industry)
label_name
: low-level cluster name (imagine something like industry)
label_description
: low-level cluster description (imagine something like industry)
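As a first look at the typology, one might count how many websites fall into each low-level cluster within each high-level cluster. The following is a rough sketch using only the standard library. It assumes a comma-delimited, UTF-8 CSV with a header row that includes `partition_id` and `label_name`. The function name `count_by_cluster` is illustrative.

```python
import csv
from collections import Counter

# Web text fields can exceed the csv module's default field size limit,
# so raise it as a precaution (this value is safe on all platforms).
csv.field_size_limit(2**31 - 1)


def count_by_cluster(csv_path):
    """Count rows per (partition_id, label_name) pair."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8-sig") as fh:
        reader = csv.DictReader(fh)
        missing = {"partition_id", "label_name"} - set(reader.fieldnames or [])
        if missing:
            # Not all packages carry the classification fields.
            raise ValueError(f"missing columns: {sorted(missing)}")
        for row in reader:
            counts[(row["partition_id"], row["label_name"])] += 1
    return counts
```

`count_by_cluster(path).most_common(10)` would then list the ten largest clusters in a file. Streaming row by row keeps memory use low even for the larger extracts.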
The results can be found in the following GitHub repositories in order of the problems:
You can download an archive of the presentations from the event here.