Web Archives for Social Sciences Datathon

Bristol Digital Futures Institute, University of Bristol, November 2025

On 27–28 November 2025, the Web Archives for Social Sciences Datathon was held at the University of Bristol. The event was a collaboration between the Common Crawl Foundation and the UK Web Archive at the British Library, and took place at the BDFI Neutral Lab.

This two-day event aimed to build capacity in the social science research community to use large-scale Web Archive data for policy-relevant, socio-economic research. Participants worked in teams with curated data extracts from Common Crawl to address real-world research challenges, supported by expert facilitators.

This Datathon was part of the Atlas of Economic Activities project and was funded by SDR UK.

Facilitators: Emmanouil Tranos, Leonardo Castro Gonzalez, Jon Reades, Laurie Burchell and Thom Vaughan.

Problems

Problem 1: This is a cache of web data containing all commercial websites (.co.uk, landing webpages only) archived by Common Crawl (2021) that (1) include at least one UK postcode in their web text, and (2) we believe represent Financial Services. Identify the different sub-classes (industries) within Financial Services and highlight the websites that provide Fintech services.

Problem 2: This is a cache of web data containing all commercial websites (.co.uk, landing webpages only) archived by Common Crawl (2021) that (1) include at least one UK postcode in their web text, and (2) we believe represent the Creative Industries. Identify the different sub-classes (industries) within the Creative Industries and highlight the CreaTech websites.

Problem 3: This is a cache of web data containing all commercial websites (.co.uk, landing webpages only) archived by Common Crawl (all of 2021 and 2024) that include at least one postcode from Manchester or Birmingham in their web text. The 2021 commercial websites are classified by economic activity, while the 2024 websites are not. Use the 2021 data to classify the 2024 data for both cities. Compare the industrial structure of these cities and analyse how it evolved over time. The data are provided as four files: Manchester for 2021 and 2024, and Birmingham for 2021 and 2024.

Problem 4: This is a cache of web data (provided in two parts) containing all the UK governmental webpages (.gov.uk) that were archived by Common Crawl between February and March 2024. Identify the key policy areas that Local Authorities in the UK focus on. Are there any Local Authorities that have implemented distinct policies or actions?

Problem 5: This is a cache of web data containing all the UK governmental webpages (.gov.uk) that were archived by Common Crawl at two points in time: February to March 2024 (the same data as Problem 4) and October 2025. Identify changes in a specific policy domain associated with the new government elected in July 2024.
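As a starting point for the third problem above (using the labelled 2021 websites to classify the unlabelled 2024 ones), the following is a minimal baseline sketch, not the method used by the organisers or by any team. It builds one bag-of-words centroid per label from 2021 text and assigns each 2024 text to the most similar centroid by cosine similarity. The function names (`fit_centroids`, `predict`) and the choice of which text and label columns to feed in are assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Lowercase alphabetic tokens of three or more letters (an arbitrary choice).
TOKEN = re.compile(r"[a-z]{3,}")


def tokens(text):
    """Return a term-frequency Counter for one piece of web text."""
    return Counter(TOKEN.findall(text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fit_centroids(labelled):
    """Sum term frequencies per label.

    `labelled` is an iterable of (text, label) pairs, e.g. taken from
    the 2021 files. Cosine similarity ignores vector length, so summed
    counts behave like a mean vector here.
    """
    centroids = defaultdict(Counter)
    for text, label in labelled:
        centroids[label].update(tokens(text))
    return dict(centroids)


def predict(text, centroids):
    """Return the label whose centroid is most similar to `text`."""
    vec = tokens(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))
```

A baseline like this mainly gives a reference point: embedding-based or LLM-based classifiers can be compared against it on a held-out share of the 2021 data before being applied to 2024.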

Data files and checksums

File	Size	MD5	SHA-1
problem1_2021_finance.csv	77 MiB	`a3da6824dd12be8ea5e3869c2395c465`	`053a9dc490d785761fd1c91708749df38466ce30`
problem2_2021_creative.csv	331 MiB	`16245a7181d53534060a5df43899b2ab`	`32ab95d43d6a1986c0035c4e883306dee8e40f85`
problem3_2021_manchester.csv	91 MiB	`3c14a900d2da78d19cb768f10ccb1c1b`	`9a2461770784d072f2e46cde9b60d72a69e1f5fb`
problem3_2024_manchester.csv	134 MiB	`2653f8d66f6fbc8a7c2e09c769e8c834`	`52ebb928a911e8786a468d9c66652da78cb89379`
problem3_2021_birmingham.csv	201 MiB	`07620c9a9a8c25482d6378dcfbc1d024`	`9417b54c37f45ff182103fe6f04ef0d2376bb034`
problem3_2024_birmingham.csv	124 MiB	`bf789a192328d610740f2e42479354ec`	`0a85f8b7220fef7e5f1f999bd717023cf3f3092c`
problem4_govuk_split1.zip	1.0 GiB	`c1af269f1fca80218bf47d57cd4e8516`	`f152de705dbab874c449f789e7c4dbd3a7773f4f`
problem4_govuk_split2.zip	1.0 GiB	`3c6d44de8273d2a8af15e179a285797a`	`8d016a4a8aeaf2bf2eb7e4d9cdca206ac9ccf986`
problem5_govuk.zip	1.5 GiB	`767d19825d4533017f80a5d7fe6da133`	`74750f8dd7f0be4ce4f65743a2bcbb414e47213e`
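The checksums in the table above can be checked after download. The sketch below computes MD5 and SHA-1 in a single pass using Python's standard `hashlib` module; the helper names `file_digests` and `verify` are illustrative, and the example assumes the file has been saved to the current directory under the name shown in the table.

```python
import hashlib
from pathlib import Path


def file_digests(path, chunk_size=1 << 20):
    """Return (md5_hex, sha1_hex) for a file, reading it in 1 MiB chunks."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()


def verify(path, expected_md5, expected_sha1):
    """True only if both digests match the published values."""
    md5_hex, sha1_hex = file_digests(path)
    return md5_hex == expected_md5.lower() and sha1_hex == expected_sha1.lower()


if __name__ == "__main__":
    # Expected values copied from the table; the local path is an assumption.
    target = Path("problem1_2021_finance.csv")
    if target.exists():
        ok = verify(
            target,
            "a3da6824dd12be8ea5e3869c2395c465",
            "053a9dc490d785761fd1c91708749df38466ce30",
        )
        print(f"{target}: {'OK' if ok else 'MISMATCH'}")
    else:
        print(f"{target} not found; download it first")
```

Reading in chunks keeps memory use flat, which matters for the larger archives of 1 GiB and above.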

Data description

You will find the following columns in the different data packages. Please be aware that not all columns are present in all data packages.

id : an identifier, not linked to any external data
content : the web text from the landing page of each website
summary : an LLM-generated summary of the web text
explanation : an LLM-generated explanation of how the summary was produced
clean_content : LLM-cleaned web text from the landing page of each website
url : the website URL
parent_url : the website domain
postcodes : UK postcodes found in the web text
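Because not every column appears in every package, a loader can keep only the columns that are actually present. The sketch below uses Python's standard `csv` module and assumes the CSV files are comma-separated, UTF-8 encoded and start with a header row; none of this is confirmed here, so adjust as needed. The name `read_rows` is illustrative.

```python
import csv
import sys

# Column names taken from the list above.
KNOWN_COLUMNS = (
    "id", "content", "summary", "explanation",
    "clean_content", "url", "parent_url", "postcodes",
)


def read_rows(path, wanted=KNOWN_COLUMNS):
    """Yield one dict per row, restricted to the wanted columns that exist."""
    # Full web text can exceed the csv module's default field size limit,
    # so raise it. The cap avoids an OverflowError on some platforms.
    csv.field_size_limit(min(sys.maxsize, 2**31 - 1))
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        present = [c for c in wanted if c in (reader.fieldnames or [])]
        for row in reader:
            yield {c: row[c] for c in present}
```

For example, `next(read_rows("problem1_2021_finance.csv"))` would return the first row as a dictionary, which is a quick way to see which columns a given package contains.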

The remaining fields were produced by our TnT-LLM-inspired pipeline, which uses LLMs to classify websites into a typology of economic activities. The typology currently has two levels: high-level clusters and low-level clusters. These fields are:

partition_id : high-level cluster ID (roughly equivalent to a sector)
label_id : low-level cluster ID (roughly equivalent to an industry)
label_name : low-level cluster name
label_description : low-level cluster description
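One simple way to describe industrial structure from these fields is the share of websites per low-level cluster, and the change in those shares between two datasets (for example, two years or two cities). The following is a minimal sketch under that assumption; `label_shares` and `share_change` are hypothetical helper names, and rows are assumed to be dictionaries keyed by column name.

```python
from collections import Counter


def label_shares(rows, key="label_name"):
    """Return {label: share of rows}, ignoring rows with a missing or empty label."""
    counts = Counter(row[key] for row in rows if row.get(key))
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


def share_change(before, after):
    """Return {label: after share minus before share} over all labels in either input."""
    labels = set(before) | set(after)
    return {
        label: after.get(label, 0.0) - before.get(label, 0.0)
        for label in labels
    }
```

Passing `key="partition_id"` gives the same breakdown at the high-level (sector-like) cluster level.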

Results

The results can be found in the following GitHub repositories in order of the problems:

You can download an archive of the presentations from the event here.

Contributors

Aditi Dutta
Camilo Andrés López Barra
Céline Van Migerode
Christina Palantza
Do Ngoc Thao
Esha Sadia Nasir
Fanqi Zeng
Filippo Dionigi
Gabriel A. Pierzynski
Giovanni Maria Pala
Helena Byrne
James Thomas
Jia Zhao
Jo Kent
Kelly Yubini Yubini
Mariam Cook
Meihui He
Meng Le Zhang
Nirat Rujimora
Nora Ramsey
Paddy Smith
Rita Rasteiro
Thomas Carey-Wilson
Timothy Monteath
Wander Demuynck
Wong E. Chern