Web Archives for Social Sciences Datathon

Bristol Digital Futures Institute, University of Bristol, November 2025

On 27–28 November 2025, a Web Archives for Social Sciences Datathon was held at the University of Bristol. It was organised in collaboration between the Common Crawl Foundation and the UK Web Archive at the British Library. The Datathon took place at the BDFI Neutral Lab.

This two-day event aimed to build capacity in the social science research community to use large-scale Web Archive data for policy-relevant, socio-economic research. Participants worked in teams with curated data extracts from Common Crawl to address real-world research challenges, supported by expert facilitators.

This Datathon was part of the Atlas of Economic Activities project and was funded by SDR UK.

Facilitators: Emmanouil Tranos, Leonardo Castro Gonzalez, Jon Reades, Laurie Burchell and Thom Vaughan.

Problems

  1. This is a cache of web data containing all commercial websites (.co.uk, landing webpages only) archived by Common Crawl in 2021 that (1) include at least one UK postcode in their web text and (2) we believe represent Financial Services. Identify the different sub-classes (industries) within Financial Services and highlight the websites that provide Fintech services.
  2. This is a cache of web data containing all commercial websites (.co.uk, landing webpages only) archived by Common Crawl in 2021 that (1) include at least one UK postcode in their web text and (2) we believe represent the Creative Industries. Identify the different sub-classes (industries) within the Creative Industries and highlight the CreaTech websites.
  3. This is a cache of web data containing all commercial websites (.co.uk, landing webpages only) archived by Common Crawl (all of 2021 and 2024) that include at least one postcode from Manchester or Birmingham in their web text. The 2021 commercial websites are classified by economic activity, while the 2024 websites are not. Use the 2021 data to classify the 2024 data for both cities. Compare the industrial structure of these cities and analyse how it evolved over time. The data cover both cities in both years.
  4. This is a cache of web data (provided in two parts) containing all the UK governmental webpages (.gov.uk) archived by Common Crawl from February to March 2024. Identify the key policy areas that Local Authorities in the UK focus on. Are there any Local Authorities that have implemented distinct policies or actions?
  5. This is a cache of web data containing all the UK governmental webpages (.gov.uk) archived by Common Crawl at two points in time: February to March 2024 (same data as Problem 4) and October 2025. Identify changes in a specific policy domain associated with the new government elected in July 2024.
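
The postcode criterion in the first three problems can be illustrated with a rough sketch. The pattern and the function name `find_postcode_areas` below are assumptions for illustration only: this is a simplified UK postcode shape check, not the procedure used to build the extracts, and it does not validate matches against a list of real postcodes. Note also that postcode areas (such as "M" or "B") do not coincide exactly with city boundaries.

```python
import re

# Simplified UK postcode shape: area letters, district, optional space,
# then the inward code (digit + two letters). Illustrative only.
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2})[0-9][A-Z0-9]?\s?[0-9][A-Z]{2}\b")


def find_postcode_areas(text):
    """Return the set of postcode areas (leading letters) found in text."""
    # Upper-case first so that postcodes typed in lower case still match.
    return {match.group(1) for match in POSTCODE_RE.finditer(text.upper())}


# Example: keep only pages whose text mentions an "M" or "B" postcode area.
# areas = find_postcode_areas(page_text)   # page_text is a hypothetical string
# keep = bool(areas & {"M", "B"})
```

A set of areas is returned rather than a boolean so that the same helper can serve both the "at least one UK postcode" check and a city-level filter.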

Data files and checksums

| File | Size | MD5 | SHA-1 |
| --- | --- | --- | --- |
| problem1_2021_finance.csv | 77 MiB | a3da6824dd12be8ea5e3869c2395c465 | 053a9dc490d785761fd1c91708749df38466ce30 |
| problem2_2021_creative.csv | 331 MiB | 16245a7181d53534060a5df43899b2ab | 32ab95d43d6a1986c0035c4e883306dee8e40f85 |
| problem3_2021_manchester.csv | 91 MiB | 3c14a900d2da78d19cb768f10ccb1c1b | 9a2461770784d072f2e46cde9b60d72a69e1f5fb |
| problem3_2024_manchester.csv | 134 MiB | 2653f8d66f6fbc8a7c2e09c769e8c834 | 52ebb928a911e8786a468d9c66652da78cb89379 |
| problem3_2021_birmingham.csv | 201 MiB | 07620c9a9a8c25482d6378dcfbc1d024 | 9417b54c37f45ff182103fe6f04ef0d2376bb034 |
| problem3_2024_birmingham.csv | 124 MiB | bf789a192328d610740f2e42479354ec | 0a85f8b7220fef7e5f1f999bd717023cf3f3092c |
| problem4_govuk_split1.zip | 1.0 GiB | c1af269f1fca80218bf47d57cd4e8516 | f152de705dbab874c449f789e7c4dbd3a7773f4f |
| problem4_govuk_split2.zip | 1.0 GiB | 3c6d44de8273d2a8af15e179a285797a | 8d016a4a8aeaf2bf2eb7e4d9cdca206ac9ccf986 |
| problem5_govuk.zip | 1.5 GiB | 767d19825d4533017f80a5d7fe6da133 | 74750f8dd7f0be4ce4f65743a2bcbb414e47213e |
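
To confirm that a download is intact, you can compare its digests with the values listed above. The following is a minimal sketch using Python's standard `hashlib`; the helper names `file_digests` and `verify_download` are our own, and the example assumes the file has been saved to the current directory.

```python
import hashlib
from pathlib import Path


def file_digests(path, chunk_size=1024 * 1024):
    """Return (md5_hex, sha1_hex) for a file, reading it in chunks."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with Path(path).open("rb") as handle:
        # Chunked reads keep memory use low for the GiB-sized archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()


def verify_download(path, expected_md5, expected_sha1):
    """Return True only if both digests match the published values."""
    md5_hex, sha1_hex = file_digests(path)
    return md5_hex == expected_md5.lower() and sha1_hex == expected_sha1.lower()


# Example (assumes the file is in the current directory):
# verify_download(
#     "problem1_2021_finance.csv",
#     "a3da6824dd12be8ea5e3869c2395c465",
#     "053a9dc490d785761fd1c91708749df38466ce30",
# )
```

Both digests are computed in a single pass so that each large file only has to be read once.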

Data description

You will find the following columns in the different data packages. Please be aware that not all columns are present in all data packages.

The remaining fields were produced by our TnT-LLM-inspired pipeline, which uses LLMs to classify websites into a typology of economic activities. The typology currently has two levels (high-level clusters and low-level clusters). These fields are:

Results

The results can be found in the following GitHub repositories, listed in the same order as the problems:

  1. https://github.com/kellyyubini/datathon_bristol
  2. https://github.com/JGIBristol/team2-createch
  3. https://github.com/jatonline/common-crawl-datathon-group-3
  4. https://github.com/laurieburchell/datathon_problem4
  5. https://github.com/eshasadia/G5-CommonCrawl

You can download an archive of the presentations from the event here.

Contributors