layout	default
title	Home

Who are we?

We are a group of data scientists at a datathon using commoncrawl data to solve a problem hosted by Bristol University (link)[https://www.urbaneconomies.co.uk/datathon.html#datathon-materials].

This repository contains all our code.

What was the datathon problem?

This is a cache of web data, which contains all the UK governmental webpages (.gov.uk) that have been archived by the Common Crawl in two points in time: February-March 2024 (part 1 and part 2 same as Problem 4) and October 2025. Identify changes in a specific policy domain associated with the new government elected in July 2024.

What did we do?

To tackle the problem, we first:

Targeted our searched. We had 2.4 million website entries in the raw WET file. To look at policies, we explicitly filtered our data to website in the gov.uk/government/news subdomain. This is where new policies are announced. This gives us roughly to 3000 webpages.

To analyse the webiste content, we took two approach:

Target approached. We had a 'seed' of 12 policy names or terms. We filtered our data to only webpages that mentioned these terms. Then we generated embeddings to look at differences in context for the same terms across time.
LLM approach to read documents and classify any policy instruments into certain categories. Then we compared differences in classifications over time.

What did we find:

Classifications:

This plot shows the difference over time in the types of policy instruments that are announced. Following the general election, we find a XXXX

Embediing

The seed policy terms that we wanted to look at were

“Best Start in Life”
Sure Start
Family Hub
Net Zero
Healthy Start Vouchers
Child
Children
Tax-Free Childcare
Universal Credit childcare
Free Childcare for Working Parents
Babies
Infants
School meals
Free school meals
Breakfast clubs
Free breakfast club
Vitamins
Fresh fruit and veg
Nutrients
Health visitor
Development checks
School readiness
Literacy
Numeracy
Digital literacy
Clean air

We found xxx that mentions etc etc

Other findings

We also human validated the LLM results/ classifications. We found XXX

What did we learn?

The raw files are very large -- we used polars library to only read in read specific lines of the data that match our subdomain url.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Who are we?

What was the datathon problem?

What did we do?

What did we find:

Classifications:

Embediing

Other findings

What did we learn?

FilesExpand file tree

index.md

Latest commit

History

index.md

File metadata and controls

Who are we?

What was the datathon problem?

What did we do?

What did we find:

Classifications:

Embediing

Other findings

What did we learn?