---
layout: default
title: Home
---
We are a group of data scientists at a datathon, using Common Crawl data to solve a problem hosted by Bristol University ([link](https://www.urbaneconomies.co.uk/datathon.html#datathon-materials)).
This repository contains all our code.
The dataset is a cache of web data containing all the UK governmental webpages (.gov.uk) that were archived by Common Crawl at two points in time: February-March 2024 (part 1 and part 2, the same as Problem 4) and October 2025. The task is to identify changes in a specific policy domain associated with the new government elected in July 2024.
To tackle the problem, we first:
- Targeted our search. The raw WET file contained 2.4 million website entries. To focus on policies, we filtered the data to webpages under the `gov.uk/government/news` path, which is where new policies are announced. This left us with roughly 3,000 webpages.
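
The filtering step above can be sketched with the standard library alone. This is a minimal illustration, not our exact pipeline: `iter_wet_records` and `filter_news` are hypothetical helper names, and the parser makes the simplifying assumption that records can be split on lines equal to `WARC/1.0` instead of honouring each record's `Content-Length` header.

```python
from typing import Iterable, Iterator, List, Tuple

NEWS_PATH = "gov.uk/government/news"  # URL fragment named in the text above


def iter_wet_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (target_uri, text) for every record that has a WARC-Target-URI.

    Simplification (assumption): a line equal to 'WARC/1.0' starts a new
    record; Content-Length is ignored.
    """
    uri, body, in_headers = None, [], False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "WARC/1.0":
            # A new record begins: emit the previous one if it had a URI.
            if uri is not None:
                yield uri, "\n".join(body).strip()
            uri, body, in_headers = None, [], True
        elif in_headers:
            if line == "":
                in_headers = False  # blank line ends the header block
            elif line.lower().startswith("warc-target-uri:"):
                uri = line.split(":", 1)[1].strip()
        else:
            body.append(line)
    if uri is not None:
        yield uri, "\n".join(body).strip()


def filter_news(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Keep only the records whose URL contains the news path."""
    return [(u, t) for u, t in iter_wet_records(lines) if NEWS_PATH in u]
```

Because the function accepts any iterable of lines, a compressed file can be streamed with `gzip.open(path, "rt", encoding="utf-8", errors="replace")` without loading it into memory.
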
To analyse the website content, we took two approaches:
- Targeted approach. We started from a 'seed' of 12 policy names or terms and filtered our data to only the webpages that mentioned them. We then generated embeddings to compare the context in which the same terms appear across time.
- LLM approach. We used an LLM to read the documents and classify any policy instruments into categories, then compared how the classifications differ over time.
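
As a rough illustration of the targeted approach, the sketch below extracts the words surrounding each occurrence of a seed term and compares the two time periods with cosine similarity. The bag-of-words vector is only a stand-in for the real embeddings (the embedding model is not specified here), and `context_windows`, `bow_vector` and `cosine` are hypothetical helper names.

```python
import math
import re
from collections import Counter


def context_windows(text, term, width=5):
    """Return up to `width` words on each side of every occurrence of `term`."""
    words = re.findall(r"[a-z0-9'-]+", text.lower())
    term_words = term.lower().split()
    n = len(term_words)
    windows = []
    for i in range(len(words) - n + 1):
        if words[i:i + n] == term_words:
            left = words[max(0, i - width):i]
            right = words[i + n:i + n + width]
            windows.append(left + right)
    return windows


def bow_vector(windows):
    """Stand-in 'embedding': word counts over all context windows."""
    return Counter(w for window in windows for w in window)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0
```

A low similarity between `bow_vector(context_windows(text_2024, "Net Zero"))` and the same vector for the 2025 text would suggest that the term is being discussed in a different context. Swapping `bow_vector` for a real embedding model leaves the comparison step unchanged.
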
This plot shows the difference over time in the types of policy instruments that are announced. Following the general election, we find a XXXX
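
The comparison behind a plot like this can be computed as the change in each category's share of classifications between the two periods. The sketch below assumes each document has already been given a single category label; the category names in the usage note are made up for illustration and `category_shares` and `share_change` are hypothetical helper names.

```python
from collections import Counter


def category_shares(labels):
    """Fraction of documents assigned to each category."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}


def share_change(before, after):
    """Share after minus share before, for every category seen in either period."""
    b, a = category_shares(before), category_shares(after)
    return {c: a.get(c, 0.0) - b.get(c, 0.0) for c in sorted(set(a) | set(b))}
```

For example, `share_change(["funding", "funding", "regulation"], ["regulation", "regulation", "funding"])` reports a positive change for `regulation` and a negative one for `funding`. Comparing shares instead of raw counts keeps the two crawls comparable even if they contain different numbers of pages.
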
The seed policy terms that we wanted to look at were:
- “Best Start in Life”
- Sure Start
- Family Hub
- Net Zero
- Healthy Start Vouchers
- Child
- Children
- Tax-Free Childcare
- Universal Credit childcare
- Free Childcare for Working Parents
- Babies
- Infants
- School meals
- Free school meals
- Breakfast clubs
- Free breakfast club
- Vitamins
- Fresh fruit and veg
- Nutrients
- Health visitor
- Development checks
- School readiness
- Literacy
- Numeracy
- Digital literacy
- Clean air
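
One way to check which seed terms a page mentions is a single case-insensitive regular expression with word boundaries. This is a sketch under assumptions, not necessarily how our filter was implemented: `SEED_TERMS` is only a subset of the list above, and `build_matcher` and `terms_mentioned` are hypothetical helper names.

```python
import re

# Subset of the seed terms listed above, for illustration only.
SEED_TERMS = [
    "Best Start in Life", "Sure Start", "Family Hub", "Net Zero",
    "Child", "Children", "School meals", "Free school meals",
]


def build_matcher(terms):
    """Compile one case-insensitive pattern that matches any whole term."""
    # Longest first, so "Free school meals" is preferred over "School meals".
    ordered = sorted(terms, key=len, reverse=True)
    pattern = "|".join(re.escape(t) for t in ordered)
    return re.compile(rf"\b(?:{pattern})\b", re.IGNORECASE)


def terms_mentioned(text, matcher):
    """Return the distinct terms found in `text`, lower-cased and sorted."""
    return sorted({m.group(0).lower() for m in matcher.finditer(text)})
```

The word boundaries mean that "Child" does not fire on "Children", which is why both appear in the list. They also mean plurals such as "Family Hubs" are not matched unless the plural form is added as its own term.
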
We found xxx that mentions etc etc
We also manually validated the LLM classifications. We found XXX
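
A validation like this is often summarised with raw agreement and Cohen's kappa, which corrects agreement for chance. The text does not say which metric was used, so the sketch below is only one reasonable option; `agreement_and_kappa` is a hypothetical helper that assumes one human label and one LLM label per document, in the same order.

```python
from collections import Counter


def agreement_and_kappa(human, llm):
    """Return (observed agreement, Cohen's kappa) for two label sequences."""
    if len(human) != len(llm) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    # Observed agreement: fraction of documents with identical labels.
    p_observed = sum(h == m for h, m in zip(human, llm)) / n
    # Expected agreement if both raters labelled independently at their own rates.
    human_counts, llm_counts = Counter(human), Counter(llm)
    p_expected = sum(human_counts[c] * llm_counts[c] for c in human_counts) / (n * n)
    if p_expected == 1:
        return p_observed, 1.0  # both raters used a single identical label
    kappa = (p_observed - p_expected) / (1 - p_expected)
    return p_observed, kappa
```

Kappa near 1 indicates agreement well beyond chance, while a value near 0 means the LLM agrees with the human reviewer no more often than chance would predict.
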
- The raw files are very large, so we used the `polars` library to read in only the lines of the data whose URL matches our target path.