Bristol common crawl datathon repository (G5-CommonCrawl)

This repository contains the workflow, data, and analysis code produced by Group 5 for the Bristol Datathon on Common Crawl. Our project investigates policy change within the “Creating Opportunities / Breaking Down Barriers to Opportunity” mission, focusing specifically on the Best Start in Life agenda.

Using archived .gov.uk webpages from Feb–Mar 2024 and Oct 2025, we apply text processing, embedding-based semantic comparison, and clustering to identify how the discourse around early-years policy has shifted following the UK General Election (July 2024).

Bristol CommonCrawl Datathon Repository

Problem Statement

The dataset is a cache of web data containing all UK governmental webpages (.gov.uk) archived by Common Crawl at two points in time: February-March 2024 (parts 1 and 2, the same data as Problem 4) and October 2025. The task is to identify changes in a specific policy domain associated with the new government elected in July 2024.

Team

Meng Le Zhang, Aditi Dutta, Esha Sadia Nasir, Mariam Cook, Helena Byrne, E Chern Wong.

Requires

Python for data wrangling and classification.
R for some data visualisation and some prototype scripts (e.g. making API calls to LLMs).

Team Members

Meng Le Zhang

Aditi Dutta

validation: Validation file containing the contents of 20 URLs (2025). Helena and Meng Le each picked out keywords independently. The same contents were also sent to Copilot to check accuracy and agreement between human and LLM.

Esha Sadia Nasir

Mariam Cook

Helena Byrne

Data

E Chern Wong

Project Overview

We analysed a large cache of UK governmental webpages extracted from the Common Crawl. Due to file size constraints, the data was accessed via cloud notebooks and filtered to isolate policy-relevant subdomains.

Our workflow included:

URL identification for domains associated with early-years policy.

policy_classes_xxx: Contents classified using an LLM (Gemma 3) on a VM.
classified: Contents classified again, this time using an LLM hosted on Groq (llama3.1), run from Colab.

Test embedded graph

My Report

Keyword-driven filtering to extract relevant text spans from large .csv datasets.
Embedding generation (BERT) for semantic comparison of policy language across time.
Clustering & PCA to visualise shifts in discourse between 2024 and 2025.
Press release metadata extraction (date, organisation, headline, subtopics).
Validation using a combination of human coders and LLM-based summarisation.
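The keyword-driven filtering step in the list above can be illustrated with a minimal sketch. This is not the notebook code: the function name `extract_windows`, the `width` parameter, and the sample sentence are all hypothetical, and only a subset of the core policy terms is used.

```python
import re

# Illustrative subset of the core policy terms listed in this README.
POLICY_TERMS = ["Best Start in Life", "family hubs", "breakfast clubs"]


def extract_windows(text, terms, width=200):
    """Return (term, snippet) pairs with `width` characters either side of each match.

    Hypothetical helper for illustration; the project notebooks may differ.
    """
    windows = []
    for term in terms:
        # Case-insensitive literal match; re.escape guards against regex metacharacters.
        for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            start = max(0, match.start() - width)
            end = min(len(text), match.end() + width)
            windows.append((term, text[start:end]))
    return windows


# Made-up sentence standing in for page content.
sample = "Ministers today confirmed new funding for Family Hubs across England."
for term, snippet in extract_windows(sample, POLICY_TERMS, width=20):
    print(term, "->", snippet)
```

Working on short windows rather than whole pages keeps the text passed to the embedding step focused on the policy term and away from navigation text.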

Repository Structure

`/data/`

Contains processed and filtered datasets.

news_filtered_*.csv Filtered samples from both 2024 and 2025 crawls, limited to URLs containing gov.uk/government/news. These were generated using FilterURL.py, which uses polars to check URLs efficiently before loading content.

policy_classes_* Text content classified using Gemma 3 on a VM for topic detection.

classified/ Classification outputs produced using Groq-hosted LLMs for comparison.
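The URL filtering behind `news_filtered_*.csv` can be sketched as follows. This is not the contents of FilterURL.py: the repository script uses polars, whereas this sketch swaps in the standard-library `csv` module so that it runs without extra dependencies. The column names `url` and `content` and the demo rows are assumptions.

```python
import csv
import io

# Path fragment named in this README for the news subset.
URL_FRAGMENT = "gov.uk/government/news"


def filter_news_rows(handle, url_column="url"):
    """Stream rows from a CSV handle, yielding only those whose URL matches.

    `url_column` is an assumed column name, not taken from the real data.
    """
    for row in csv.DictReader(handle):
        # `or ""` covers short rows where DictReader fills the field with None.
        if URL_FRAGMENT in (row.get(url_column) or ""):
            yield row


# Tiny in-memory stand-in for a crawl extract (made-up rows).
demo_csv = io.StringIO(
    "url,content\n"
    "https://www.gov.uk/government/news/example-story,Some press release text\n"
    "https://www.gov.uk/browse/benefits,Navigation page\n"
)
kept = list(filter_news_rows(demo_csv))
print(len(kept), "row(s) kept")
```

Streaming row by row means the full file never has to sit in memory at once, which serves the same purpose as checking URLs before loading content.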

`/validation/`

Contains validation results for 20 manually reviewed URL contents (2025).

Helena and Meng independently extracted keywords.

The same URLs were processed using Copilot to assess LLM vs human agreement.

Notes include common LLM failure modes (e.g., misinterpreting link text as body content due to WET file formatting).
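One simple way to quantify agreement between two sets of extracted keywords is Jaccard overlap. The README does not state which agreement measure was used, so the metric, the function name `jaccard`, and the keyword lists below are assumptions made for illustration.

```python
def jaccard(keywords_a, keywords_b):
    """Jaccard overlap between two keyword lists, ignoring case and outer whitespace."""
    set_a = {k.strip().lower() for k in keywords_a}
    set_b = {k.strip().lower() for k in keywords_b}
    if not set_a and not set_b:
        return 1.0  # Two empty lists are treated as full agreement.
    return len(set_a & set_b) / len(set_a | set_b)


# Made-up keyword lists for a single URL; not taken from the validation file.
coder_1 = ["family hubs", "childcare hours", "breakfast clubs"]
coder_2 = ["Family Hubs", "breakfast clubs", "school readiness"]
print(jaccard(coder_1, coder_2))  # 2 shared keywords out of 4 distinct -> 0.5
```

The same function can compare either human coder against the LLM output, which gives a per-URL score for both human-human and human-LLM agreement.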

`/notebooks/` (referenced externally in Colab)

Links used during development:

Reading & parsing data: https://colab.research.google.com/drive/1Y8OIDSFNeWqP1hBvSiCvSt0-ozkKrd8J?usp=sharing

Generating embeddings: https://colab.research.google.com/drive/1MSg-4xyR1RRjLLxVMFJxvWXHUjMAWIOI?usp=sharing

These notebooks contain:

Data loading and preprocessing scripts

Keyword-based sampling of text windows

Embedding generation and similarity matrices

PCA and clustering visualisation code

Requirements

Python: Data wrangling (Polars, Pandas), Keyword extraction, Embedding generation, Clustering and semantic analysis

R: Data visualisation prototypes, Scripts for calling external LLM APIs

Analysis Summary

Core policy terms included:

['Best Start in Life', 'Sure Start', 'family hubs', 'free school meals', 'School meals', 'Development checks', 'School readiness', 'breakfast clubs', 'free breakfast', 'childcare hours', 'free childcare hours', 'free childcare for working parents', 'tax-free childcare', 'universal credit childcare']

We compared 2024 vs 2025 text embeddings to explore:

Policy topics that persisted across governments
Emerging or discontinued discourse
Semantic drift (e.g., Best Start in Life and family hubs becoming nearly identical in 2025, suggesting policy repositioning)
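Semantic drift of this kind is read off cosine similarity between term embeddings. The sketch below shows the calculation in plain Python on made-up three-dimensional vectors. Real BERT embeddings have hundreds of dimensions, and the numbers here are not project results.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Convention chosen here for zero vectors.
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Nested dict of pairwise cosine similarities keyed by term."""
    labels = list(vectors)
    return {
        a: {b: cosine_similarity(vectors[a], vectors[b]) for b in labels}
        for a in labels
    }


# Toy vectors invented for illustration only.
toy_embeddings = {
    "Best Start in Life": [0.9, 0.1, 0.3],
    "family hubs": [0.8, 0.2, 0.3],
    "tax-free childcare": [0.1, 0.9, 0.2],
}
matrix = similarity_matrix(toy_embeddings)
for term, row in matrix.items():
    print(term, {other: round(score, 2) for other, score in row.items()})
```

Building one such matrix per crawl year and comparing the same cell across years is one way to see whether two terms have moved closer together.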

Outputs include:

Cosine similarity matrices
PCA embeddings
Cluster maps
Summary tables of policy mentions
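A summary table of policy mentions can be built with a simple per-year count. The sketch below is an assumption about how such a table might be produced, not the project's code. The function name `count_mentions` and the example texts are hypothetical.

```python
import re
from collections import Counter


def count_mentions(documents, terms):
    """Count case-insensitive occurrences of each term across a list of texts.

    Overlapping terms are counted independently, so "School meals" also
    matches inside "free school meals".
    """
    counts = Counter({term: 0 for term in terms})
    for text in documents:
        for term in terms:
            counts[term] += len(re.findall(re.escape(term), text, flags=re.IGNORECASE))
    return counts


# Made-up texts grouped by crawl year; not drawn from the real data.
docs_by_year = {
    "2024": ["Family hubs opened. More family hubs planned."],
    "2025": ["Best Start in Life builds on family hubs."],
}
terms = ["Best Start in Life", "family hubs"]
table = {year: count_mentions(docs, terms) for year, docs in docs_by_year.items()}
for year, counts in table.items():
    print(year, dict(counts))
```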

A comparison spreadsheet is available here: https://docs.google.com/spreadsheets/d/13n3gUIBtA1DpU14Wpvb-2o08nLG0drN4iTlRpzgIKEQ/edit?usp=sharing

Visualisation Example

Interactive PCA & cluster plots are included in /docs.

Example:

Notes on Validation & Limitations

The WET files contain plain text only, causing LLMs to misinterpret navigation links as body content.

Human validation showed good accuracy overall, but LLMs sometimes hallucinated or over-interpreted hyperlinks.

Some ministerial subdomains lacked coverage in specific crawls (0 occurrences).

Future Work

Extend analysis to additional subdomains (e.g., /organisations/)

Improve causal relation extraction between policy mentions

Apply topic modelling or supervised classifiers on a larger corpus

Construct a time-series narrative of policy shifts post-2024 election
