13 changes: 13 additions & 0 deletions G5-CommonCrawl.Rproj
@@ -0,0 +1,13 @@
Version: 1.0

RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default

EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8

RnwWeave: knitr
LaTeX: pdfLaTeX
35 changes: 31 additions & 4 deletions README.md
@@ -1,11 +1,38 @@
# G5-CommonCrawl
# Bristol common crawl datathon repository (G5-CommonCrawl)

Access to VM: https://bdfi.atlassian.net/wiki/external/OTY3MjIxYTc2NjFhNDhjZTk5NzIzMjY1YWJlNGU1NTQ
We are a group of data scientists at a datathon, using Common Crawl data to solve the following problem:

Presentation: https://livewarwickac-my.sharepoint.com/:p:/g/personal/u5552013_live_warwick_ac_uk/IQAO_X-5AL-UR6hv6A1eKt4SAXkzxACUuJMgJ5pHpYVU1bg?e=YuHmoA
> This is a cache of web data, which contains all the UK governmental webpages (.gov.uk) that have been archived by the Common Crawl in two points in time: February-March 2024 (part 1 and part 2 same as Problem 4) and October 2025. Identify changes in a specific policy domain associated with the new government elected in July 2024.

This repository contains all our code.

To tackle the problem, we first:
- Targeted our search. The raw WET file had 2.4 million website entries. To focus on policies, we filtered the data to webpages whose URL contains `gov.uk/government/news`, which is where new policies are announced. This leaves roughly 3,000 webpages.
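
The filtering step can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code in `FilterURL.py` (which uses polars). The function names and the sample record are hypothetical, and the parser assumes an uncompressed WET file in which no page text begins a line with `WARC/1.0`.

```python
import io

NEWS_PATH = "gov.uk/government/news"  # URL fragment named in this README


def iter_wet_records(lines):
    """Yield (url, text) pairs from the lines of an uncompressed WET file."""
    url, body, in_headers = None, [], False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("WARC/1.0"):
            # A new record starts: emit the previous one if it had a URL.
            if url is not None:
                yield url, "\n".join(body).strip()
            url, body, in_headers = None, [], True
        elif in_headers:
            if line == "":
                in_headers = False  # a blank line ends the WARC header block
            elif line.startswith("WARC-Target-URI:"):
                url = line.split(":", 1)[1].strip()
        else:
            body.append(line)
    if url is not None:
        yield url, "\n".join(body).strip()


def filter_news(lines, fragment=NEWS_PATH):
    """Keep only records whose URL contains the given fragment."""
    return [(u, t) for u, t in iter_wet_records(lines) if fragment in u]


# Made-up sample with two records; only the first is a news page.
sample = """WARC/1.0
WARC-Type: conversion
WARC-Target-URI: https://www.gov.uk/government/news/example-announcement
Content-Length: 25

Example announcement text

WARC/1.0
WARC-Type: conversion
WARC-Target-URI: https://www.gov.uk/browse/benefits
Content-Length: 13

Benefits page
"""

news = filter_news(io.StringIO(sample))
print([url for url, _ in news])
```

Checking the URL in the header before collecting the body keeps the memory footprint small, which matters when only a few thousand of 2.4 million entries are wanted.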


To analyse the website content, we took two approaches:
- Targeted approach. We started from a 'seed' of 12 policy names or terms and kept only the webpages that mention them. We then generated embeddings to compare the context in which the same terms appear across time.
- LLM approach. An LLM read the documents and classified any policy instruments into categories. We then compared the classifications over time.
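
As a rough sketch of the two comparisons (not our actual pipeline), the snippet below filters page text by seed terms and compares the share of each classification between two years. The seed terms, category labels and data are made up for illustration, and the embedding step is left out.

```python
from collections import Counter

# Hypothetical seed terms; the project's actual list of 12 is not shown here.
SEED_TERMS = ["net zero", "planning reform"]


def mentions_seed(text, terms=SEED_TERMS):
    """Return the seed terms that appear in the text (case-insensitive)."""
    lowered = text.lower()
    return [t for t in terms if t in lowered]


def classification_shares(labels):
    """Return the fraction of labels falling in each category."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


def share_change(before, after):
    """Change in share per category between two sets of labels."""
    s0, s1 = classification_shares(before), classification_shares(after)
    return {k: s1.get(k, 0.0) - s0.get(k, 0.0) for k in sorted(set(s0) | set(s1))}


# Made-up labels standing in for the LLM classifications of each crawl.
labels_2024 = ["regulation", "funding", "funding", "regulation"]
labels_2025 = ["funding", "funding", "funding", "regulation"]

print(mentions_seed("New Net Zero targets announced"))
print(share_change(labels_2024, labels_2025))
```

Comparing shares rather than raw counts avoids reading a difference in crawl size as a change in policy.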

# Results


# Classifications:


## Task/ folder

`validation`: Validation file containing the contents of 20 URLs (2025). Helena and Meng Le each picked out keywords independently. The same contents were also sent to Copilot to check accuracy and agreement between human and LLM.
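
One simple way to quantify agreement between two sets of keywords for the same page is the Jaccard overlap. The sketch below assumes that metric and uses made-up keywords; it is not necessarily the measure used in the validation file.

```python
def normalise(keywords):
    """Lower-case and strip keywords so trivial differences do not count."""
    return {k.strip().lower() for k in keywords if k.strip()}


def jaccard(a, b):
    """Jaccard overlap of two keyword collections (1.0 means identical sets)."""
    a, b = normalise(a), normalise(b)
    if not a and not b:
        return 1.0  # two empty sets agree trivially
    return len(a & b) / len(a | b)


# Hypothetical keywords picked for one page by a human and by an LLM.
human_keywords = ["Housing", "planning"]
llm_keywords = ["housing", "planning", "targets"]

print(round(jaccard(human_keywords, llm_keywords), 3))
```

Averaging this score over the 20 validation pages would give a single agreement figure for each pair of annotators.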

## data

`news_filtered_xxxx.csv`: Filtered file from the 2024/2025 crawl, restricted to URLs that contain `gov.uk/government/news`. Produced by `FilterURL.py`, which uses polars to check the URL first before reading in the lines.


## test embedded graph

# My Report

Here’s the interactive chart:

<iframe src="docs/example plot.html" width="800" height="600" style="border:none;"></iframe>
27 changes: 27 additions & 0 deletions data_viz/plotly data classifications.R
@@ -0,0 +1,27 @@
## Plots of classification data, saved as an HTML widget
library(dplyr)
library(ggplot2)
library(plotly)
library(htmlwidgets)

## Simulated example data (placeholder for the real classifications)
n_sample = 100
df =
  data.frame(
    policy.instrument = 1:n_sample,
    main.classification = letters[1:4] %>% sample(size = n_sample, replace = TRUE),
    secondary.classification = letters[1:4] %>% sample(size = n_sample, replace = TRUE),
    ## the two years are recycled, so n_sample must be even
    time = 2024:2025 %>% as.character
  )
df

## Dodged bar chart: count of each main classification, split by year
gg_version =
  ggplot(df, aes(y = main.classification, fill = time, group = time)) +
  geom_bar(position = position_dodge())

## save by hand
gg_version %>% ggplotly() %>% htmlwidgets::saveWidget('docs/example plot.html')