eshasadia · MengLeZhang · Nov 28, 2025 · Nov 27, 2025 · Nov 28, 2025 · Nov 28, 2025
diff --git a/G5-CommonCrawl.Rproj b/G5-CommonCrawl.Rproj
@@ -0,0 +1,13 @@
+Version: 1.0
+
+RestoreWorkspace: Default
+SaveWorkspace: Default
+AlwaysSaveHistory: Default
+
+EnableCodeIndexing: Yes
+UseSpacesForTab: Yes
+NumSpacesForTab: 2
+Encoding: UTF-8
+
+RnwWeave: knitr
+LaTeX: pdfLaTeX
diff --git a/README.md b/README.md
@@ -1,11 +1,38 @@
-# G5-CommonCrawl
+# Bristol common crawl datathon repository (G5-CommonCrawl)
 
-Access to VM: https://bdfi.atlassian.net/wiki/external/OTY3MjIxYTc2NjFhNDhjZTk5NzIzMjY1YWJlNGU1NTQ
+We are a group of data scientists at a datathon using Common Crawl data to solve a problem:
 
-Presentation: https://livewarwickac-my.sharepoint.com/:p:/g/personal/u5552013_live_warwick_ac_uk/IQAO_X-5AL-UR6hv6A1eKt4SAXkzxACUuJMgJ5pHpYVU1bg?e=YuHmoA
+> This is a cache of web data, which contains all the UK governmental webpages (.gov.uk) that have been archived by the Common Crawl in two points in time: February-March 2024 (part 1 and part 2 same as Problem 4) and October 2025. Identify changes in a specific policy domain associated with the new government elected in July 2024.
 
+This repository contains all our code.
+
+To tackle the problem, we first:
+- Targeted our search. We had 2.4 million website entries in the raw WET file. To look at policies, we explicitly filtered our data to webpages under the `gov.uk/government/news` path, which is where new policies are announced. This gives us roughly 3,000 webpages.
+
+
+To analyse the website content, we took two approaches:
+- Targeted approach. We had a 'seed' of 12 policy names or terms. We filtered our data to only webpages that mentioned these terms. Then we generated embeddings to look at differences in context for the same terms across time.
+- LLM approach. We used an LLM to read documents and classify any policy instruments into certain categories. Then we compared differences in classifications over time.
+
+# Results
+
+
+# Classifications:
 
 
 ## Task/ folder
 
-`validation`: Validation file of 20 url contents (2025). Helena and Meng Le picked out keyword independently. Also sent to copilot to check accuracy and agreement between human/llm
+`validation`: Validation file of 20 URL contents (2025). Helena and Meng Le picked out keywords independently. The file was also sent to Copilot to check accuracy and agreement between human and LLM.
+
+## data
+
+`news_filtered_xxxx.csv`: Filtered file from the 2024/2025 crawl, restricted to URLs that contain `gov.uk/government/news`. Produced by code in `FilterURL.py`, where polars is used to check the URL first before reading in lines.
+
+
+## test embedded graph
+
+# My Report
+
+Here’s the interactive chart:
+
+<iframe src="docs/example plot.html" width="800" height="600" style="border:none;"></iframe>
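The README describes filtering the raw WET entries down to URLs containing `gov.uk/government/news`. A minimal sketch of that step follows, using only the Python standard library. It is not the repository's `FilterURL.py`, which the README says uses polars. The names `NEWS_PATH`, `iter_wet_records` and `filter_news` are hypothetical. It assumes an uncompressed WET file read as text, and that the marker `WARC/1.0` never appears alone on a line inside page text.

```python
from typing import Iterable, Iterator, List, Tuple

# Substring the README says is used to select policy announcements.
NEWS_PATH = "gov.uk/government/news"


def iter_wet_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (url, text) pairs from the lines of an uncompressed WET file.

    Records without a WARC-Target-URI header (e.g. warcinfo) are skipped.
    """
    url = None
    body: List[str] = []
    in_body = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "WARC/1.0":
            # A new record starts: emit the previous one if it had a URL.
            if url is not None:
                yield url, "\n".join(body).strip()
            url, body, in_body = None, [], False
        elif not in_body:
            if line.startswith("WARC-Target-URI:"):
                url = line.split(":", 1)[1].strip()
            elif line == "":
                # A blank line separates the headers from the extracted text.
                in_body = True
        else:
            body.append(line)
    if url is not None:
        yield url, "\n".join(body).strip()


def filter_news(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only records whose URL contains the news path."""
    return [(url, text) for url, text in records if NEWS_PATH in url]
```

Checking the URL before collecting the text, as the README describes for the polars version, would avoid holding unwanted page bodies in memory. This sketch keeps the two steps separate for readability.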
diff --git a/data_viz/plotly data classifications.R b/data_viz/plotly data classifications.R
@@ -0,0 +1,27 @@
+## plots of data to save 
+library(dplyr)
+library(plotly)
+
+n_sample = 100
+# See ?sample for the arguments used below
+df = 
+  data.frame(
+    policy.instrument = seq_len(n_sample),
+    main.classification = letters[1:4] %>% sample(size = n_sample, replace = TRUE),
+    secondary.classification = letters[1:4] %>% sample(size = n_sample, replace = TRUE),
+    time = 2024:2025 %>% as.character  # recycled to n_sample rows (n_sample must be even)
+  )
+df
+
+library(ggplot2)
+library(plotly)
+library(htmlwidgets)
+
+gg_version = 
+  ggplot(df, aes(y = main.classification, fill = time, group = time)) +
+  geom_bar(position = position_dodge())
+
+
+## save by hand
+gg_version %>% ggplotly() %>% htmlwidgets::saveWidget('docs/example plot.html')
+# See ?ggplotly for conversion options