diff --git a/README.md b/README.md
index 1ad03b0..07c17d4 100644
--- a/README.md
+++ b/README.md
@@ -1,13 +1,15 @@
-# Bristol common crawl datathon repository (G5-CommonCrawl)
+# Bristol CommonCrawl Datathon
 
-We are a group of data scientists at a datathon using commoncrawl data to solve a problem:
+# Problem Statement
+This is a cache of web data, which contains all the UK governmental webpages (.gov.uk) that have been archived by the Common Crawl in two points in time: February-March 2024 (part 1 and part 2 same as Problem 4) and October 2025. Identify changes in a specific policy domain associated with the new government elected in July 2024.
 
-> This is a cache of web data, which contains all the UK governmental webpages (.gov.uk) that have been archived by the Common Crawl in two points in time: February-March 2024 (part 1 and part 2 same as Problem 4) and October 2025. Identify changes in a specific policy domain associated with the new government elected in July 2024.
+# Team
+We are a group of data scientists at a datathon using commoncrawl data to solve a problem
 
-This repository contains all our code.
+# Methods
 
 To tackle the problem, we first:
 
-- Targeted our searched. We had 2.4 million website entries in the raw WET file. To look at policies, we explicitly filtered our data to website in the `gov.uk/government/news` subdomain. This is where new policies are announced. This gives us roughly to 3000 webpages.
+- Targeted our searched. we had 2.4 million website entries in the raw WET file. To look at policies, we explicitly filtered our data to website in the `gov.uk/government/news` subdomain. This is where new policies are announced. This gives us roughly to 3000 webpages.
 
 To analyse the webiste content, we took two approach:
diff --git a/topic_count.png b/topic_count.png
new file mode 100644
index 0000000..1dce969
Binary files /dev/null and b/topic_count.png differ
diff --git a/wordcloud.png b/wordcloud.png
new file mode 100644
index 0000000..d568c8c
Binary files /dev/null and b/wordcloud.png differ