content/blog/entries/cc-datacatalog-data-processing/contents.lr
title: Visualize CC Catalog data - data processing
---
categories:
announcements
Welcome to the data processing part of the GSoC project!
### Data extraction
Each month, Creative Commons uses [Common Crawl](http://commoncrawl.org/) data to find all domains that contain CC licensed content. As you might guess, the amount of data is very large, so the CC Catalog data is stored in [S3](http://commoncrawl.org/the-data/get-started/) buckets and [Apache Spark](https://spark.apache.org/) is used to extract the data from Common Crawl.
Spark is used again in this project to extract the data, in the form of Parquet files, from the buckets. To facilitate the analysis and processing of the data, the files are converted to TSV (tab-separated values); a short conversion sketch follows the field list below. The dataset I work on contains the following fields:
- provider_domain: name of the domain with licensed content.
- cc_license: path to the Creative Commons license deed used by the _provider\_domain_.
- images: number of images shown on the _provider\_domain_ web page.
- links: JSON field that contains a dictionary with domains as keys and numbers of links as values. A link appears when the _provider\_domain_ web page has an href tag that points to the domain key.
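
The conversion itself is straightforward with Spark. Below is a minimal PySpark sketch of the idea; the bucket and output paths are placeholders, not the project's real locations:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cc-catalog-to-tsv").getOrCreate()

# Read the Parquet extract (placeholder path) and keep the fields listed above.
df = spark.read.parquet("s3a://example-bucket/cc-catalog/*.parquet")
(df.select("provider_domain", "cc_license", "images", "links")
   .write
   .option("sep", "\t")
   .option("header", True)
   .mode("overwrite")
   .csv("cc_catalog_tsv"))
```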
Each file can easily contain tens of millions of rows. My first approach is to load the information into a Pandas DataFrame, but this can become very slow. Therefore, I will test the data processing scripts with a portion of the real data. Afterwards, I will use [Dask](https://dask.org/) with the entire dataset. Dask provides advanced parallelism for analytics and behaves much like Pandas.
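
As a rough sketch of that workflow (the file names and header-less layout are assumptions on my part), the same read can be prototyped with Pandas on a sample and then switched to Dask for the full extract:

```python
import pandas as pd
import dask.dataframe as dd

COLUMNS = ["provider_domain", "cc_license", "images", "links"]

# Prototype the processing on a small sample with Pandas...
sample = pd.read_csv("cc_catalog_sample.tsv", sep="\t", names=COLUMNS)

# ...then run the same logic on the full extract with Dask, which splits the
# files into partitions and processes them in parallel.
full = dd.read_csv("cc_catalog_part_*.tsv", sep="\t", names=COLUMNS, blocksize="64MB")
print(full["images"].sum().compute())  # Dask is lazy until .compute()
```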
### Cleansing and filtering
This step is about preparing the data for analysis and reducing its volume, in order to get a meaningful visualization. The data that comes from the S3 buckets is actually pretty clean (no strange characters or incomplete rows, for example). Nevertheless, as a first step, duplicate rows (given by duplicate URLs) are deleted. Next, I develop pruning rules (see the sketch after this list). I try to:
- exclude cycles (cyclic edges),
- exclude lonely nodes,
- avoid duplicates (for example, subdomains which are part of a single domain),
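
Here is a minimal Pandas sketch of the first few rules above, on toy rows; the values are invented and the real pipeline may differ:

```python
import pandas as pd  # the same steps port to Dask's DataFrame API

# Toy rows standing in for the real TSV (values are hypothetical).
df = pd.DataFrame({
    "provider_domain": ["flickr.com", "flickr.com", "metmuseum.org", "example.org"],
    "cc_license": ["https://creativecommons.org/licenses/by/2.0/"] * 2
                  + ["https://creativecommons.org/licenses/by-sa/4.0/",
                     "https://creativecommons.org/licenses/by/4.0/"],
    "images": [120, 120, 40, 0],
    "links": [{"metmuseum.org": 3}, {"metmuseum.org": 3}, {"metmuseum.org": 1}, {}],
})

df = df.drop_duplicates(subset="provider_domain")  # remove duplicate rows

# Exclude cycles: drop links from a domain back to itself.
df["links"] = [
    {dom: n for dom, n in links.items() if dom != src}
    for src, links in zip(df["provider_domain"], df["links"])
]

# Exclude lonely nodes: keep domains that link out or are linked to.
linked_to = {dom for links in df["links"] for dom in links}
df = df[(df["links"].map(len) > 0) | df["provider_domain"].isin(linked_to)]
```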
### Formatting Domain Names
Now in the dataset we have domain names in the form of URLs, but we want the graph labels to look clean. This is why I am going to extract the bare domain name from the URLs in the dataset. For this purpose, I use [tldextract](https://github.com/john-kurkowski/tldextract), a simple and complete open source library for extracting the parts of a domain (suffix, subdomain, and domain name). This package is also available on conda-forge. Here is how tldextract works:
```
import tldextract

ext = tldextract.extract('http://forums.bbc.co.uk')
print((ext.subdomain, ext.domain, ext.suffix))
# ('forums', 'bbc', 'co.uk') -> we extract the domain name "bbc"
```
The main part is the extraction of the domain name. This will be applied to the _provider\_domain_ and _links_ fields in order to build the graph. The domain names will be the ones displayed over the nodes, as depicted in [my first blog post](https://creativecommons.github.io/blog/entries/cc-datacatalog-visualization/).
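
For illustration, a small helper along these lines could be mapped over both fields (the column names follow the toy sketches above and this is not the project's exact code):

```python
import pandas as pd
import tldextract

def domain_name(url_or_domain):
    # Keep only the registered domain, e.g. 'http://forums.bbc.co.uk' -> 'bbc'.
    return tldextract.extract(url_or_domain).domain

# Toy rows with the two fields that get renamed (values are hypothetical).
df = pd.DataFrame({
    "provider_domain": ["forums.bbc.co.uk", "en.wikipedia.org"],
    "links": [{"en.wikipedia.org": 5}, {"forums.bbc.co.uk": 2}],
})

df["provider_domain"] = df["provider_domain"].map(domain_name)
df["links"] = df["links"].map(
    lambda links: {domain_name(dom): n for dom, n in links.items()}
)
print(df)  # the nodes will be labelled 'bbc' and 'wikipedia'
```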
### License validation
Another important aspect is the license types. In the dataset, we do not have the exact license name; rather, we have a URL that points to the license definition on [creativecommons.org](https://creativecommons.org). We have developed a [function](https://github.com/creativecommons/cccatalog/blob/master/src/providers/api/modules/etlMods.py#L75) that uses regular expressions to validate the format of these URLs and to extract from them the license name and version. This information will be shown in the pie chart that appears after the user clicks on a node.
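
As a rough illustration of the idea (this is not the linked function, and it ignores jurisdiction-specific and public-domain URLs), a regular expression can pull the name and version out of a deed URL:

```python
import re

LICENSE_URL = re.compile(
    r"https?://creativecommons\.org/licenses/(?P<name>[a-z-]+)/(?P<version>\d+\.\d+)/?"
)

def parse_license(url):
    match = LICENSE_URL.match(url)
    if match is None:
        return None                                   # not a recognised CC deed URL
    return match.group("name"), match.group("version")

print(parse_license("https://creativecommons.org/licenses/by-sa/4.0/"))  # ('by-sa', '4.0')
print(parse_license("https://example.com/not-a-license"))                # None
```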