
Commit 9dbd66a

fixing typos and adding more information to the post
1 parent 76ebb71 commit 9dbd66a


content/blog/entries/cc-datacatalog-data-processing/contents.lr

Lines changed: 17 additions & 8 deletions
@@ -1,4 +1,4 @@
1-
title: Visualize CC Catalog data
1+
title: Visualize CC Catalog data - data processing
22
---
33
categories:
44
announcements
@@ -18,13 +18,22 @@ Welcome to the data processing part of the GSoC project! In this blog post, I am
1818

1919
### Data extraction
2020

21-
Each month, Creative Commons uses [Common Crawl](http://commoncrawl.org/) data to find all domains that contain *CC* licensed content. As you might be guessing, the amount of data is very big, so the CC Catalog data is stored in S3 buckets and [Apache Spark](https://spark.apache.org/) is used to extract the data from Common Crawl.
21+
Each month, Creative Commons uses [Common Crawl](http://commoncrawl.org/) data to find all domains that contain CC licensed content. As you might guess, the amount of data is very large, so the CC Catalog data is stored in [S3](http://commoncrawl.org/the-data/get-started/) buckets and [Apache Spark](https://spark.apache.org/) is used to extract the data from Common Crawl.
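For a sense of what that extraction step involves, here is a minimal sketch of a Spark job that reads parquet extracts from an S3 bucket and writes them back out as TSV; the bucket paths and options are placeholders I made up, not the real pipeline's locations.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cc-catalog-extract").getOrCreate()

# Placeholder bucket/path: the real extracts live in the CC Catalog buckets.
df = spark.read.parquet("s3a://example-bucket/cc-catalog/2019-06/")

# The downstream analysis works on TSV, so the extract is written out as
# tab-separated text with a header row.
df.write.option("sep", "\t").option("header", True).csv(
    "s3a://example-bucket/cc-catalog-tsv/2019-06/"
)
```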
2222

23-
Spark is used is used again in this project to extract the data, in the form of parquet files, from the buckets. In order to facilitate the analysis and processing of the data, the files are converted to TSV (tab-separated values). Each file can easily contain dozens of millions of rows. Our first aproach is to load the information in a Pandas Dataframe, but this can become very slow. Therefore, we will test the scripts for the data processing with a portion of the real data. Afterwards, I will use [Dask](https://dask.org/) with the entire dataset. Dask provides advanced parallelism for analytics, and has a behaviour similar to Pandas.
23+
Spark is used again in this project to extract the data, in the form of parquet files, from the buckets. In order to facilitate the analysis and processing of the data, the files are converted to TSV (tab-separated values). The dataset I work on contains the following fields:
24+
25+
- provider_domain: name of the domain with licensed content.
26+
- cc_license: path to the Creative Commons license deed used by the _provider\\_domain_.
27+
- images: number of images shown on the _provider\\_domain_ web page.
28+
- links: JSON field that contains a dictionary with domains as keys and the number of links as values. A link appears when _provider\\_domain_ has an href tag on its web page that points to the domain key.
29+
30+
31+
32+
Each file can easily contain dozens of millions of rows. My first approach is to load the information into a Pandas DataFrame, but this can become very slow. Therefore, I will test the data processing scripts with a portion of the real data. Afterwards, I will use [Dask](https://dask.org/) with the entire dataset. Dask provides advanced parallelism for analytics and has a behaviour similar to Pandas.
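As a rough sketch of that workflow (the file names, sample size, and blocksize below are assumptions for illustration, not the project's actual scripts):

```python
import pandas as pd
import dask.dataframe as dd

# Develop and test the processing logic on a small sample with Pandas.
sample = pd.read_csv("cc_catalog_sample.tsv", sep="\t", nrows=100_000)
print(sample.shape)

# Run the same logic over the full extract with Dask, which splits the TSV
# files into partitions and processes them in parallel, Pandas-style.
full = dd.read_csv("cc_catalog_part_*.tsv", sep="\t", blocksize="64MB")
print(full.npartitions)                              # partitions processed lazily
print(full["provider_domain"].nunique().compute())   # triggers the parallel run
```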
2433

2534
### Cleansing and filtering
2635

27-
This step is about preparing the data for analysis and reducing the amount of data, in order to get a meaningful visualization. The data that comes from the S3 buckets is actually pretty neat (no strange characters for example, or incomplete rows). Nevertheless as a first step, duplicate rows are deleted (given by duplicate URLs). Next I, develop pruning rules. We try to:
36+
This step is about preparing the data for analysis and reducing its size, in order to get a meaningful visualization. The data that comes from the S3 buckets is actually pretty clean (no strange characters, for example, and no incomplete rows). Nevertheless, as a first step, duplicate rows (given by duplicate URLs) are deleted. Next, I develop pruning rules (a rough sketch in code follows the list below). I try to:
2837
- exclude cycles (cyclic edges),
2938
- exclude lonely nodes,
3039
- avoid duplicates (for example, subdomains which are part of a single domain),
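A rough sketch of these rules in plain Pandas (the way the _links_ field is parsed here, and the helper name, are my own assumptions rather than the project's exact code):

```python
import json
import pandas as pd

def prune(df: pd.DataFrame) -> pd.DataFrame:
    # Avoid duplicates: drop rows identified by duplicate provider URLs.
    # (Merging subdomains into a single domain happens after the domain-name
    # extraction described in the next section.)
    df = df.drop_duplicates(subset="provider_domain")

    def drop_cycles(row):
        # Exclude cyclic edges: links from a domain back to itself.
        links = json.loads(row["links"])
        return {target: count for target, count in links.items()
                if target != row["provider_domain"]}

    df = df.assign(links=df.apply(drop_cycles, axis=1))

    # Exclude lonely nodes (a simple proxy): rows left with no outgoing links.
    return df[df["links"].map(len) > 0]
```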
@@ -33,17 +42,17 @@ This step is about preparing the data for analysis and reducing the amount of da
3342

3443
### Formatting Domain Names
3544

36-
Now in the dataset, we have domain names in the form of URLs. But we want to make the graph looks pretty well. This is why we are going to extract the domain name from the URLs we have in the dataset. For this purpose, we use [tldextract](https://github.com/john-kurkowski/tldextract), which is a simple and complete open source library for extracting the parts of the domains (say: suffix, subdomain, domain name). This package is available in conda-forge too. Here is how tldextract works:
45+
Now in the dataset, we have domain names in the form of URLs. But we want the graph to look clean, so I am going to extract the bare domain name from the URLs in the dataset. For this purpose, I use [tldextract](https://github.com/john-kurkowski/tldextract), a simple and complete open-source library for splitting a URL into its parts (suffix, subdomain, domain name). This package is also available on conda-forge. Here is how tldextract works:
3746

3847
```python
3948
>>> import tldextract
>>> ext = tldextract.extract('http://forums.bbc.co.uk')
4049
>>> (ext.subdomain, ext.domain, ext.suffix)
41-
('forums', 'bbc', 'co.uk') #we extract the domain name "bbc"
50+
('forums', 'bbc', 'co.uk') #extract the domain name "bbc"
4251
```
43-
The main part is the extraction of the domain name. This will be applied to all de domains and html metadata (the links between domains).
52+
The main part is extracting the domain name. This will be applied to the _provider\\_domain_ and _links_ fields in order to build the graph. The extracted domain names are the ones displayed on the nodes, as depicted in [my first blog post](https://creativecommons.github.io/blog/entries/cc-datacatalog-visualization/).
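Applied to the dataset, that extraction could look roughly like this (the row layout and helper function are illustrative; the field names come from the list above):

```python
import json
import tldextract

def domain_name(url):
    # Keep only the registered domain, e.g. 'http://forums.bbc.co.uk' -> 'bbc'
    return tldextract.extract(url).domain

# Illustrative row with the fields described earlier
row = {
    "provider_domain": "http://forums.bbc.co.uk",
    "links": '{"https://en.wikipedia.org": 3, "https://www.flickr.com": 1}',
}

source = domain_name(row["provider_domain"])
targets = {domain_name(url): count
           for url, count in json.loads(row["links"]).items()}

print(source, targets)  # bbc {'wikipedia': 3, 'flickr': 1}
```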
4453

4554
### License validation
46-
Another important aspect is the licenses types. In the dataset, we do not have the exact license name; rather, we have a URL that directs to the license definition on [creativecommons.org](creativecommons.org). We have developed a function with some regular expressions to validate the correct format of these URls, and extracts from them the license name and version. This information will be shown in the pie chart that appears after the user clicks on a node.
55+
Another important aspect is the license types. In the dataset, we do not have the exact license name; rather, we have a URL that directs to the license definition on [creativecommons.org](https://creativecommons.org). We have developed a [function](https://github.com/creativecommons/cccatalog/blob/master/src/providers/api/modules/etlMods.py#L75) with some regular expressions that validates the format of these URLs and extracts from them the license name and version. This information will be shown in the pie chart that appears after the user clicks on a node.
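The real implementation is the linked function in `etlMods.py`; the snippet below is only a simplified sketch of the idea, with a regular expression I wrote for illustration:

```python
import re

# Simplified pattern: capture the license name and version from a CC deed URL.
LICENSE_PATTERN = re.compile(
    r"https?://creativecommons\.org/(?:licenses|publicdomain)/"
    r"(?P<name>[a-z\-]+)/(?P<version>\d+\.\d+)/?"
)

def parse_license(url):
    match = LICENSE_PATTERN.match(url)
    if match is None:
        return None  # not recognised as a CC license URL
    return match.group("name"), match.group("version")

print(parse_license("https://creativecommons.org/licenses/by/4.0/"))  # ('by', '4.0')
```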
4756

4857
```
4958
'https://creativecommons.org/licenses/by/4.0/' #valid license
