Commit eba9d57: blog post creativecommons#2 cc catalog data visualization
title: Visualize CC Catalog data
---
categories:
announcements
cc-catalog
product
gsoc
gsoc-2019
open-source
---
author: soccerdroid
---
pub_date: 2019-07-10
---
body:

Welcome to the data processing part of the GSoC project! In this blog post, I am going to tell you about my first thoughts on the real data, and give you some details of the implementation developed so far.

### Data extraction

Each month, Creative Commons uses [Common Crawl](http://commoncrawl.org/) to parse data from across the web, looking for domains that contain licensed content. As you might guess, the amount of data is very large, so the CC Catalog data is stored in S3 buckets.
Spark is used to extract the data from the buckets in the form of Parquet files. To facilitate analysis and processing, the files are converted to TSV (tab-separated values). Each file can easily contain tens of millions of rows. Our first approach is to load the information into a pandas DataFrame, but this can become very slow. Therefore, we will test the data processing scripts on a portion of the real data, and afterwards run them over the entire dataset with [Dask](https://dask.org/), which provides advanced parallelism for analytics and behaves much like pandas.
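To give an idea of what prototyping on a portion of the data looks like, here is a minimal sketch that streams a TSV in chunks with pandas. The column names and file contents are hypothetical (the real extracts have more columns); Dask offers a similar `read_csv` interface that scales the same per-chunk logic to the full dataset.

```python
import io

import pandas as pd

# Hypothetical two-column slice of one TSV extract.
tsv = (
    "provider_domain\tlicense_url\n"
    "flickr.com\thttps://creativecommons.org/licenses/by/4.0/\n"
    "wikipedia.org\thttps://creativecommons.org/licenses/by-sa/3.0/\n"
)

# chunksize streams the file in slices, so the processing logic
# can be developed and tested on a portion of the data first.
domains = []
for chunk in pd.read_csv(io.StringIO(tsv), sep="\t", chunksize=1):
    domains.extend(chunk["provider_domain"].tolist())

print(domains)  # ['flickr.com', 'wikipedia.org']
```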

### Cleansing and filtering

This step is about cleaning and reducing the data as much as possible, in order to get a meaningful visualization. The data that comes from the S3 buckets is actually pretty clean (no strange characters or incomplete rows, for example). Nevertheless, as a first step, duplicate rows (given by duplicate URLs) are deleted. Here, we are developing pruning rules. We try to:
- exclude cycles (cyclic edges),
- exclude lonely (isolated) nodes,
- avoid duplicates (for example, subdomains that are part of a single domain),

and so on.
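Rules like these can be sketched over a toy edge list with pandas. This is only an illustration under assumed column names, not the project's actual pruning code: it drops self-loops (a node linking to itself) and collapses reciprocal duplicates such as `a.com -> b.com` and `b.com -> a.com`.

```python
import pandas as pd

# Toy edge list; the real data would come from the TSV extracts.
edges = pd.DataFrame({
    "source": ["a.com", "b.com", "c.com", "d.com"],
    "target": ["b.com", "a.com", "c.com", "e.com"],
})

# Exclude self-loops (cyclic edges from a node to itself).
edges = edges[edges["source"] != edges["target"]]

# Collapse reciprocal duplicates by keying each edge on its sorted endpoints.
key = edges.apply(lambda r: tuple(sorted((r["source"], r["target"]))), axis=1)
edges = edges[~key.duplicated()]

print(edges)  # keeps only a.com -> b.com and d.com -> e.com
```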

### Formats

Right now, the dataset contains domain names in the form of URLs. But we want the graph to look good, so we are going to extract the domain name from the URLs in the dataset. For this purpose, we use [tldextract](https://github.com/john-kurkowski/tldextract), a simple and complete open-source library for extracting the parts of a domain (suffix, subdomain, and domain name). The package is also available on conda-forge. Here is how tldextract works:
```python
>>> import tldextract
>>> ext = tldextract.extract('http://forums.bbc.co.uk')
>>> (ext.subdomain, ext.domain, ext.suffix)
('forums', 'bbc', 'co.uk')  # we extract the domain name "bbc"
```
Another important aspect is the license types. In the dataset, we do not have the exact license name; rather, we have a URL that points to the license definition on [creativecommons.org](https://creativecommons.org). We have developed a function with some regular expressions that validates the format of these URLs and extracts from them the license name and version. This information will be shown in the pie chart that appears when the user clicks on a node.
```
'https://creativecommons.org/licenses/by/4.0/'  # valid license
'https://creativecommons.org/licenses/zero/'    # invalid license
```
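The actual validation function is not shown in the post, but a minimal sketch of the idea might look like the following. The pattern and helper name are hypothetical: a license URL is accepted only when it carries both a license name and a version.

```python
import re

# Hypothetical pattern: matches e.g.
# https://creativecommons.org/licenses/by/4.0/
LICENSE_RE = re.compile(
    r"^https?://creativecommons\.org/licenses/"
    r"(?P<name>[a-z\-]+)/(?P<version>\d+\.\d+)/?$"
)

def parse_license(url):
    """Return (name, version) for a well-formed license URL, else None."""
    m = LICENSE_RE.match(url)
    return (m.group("name"), m.group("version")) if m else None

print(parse_license("https://creativecommons.org/licenses/by/4.0/"))  # ('by', '4.0')
print(parse_license("https://creativecommons.org/licenses/zero/"))    # None
```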

### Coming soon

- Data aggregation
- Visualization with the data, plus refining the pruning/filtering rules
You can follow the project development in the [GitHub repo](https://github.com/creativecommons/cccatalog-dataviz).
CC Data Catalog Visualization is my GSoC 2019 project, developed under the guidance of [Sophine Clachar](https://creativecommons.org/author/sclachar/), who has been greatly helpful and considerate since the GSoC application period. My backup mentor, [Breno Ferreira](https://creativecommons.org/author/brenoferreira/), and engineering director [Kriti Godey](https://creativecommons.org/author/kriticreativecommons-org/) have also been very supportive.
Have a nice week!
Maria
