Commit f413eb0

Merge pull request creativecommons#113 from soccerdroid/master
CC catalog data visualization-data processing part 3
2 parents 2272018 + 289bb97 commit f413eb0

File tree

3 files changed (+68, -0 lines)

title: Visualize CC Catalog data - data processing part 3
---
categories:
announcements
cc-catalog
product
gsoc
gsoc-2019
open-source
---
author: soccerdroid
---
pub_date: 2019-08-12
---
body:

This is a continuation of my previous blog post about part 2 of the data processing for the CC-data catalog visualization project. I recommend reading that [last post](https://opensource.creativecommons.org/blog/entries/cc-datacatalog-data-processing-2/) for a better understanding of what I'll explain here.

Hello! In this post I am going to talk to you about the extraction of unique nodes and links, and about the visualization of the force-directed graph with the processed data.

### Nodes and links generation

The nodes and links will be visualized using [force-graph](https://github.com/vasturiano/force-graph). I spoke about this library in my blog post [Visualize CC Catalog data](https://opensource.creativecommons.org/blog/entries/cc-datacatalog-visualization/). My first step is to use the data to generate the JSON file that force-graph needs: a single JSON file with two lists, one with information about the nodes and the other with the links. Both are arrays of dictionaries. I have huge input files (over 100 million unique domains in total), so in Pandas I need to build the DataFrame from a TSV input file in chunks. The challenge I am facing now is to generate a list of unique nodes. Here is why this is a challenge (see the sketch after this list):

- In order to build the nodes list, I need to think about what the source and target nodes are.
- Take into account that a source node can also be a target node.
- I can delete duplicate entries per column, but as I process the data in chunks, my scope is limited to the chunk size.
- A domain can be repeated not only within a chunk, but in different chunks too.
- Source and target nodes must both have licensed content.
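
To make that target format concrete, here is a minimal sketch of how the unique domains could be gathered across chunks and written out in the shape force-graph expects. The file name, chunk size, and column name are placeholders of mine, not the project's actual code:

```python
import json

import pandas as pd

# Accumulate unique domains across ALL chunks; a plain set keeps
# membership checks cheap even with millions of domains.
unique_domains = set()

# Hypothetical input file and chunk size; the project actually splits
# the data into several smaller TSV files.
for chunk in pd.read_csv("provider_data.tsv", sep="\t", chunksize=100_000):
    unique_domains.update(chunk["provider_domain"].dropna().unique())

# force-graph expects a single JSON object with "nodes" and "links",
# both arrays of dictionaries. Target domains extracted later from the
# links column would be added to the same set before this step.
graph = {
    "nodes": [{"id": domain} for domain in unique_domains],
    "links": [],  # filled in later from the links column
}

with open("fdg_input_file.json", "w") as f:
    json.dump(graph, f)
```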

So as you can see, dealing with duplicates is not trivial when you have a lot of data. What I tried next was to analyze smaller files, so that I could keep the data in memory in a single DataFrame. For each TSV file I had before, I now have several small TSV files. This may lengthen the data analysis, but it smooths out the coding complexity.

I started by formatting the data into source and target columns, in order to generate the unique nodes for the graph. I iterate through each row of the current DataFrame (the one with the provider_domain, cc_licences, links columns, etc.), and I load the JSON in each row's _links_ column. For each key in that JSON, I create a new row with provider_domain as source, the key as target, and the value of the key as a _value_ feature, and I append that new row to a new DataFrame. I build a new row each time I read a line, so I get a DataFrame with all the links of a single provider_domain. When I finish iterating over the rows, I convert the DataFrames to lists and save the output. That is how I get a new DataFrame containing all the existing links of the graph, with source, target and value columns. Yeah! A rough sketch of this step is shown below.
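
Here is that sketch, assuming the _links_ column stores a JSON string that maps each target domain to its link count (file and column names are again placeholders of mine):

```python
import json

import pandas as pd

# One of the small TSV files described above, with provider_domain,
# cc_licences, links, etc. columns.
df = pd.read_csv("provider_data_small.tsv", sep="\t")

rows = []
for _, row in df.iterrows():
    # Each links cell holds a JSON object such as
    # {"targetdomain.com": 3, "otherdomain.org": 1}.
    for target, value in json.loads(row["links"]).items():
        rows.append({
            "source": row["provider_domain"],
            "target": target,
            "value": value,
        })

# All the existing links of the graph, with source, target
# and value columns.
links_df = pd.DataFrame(rows)
```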

The visualization I get is the following:
<div>
<img src="graph.png" alt="Force-directed graph with the real data"/><br/>
<small class="muted">Force-directed graph with the real data.</small>
</div>
<br>
<div>
<img src="graph-2.png" alt="Force-directed graph with the real data"/><br/>
<small class="muted">Force-directed graph with the real data.</small>
</div>
<br>

### Coming soon

As you can see above, there are a lot of lonely nodes (nodes with no neighbors). My plan is to filter out the provider_domains that do not have a minimum quantity of licensed content. I will try different threshold values, from 100 to 1000, and see how the graph changes. I believe a lot of those nodes do not have a relevant amount of licensed content, so this filter rule will remove them. A sketch of such a filter is shown below.
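
As a sketch, assuming each node row carries a licensed_content count (the real column name may differ), the rule would be a one-line filter:

```python
import pandas as pd

# Hypothetical nodes table with per-domain licensed-content counts.
nodes_df = pd.DataFrame({
    "id": ["a.com", "b.org", "c.net"],
    "licensed_content": [3, 250, 1200],
})

# Keep only provider_domains with at least `threshold` licensed works;
# thresholds from 100 to 1000 can then be compared visually.
threshold = 100
filtered = nodes_df[nodes_df["licensed_content"] >= threshold]
```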

The other tasks left to do are:

- Visualization of the pie chart
- Development or modification of pruning/filtering rules.

You can follow the project development in the [GitHub repo](https://github.com/creativecommons/cccatalog-dataviz).

CC Data Catalog Visualization is my GSoC 2019 project, under the guidance of [Sophine Clachar](https://creativecommons.org/author/sclachar/), who has been greatly helpful and considerate since the GSoC application period. Also, my backup mentor, [Breno Ferreira](https://creativecommons.org/author/brenoferreira/), and engineering director [Kriti Godey](https://creativecommons.org/author/kriticreativecommons-org/), have been very supportive.

Have a nice week!

Maria