
Commit 68beab9

Merge pull request creativecommons#471 from subhamX/dataviz-blog
Add blog on the linked commons data update
2 parents 85bab7a + 7c6df9a commit 68beab9

File tree

4 files changed: +92 −3 lines changed

content/blog/entries/linked-commons-autocomplete-feature/contents.lr

Lines changed: 1 addition & 3 deletions
@@ -32,7 +32,7 @@ The following blog intends to explain the very recent feature integrated to the
 ## Motivation
 One of the newest features integrated last month into Linked Commons is Filtering by node name. Here a user can search for their favourite node and explore all its neighbours. Since the list is very big, it was self-evident for us to have a text box (and not a drop-down) where the user is supposed to type the node name.

-Some of the reasons why to have a text box or filtering by node option.
+Some of the reasons why to have an "autocomplete" feature in the filtering by node name:
 - Some of the node names are very uncommon and lengthy. There is a high probability of misspelling it.
 - Submitting the form and getting a response of “Node doesn’t exist” isn’t a very good user flow, and we want to minimise such incidents.

@@ -73,8 +73,6 @@ Here are some aggregated result statistics.
 | Max Requests/s | **214** |
 | Failures/s | **0** |

-Since SQLite has a serverless design, disk I/O usually has a significant impact on performance. The above results were executed on a server with HDD storage. The Linked Commons server is equipped with faster disk I/O. It will certainly improve the performance, but will be countered by network latency and other factors like the number of nodes in the DB. So the above results to some degree resemble the actual performance.

 ## Next steps
 In the next blog, we will be covering the long-awaited data update and the new architecture.
Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
title: Linked Commons: Data Update
---
categories:
announcements
cc-catalog
product
gsoc
gsoc-2020
open-source
cc-dataviz
---
author: subhamX
---
series: gsoc-2020-dataviz
---
pub_date: 2020-08-25
---
body:

In this blog, I will be explaining the task we have been working on for the last 3-4 weeks. It will take you on a journey of optimizations, from a million graph traversals in building the database down to just a few traversals in the end. We will also cover the new architecture for the upcoming version of the Linked Commons and the reason behind the change.

## Where does it fit?

So far, the Linked Commons has been using a tiny subset of the data available in the CC Catalog. One of the primary targets of our team was to update the data. If you observe closely, all the tasks so far, from adding "Graph Filtering Methods" to the "Autocomplete Feature", were actually bringing us closer to this task, i.e. the much-awaited **"Scale the Data of Linked Commons"**. We aim to go from around **400 nodes and 500 links** in the current version to around **235k nodes and 4.14 million links**. This drastic addition of new data is one of its kind, which makes this task very challenging and exciting.

## Pilot

The raw CC Catalog data cannot be used directly in the Linked Commons. Our first task involves processing it, which includes removing isolated nodes, etc. You can read more about it in the data processing series [blog](https://opensource.creativecommons.org/blog/entries/cc-datacatalog-data-processing/) written by my mentor Maria. After this, we need to build a database which stores the **"distance list"** of every node.

### What is "distance list"?

<div style="text-align: center; width: 90%; margin-left: 5%;">
    <figure>
        <img src="distance-list.png" alt="Distance List" style="border: 1px solid black">
        <figcaption>Distance list representation* of the node 'icij', part of a hypothetical graph</figcaption>
    </figure>
</div>

***

**Distance List** is a method of graph representation. It is similar to the [Adjacency List](https://en.wikipedia.org/wiki/Adjacency_list) representation of graphs, but instead of storing data only for the immediate neighbouring nodes, a "distance list" groups all vertices based on their distance from the root node and stores this grouped data for every vertex in the graph. In short, a "distance list" is a more general form of the Adjacency List representation.
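
For instance, the 'icij' node of a hypothetical graph like the one pictured above could be stored as follows (node names and distances here are illustrative only; the real Linked Commons uses a more complex schema, as noted at the end of this post):

```python
# Hypothetical "distance list" for the node 'icij': all vertices,
# grouped by their distance from 'icij'. Illustrative names only.
icij_distance_list = {
    1: ["bbc", "nytimes"],  # immediate neighbours (the adjacency list)
    2: ["wikipedia"],       # nodes exactly two hops away
    3: ["okfn"],            # and so on, one bucket per distance
}
```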

To build this "distance list", we created a script, let’s name it **build-dB-script.py**, which runs the [Breadth-First Search (BFS)](https://en.wikipedia.org/wiki/Breadth-first_search) algorithm from every node to traverse the graph and gradually build the distance lists. The node filtering feature of our web page connects to the server, which uses the aforementioned database and serves a smaller chunk of data.
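
A minimal sketch of this brute-force approach, assuming the same toy graph as above (the actual script runs on the processed CC Catalog data):

```python
from collections import deque

# A hypothetical toy graph as an adjacency list; illustrative only.
graph = {
    "icij": ["bbc", "nytimes"],
    "bbc": ["wikipedia"],
    "nytimes": ["wikipedia", "icij"],
    "wikipedia": [],
}

def distance_list(graph, root):
    """Group every reachable node by its BFS distance from `root`."""
    distances = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    # Invert {node: distance} into {distance: [nodes]}.
    grouped = {}
    for node, d in distances.items():
        if d > 0:
            grouped.setdefault(d, []).append(node)
    return grouped

# One full BFS per node: the approach the first build-dB-script took.
db = {node: distance_list(graph, node) for node in graph}
print(db["icij"])  # {1: ['bbc', 'nytimes'], 2: ['wikipedia']}
```
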
## Problem

Now that we know where the *build-dB-script* is used, let’s discuss the problems with it. The new graph data we are going to use is enormous, with nodes and links in the millions. A full traversal of a graph with a million nodes, performed a million times, is very slow. Just to give some helpful numbers, the script was taking around 10 minutes to process a hundred nodes; at that rate, 235k nodes means roughly 2,350 such batches. Assuming the growth is linear (the best case), it would take more than **15 days** to complete the computations. **It is scary, and thus optimizations in the *build-dB-script* are the need of the hour!!**

## Optimizations

In this section, we will talk about the different versions of the build database script, starting from the brute-force BFS method.

The brute-force BFS was the simplest and technically correct solution, but, as the name suggests, it was slow. In the next iteration, I stored the details of the last n nodes (10, to be precise) and performed the same old BFS. It was faster, but it had a logic error: say there is a link from a node to an already visited/traversed node; the script was not adding all the nodes which could have been explored along this path. After a few more leaps, from Depth-First Search to Breadth-First Search and other methods, eventually, with the help of my mentors, we built a new approach: the **"Sequential dB Build"**.

To keep this blog short, I won’t go too deep into implementation details, but here are some of the critical points.

### Key points of the Sequential dB Build:

- It was the fastest of all its predecessors and reduced the script timing significantly.
- In this approach, we aimed to build all the distance lists for distances [1, 2, 3, ..., k-1] before building the k-th distance list, as sketched below.
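
A rough sketch of that idea, assuming the distance lists are stored as `{root: {distance: [nodes]}}` as in the earlier example (this is my reconstruction, not the exact production code): building layer k only has to expand each root's distance-(k-1) layer, instead of re-running a full BFS from every node.

```python
# Sketch of one "sequential" step: derive every node's distance-k list
# from the already-built layers 1..k-1. Assumes dist_lists[root][1]
# (the plain adjacency list) has been filled in beforehand.
def build_layer(graph, dist_lists, k):
    for root in graph:
        # Everything already placed at distance < k must be excluded.
        seen = {root}
        for d in range(1, k):
            seen.update(dist_lists[root].get(d, []))
        # Nodes at distance k are unseen neighbours of the (k-1) layer.
        layer = set()
        for node in dist_lists[root].get(k - 1, []):
            layer.update(n for n in graph[node] if n not in seen)
        if layer:
            dist_lists[root][k] = sorted(layer)

# Layers must be built in increasing order: 2, then 3, and so on.
# for k in range(2, max_distance + 1):
#     build_layer(graph, dist_lists, k)
```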

Unfortunately, even this was not enough for our current requirements. Just to give you some insights, the distance-two list computation was taking around **4 hours**, and the **distance-three list** computation was taking **20+ hours**. It shows that all these optimizations were still incapable of handling this big dataset.

## New Architecture

As the optimizations in "build-dB-script" weren’t enough, we started looking at simplifying the current architecture. In the end, we want to have a viable product which scales to this massive data. We are not dropping multi-distance filtering for good, though; we will continue our research on it and hopefully will have it in **Linked Commons 3.0**. 😎

For any node, it is most likely that a person would wish to know the immediate neighbours linking to and from it. Nodes at a distance greater than one exhibit much less information about the reach and connectivity of the root node. Because of this, we decided to change our logic of storing distance lists up to distance 10; instead, we reduced it to distance 1 and also stored the list of immediate incoming nodes (nodes which are at distance 1 in the [transpose graph](https://en.wikipedia.org/wiki/Transpose_graph)).
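
A minimal sketch of this simplified build (the field names are my own, not the actual schema): both the outgoing and the incoming lists fall out of a single pass over the edges, which is why the build no longer needs any graph traversal at all.

```python
from collections import defaultdict

# Sketch: one record per node, holding only the distance-1 neighbours
# in the graph ("outgoing") and in its transpose ("incoming").
def build_neighbour_db(graph):
    incoming = defaultdict(list)
    for node, neighbours in graph.items():
        for neighbour in neighbours:
            incoming[neighbour].append(node)  # reverse every edge once
    return {
        node: {"outgoing": list(graph[node]), "incoming": incoming[node]}
        for node in graph
    }
```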

This small change in the design simplified a lot of things, and the new graph build now takes around 2 minutes. By the time I am writing this blog, we have upgraded our database from **shelve to MongoDB**, which reduces the build time even further. 🔥🔥
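
For illustration, storing and serving such records through MongoDB could look roughly like this (the database, collection, and field names here are hypothetical, not the actual Linked Commons configuration):

```python
from pymongo import MongoClient

# Hypothetical connection and collection names, for illustration only.
client = MongoClient("mongodb://localhost:27017/")
nodes = client["linked_commons"]["nodes"]

def save(neighbour_db):
    # One document per node, keyed by node name.
    nodes.insert_many(
        {"_id": name, **record} for name, record in neighbour_db.items()
    )

def lookup(name):
    # Serving a filter request becomes a single indexed lookup.
    return nodes.find_one({"_id": name})
```
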

<div style="text-align: center; width: 90%; margin-left: 5%;">
    <figure>
        <img src="graph.png" alt="Light Theme" style="border: 1px solid black">
        <figcaption>Graph showing neighbouring nodes. Incoming links are coloured Turquoise and outgoing links are coloured Red.</figcaption>
    </figure>
</div>

## Conclusion

This task was really challenging, and I learnt a lot. It was really mesmerizing to see the **Linked Commons grow and evolve**. I hope you enjoyed reading this blog. You can follow the project development [here](https://github.com/creativecommons/cccatalog-dataviz/), and access the stable version of the Linked Commons [here](http://dataviz.creativecommons.engineering/).

Feel free to report bugs and suggest features; it will help us improve this project. If you wish to join our team, consider joining our [Slack](https://creativecommons.slack.com/channels/cc-dev-cc-catalog-viz) channel. Read more about our community teams [here](https://opensource.creativecommons.org/community/). See you in my next blog! 🚀

___
**Linked Commons uses a more complex schema. The picture is just for illustration.*