Commit 4d4f70f

Merge branch 'master' into srinidhi_gsoc_final_blog
2 parents 16f4bcb + 68beab9

6 files changed: +180 −4 lines

content/blog/entries/linked-commons-autocomplete-feature/contents.lr

Lines changed: 1 addition & 3 deletions
@@ -32,7 +32,7 @@ The following blog intends to explain the very recent feature integrated to the
  ## Motivation

  One of the newest features integrated last month into the Linked Commons is filtering by node name. Here a user can search for his/her favourite node and explore all its neighbours. Since the list is very big, it was self-evident for us to have a text box (and not a drop-down) where the user is supposed to type the node name.

- Some of the reasons why to have a text box or filtering by node option.
+ Some of the reasons to have an "autocomplete" feature in the filtering-by-node-name option:

  - Some of the node names are very uncommon and lengthy. There is a high probability of misspelling them.
  - Submitting the form and getting a response of “Node doesn’t exist” isn’t a very good user flow, and we want to minimise such incidents.

@@ -73,8 +73,6 @@ Here are some aggregated result statistics.
  | Max Requests/s | **214** |
  | Failures/s     | **0**   |

- Since SQLite has a serverless design, disk I/O usually has a significant impact on performance. The above results were obtained on a server with HDD storage. The Linked Commons server is equipped with faster disk I/O; that will certainly improve performance, but the gain will be countered by network latency and other factors such as the number of nodes in the DB. So the above results resemble the actual performance to some degree.

  ## Next steps

  In the next blog, we will be covering the long-awaited data update and the new architecture.
Lines changed: 91 additions & 0 deletions

@@ -0,0 +1,91 @@
title: Linked Commons: Data Update
---
categories:
announcements
cc-catalog
product
gsoc
gsoc-2020
open-source
cc-dataviz
---
author: subhamX
---
series: gsoc-2020-dataviz
---
pub_date: 2020-08-25
---
body:
In this blog, I will be explaining the task we were working on for the last 3-4 weeks. It will take you on a journey of optimizations, from millions of graph traversals while building the database down to just a few traversals in the end. We will also cover the new architecture for the upcoming version of the Linked Commons and the reasons behind the change.

## Where does it fit?

So far the Linked Commons has been using a tiny subset of the data available in the CC Catalog. One of the primary targets of our team was to update the data. If you observe closely, all tasks so far, starting from adding "Graph Filtering Methods" to the "Autocomplete Feature", were actually bringing us closer to this task, i.e. the much-awaited **"Scale the Data of Linked Commons"**. We aim to grow the Linked Commons project from around **400 nodes and 500 links** in the current version to around **235k nodes and 4.14 million links**. This drastic addition of new data is one of its kind, which makes this task very challenging and exciting.

## Pilot

The raw CC Catalog data cannot be used directly in the Linked Commons. Our first task involves processing it, which includes removing isolated nodes, etc. You can read more about it in the data processing series [blog](https://opensource.creativecommons.org/blog/entries/cc-datacatalog-data-processing/) written by my mentor Maria. After this, we need to build a database which stores the **"distance list"** of all the nodes.
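Maria's blog covers the processing pipeline in detail; as a tiny, hypothetical illustration of just the isolated-node step (the function name and data shapes below are assumptions made for this sketch, not the actual pipeline code), nodes that appear in no link can be filtered out like this:

```python
def drop_isolated_nodes(nodes, links):
    """Keep only nodes that appear as the source or target of at least one link.

    `nodes` is an iterable of node ids and `links` an iterable of
    (source, target) pairs; both shapes are assumptions for this sketch.
    """
    connected = {s for s, _ in links} | {t for _, t in links}
    return [node for node in nodes if node in connected]


# Hypothetical toy data: 'lonely-node' has no links, so it is dropped.
nodes = ["icij", "nodeA", "lonely-node"]
links = [("icij", "nodeA")]
print(drop_isolated_nodes(nodes, links))   # ['icij', 'nodeA']
```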
### What is a "distance list"?

<div style="text-align: center; width: 90%; margin-left: 5%;">
<figure>
<img src="distance-list.png" alt="Distance List" style="border: 1px solid black">
<figcaption>Distance list representation* of the node 'icij', part of a hypothetical graph</figcaption>
</figure>
</div>

***

A **distance list** is a method of graph representation. It is similar to the [Adjacency List](https://en.wikipedia.org/wiki/Adjacency_list) representation of graphs, but instead of storing data about just the immediate neighbouring nodes, the distance list groups all vertices by their distance from the root node and stores this grouped data for every vertex in the graph. In short, the distance list is a more general form of the adjacency list representation.

To build this distance list, we created a script, let's call it **build-dB-script.py**, which runs the [Breadth-First Search (BFS)](https://en.wikipedia.org/wiki/Breadth-first_search) algorithm from every node to traverse the graph and gradually build the distance list. The node-filtering feature of our web page connects to the server, which uses the aforementioned database and serves a smaller chunk of data.
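To make the idea concrete, here is a minimal sketch of how the distance list for a single root can be computed with BFS over a plain adjacency-list dictionary. This is not the actual *build-dB-script.py*; the toy graph and node names are hypothetical.

```python
from collections import deque


def distance_list(graph, root):
    """Group every node reachable from `root` by its BFS distance.

    `graph` is a plain adjacency list: {node: [neighbour, ...]}.
    Returns {distance: [nodes at that distance], ...}.
    """
    distances = {root: 0}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for neighbour in graph.get(current, []):
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)

    grouped = {}
    for node, dist in distances.items():
        if dist > 0:                          # skip the root itself
            grouped.setdefault(dist, []).append(node)
    return grouped


# Hypothetical graph, loosely mirroring the 'icij' illustration above.
graph = {
    "icij": ["nodeA", "nodeB"],
    "nodeA": ["nodeC"],
    "nodeB": ["nodeC", "nodeD"],
}
print(distance_list(graph, "icij"))
# {1: ['nodeA', 'nodeB'], 2: ['nodeC', 'nodeD']}
```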
## Problem

Now that we know where the *build-dB-script* is used, let's discuss the problems with it. The new graph data we are going to use is enormous, with nodes in the millions. A full traversal of a graph with a million nodes, a million times over, is very slow. Just to give some helpful numbers, the script was taking around 10 minutes to process a hundred nodes. Assuming the growth is linear (in the best case), processing roughly 235k nodes would take about 23,500 minutes, i.e. more than **15 days** of computation. **That is scary, and thus optimizations in the *build-dB-script* are the need of the hour!!**

## Optimizations

In this section, we will talk about the different versions of the build database script, starting from the brute force BFS method.

The brute force BFS was the simplest and technically correct solution, but as the name suggests it was slow. In the next iteration, I stored the details of the last n nodes (10, to be precise) and performed the same old BFS. It was faster, but it had a logic error: say there is a link from a node to an already visited/traversed node; the script was not adding all the nodes which could have been explored from that path. After a few more leaps, from Depth-First Search to Breadth-First Search and other methods, eventually, with the help of my mentors, we built a new approach - **"Sequential dB Build"**.

To keep this blog short, I won't be going too much into implementation details, but here are some of the critical points.

### Key points of the Sequential dB Build:

- It was the fastest of all its predecessors and reduced the script timing significantly.
- In this approach, we aimed to build all the distance lists for distances [1, 2, 3, ..., k-1] before building the k-th distance list (see the sketch below).
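The actual implementation is not reproduced in this post, but a minimal sketch of the level-by-level idea, assuming each level k is derived only from the level k−1 frontier, might look roughly like this (the function names and toy graph are made up for the example):

```python
def next_level(graph, seen, frontier):
    """Nodes at distance exactly k, given everything already placed at
    distance <= k-1 (`seen`) and the distance k-1 nodes (`frontier`)."""
    level = set()
    for node in frontier:
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                level.add(neighbour)
    return level


def sequential_build(graph, max_distance):
    """Build distance lists 1..max_distance for every root, one level at a time."""
    db = {root: {} for root in graph}
    state = {root: ({root}, {root}) for root in graph}   # (seen, frontier) per root
    for k in range(1, max_distance + 1):                 # finish level k for all roots
        for root in graph:                               # before starting level k+1
            seen, frontier = state[root]
            level = next_level(graph, seen, frontier)
            db[root][k] = sorted(level)
            state[root] = (seen | level, level)
    return db


graph = {"icij": ["nodeA"], "nodeA": ["nodeB"], "nodeB": []}
print(sequential_build(graph, max_distance=2)["icij"])   # {1: ['nodeA'], 2: ['nodeB']}
```

The point of ordering the work this way is that the level k−1 results are already in hand when level k is computed, so no root ever needs a full re-traversal of the graph.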
Unfortunately, it was still not enough for our current requirements. Just to give you some insight, the distance-two list computation was taking around **4 hours**, and the **distance-three list** computation was taking **20+ hours**. It shows that all these optimizations were not enough and were incapable of handling this big dataset.

## New Architecture

As the optimizations in "build-dB-scripts" weren't enough, we started looking to simplify the current architecture. In the end, we want to have a viable product which is scalable to this massive data. Although we are still not dropping the multi-distance filtering, we will continue our research on it and hopefully will have it in **Linked Commons 3.0**. 😎

For any node, it is most likely that a person would wish to know the immediate neighbours linking to it. Nodes at a distance greater than one exhibit very little information on the reach and connectivity of the root node. Because of this, we decided to change our current logic of having the distance list up to 10; instead, we reduced it to 1 and also stored the list of immediate incoming nodes (nodes which are at distance 1 in the [transpose graph](https://en.wikipedia.org/wiki/Transpose_graph)).

This small change in the design simplified a lot of things, and the new graph build now takes around 2 minutes. By the time I am writing this blog, we have upgraded our database from **shelve to MongoDB**, where the build time is further reduced. 🔥🔥

<div style="text-align: center; width: 90%; margin-left: 5%;">
<figure>
<img src="graph.png" alt="Light Theme" style="border: 1px solid black">
<figcaption>Graph showing neighbouring nodes. Incoming links are coloured turquoise and outgoing links are coloured red.</figcaption>
</figure>
</div>
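As a rough illustration of what the simplified build boils down to (the real Linked Commons schema is more complex, as the footnote at the end of this post notes, and the database and collection names below are invented for the example), the distance-1 outgoing and incoming lists can be produced in one pass over the edge list and written to MongoDB:

```python
from collections import defaultdict

from pymongo import MongoClient   # assumes a MongoDB instance is running locally


def build_neighbour_lists(edges):
    """One pass over (source, target) edges yields, for every node, its
    distance-1 outgoing list and its incoming list (i.e. its distance-1
    list in the transpose graph)."""
    outgoing, incoming = defaultdict(list), defaultdict(list)
    for source, target in edges:
        outgoing[source].append(target)
        incoming[target].append(source)
    nodes = set(outgoing) | set(incoming)
    return [
        {"node": node, "outgoing": outgoing[node], "incoming": incoming[node]}
        for node in nodes
    ]


edges = [("icij", "nodeA"), ("nodeB", "icij")]            # hypothetical edges
docs = build_neighbour_lists(edges)

client = MongoClient("mongodb://localhost:27017")
client["linked_commons"]["neighbours"].insert_many(docs)  # invented db/collection names
```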
## Conclusion

This task was really challenging, and I learnt a lot. It was really mesmerizing to see the **Linked Commons grow and evolve**. I hope you enjoyed reading this blog. You can follow the project development [here](https://github.com/creativecommons/cccatalog-dataviz/), and access the stable version of the Linked Commons [here](http://dataviz.creativecommons.engineering/).

Feel free to report bugs and suggest features; it will help us improve this project. If you wish to join our team, consider joining our [slack](https://creativecommons.slack.com/channels/cc-dev-cc-catalog-viz) channel. Read more about our community teams [here](https://opensource.creativecommons.org/community/). See you in my next blog! 🚀
___

**The Linked Commons uses a more complex schema. The picture is just for illustration.*
Lines changed: 88 additions & 0 deletions

@@ -0,0 +1,88 @@
title: Overview of the GSoC 2020 Project
---
categories:

cc-catalog
gsoc
gsoc-2020
---
author: charini
---
series: gsoc-2020-cccatalog
---
pub_date: 2020-08-26
---
body:
This is my final blog post under the [GSoC 2020: CC catalog][cc_catalog_series] series, where I will highlight and summarize my contributions to Creative Commons (CC) as part of my GSoC project. The CC Catalog project collects and stores CC licensed images scattered across the internet, such that they can be made accessible to the general public via the [CC Search][cc_search] and [CC Catalog API][cc_api] tools. I got the opportunity to work on different aspects of the CC Catalog repository which ultimately enhance the user experience of the CC Search and CC Catalog API tools. My primary contributions in the duration of GSoC, and the related pull requests (PR), are as follows.
1. **Sub-provider retrieval**: The first task I completed as part of my GSoC project was the retrieval of sub-providers (also known as _source_), such that images could be categorised under these sources, ensuring an enhanced search experience for the users. I completed the implementation of sub-provider retrieval for three providers: Flickr, Europeana, and Smithsonian. If you are interested in learning how the retrieval logic works, please check my [initial blog post][flickr_blog_post] of this series. The PRs related to this task are as follows.
   - PR #[420][pr_420]: Retrieve sub-providers within Flickr
   - PR #[442][pr_442]: Retrieve sub-providers within Europeana
   - PR #[455][pr_455]: Retrieve sub-providers within Smithsonian
   - PR #[461][pr_461]: Add new source as a sub-provider of Flickr
2. **Alert updates to Smithsonian unit codes**: For the Smithsonian provider, we rely on the field known as _unit code_ to determine the sub-provider (for Smithsonian it is often a museum) each image belongs to. However, it is possible for the _unit code_ values to change over time at the upstream, and if CC is unaware of these changes, it could hinder the successful categorisation of Smithsonian images under unique sub-provider values. I have therefore introduced a mechanism of alerting the CC code maintainers of potential changes to _unit code_ values at the upstream. More information is provided in my [second blog post][unit_code_blog_post] of this series. The PR related to this task is #[465][pr_465].
3. **Improvements to the Smithsonian provider API script**: Smithsonian is an important provider which aggregates images from 19 museums. However, because the different museums have different data models, and because of the resultant incompatibility of the JSON responses returned from requests to the Smithsonian API, it is difficult to know which fields to rely on to obtain the information necessary for CC. This results in CC missing out on certain important information. As part of my GSoC project, I improved the completeness of _creator_ and _description_ information by identifying previously unknown fields from which these details could be retrieved. Even though my improvements did not result in the identification of a comprehensive list of fields, the completeness of data was considerably improved for some Smithsonian museums compared to how it was before. For more context about this issue, please refer to ticket #[397][issue_397]. Apart from improving the Smithsonian data, I was also able to identify issues with certain Smithsonian API responses which did not contain mandatory information for some of the museums. We have informed the Smithsonian technical team of these issues, and they are highlighted in ticket #[397][issue_397] as well. The PRs related to this task are as follows.
   - PR #[474][pr_474]: Improve the creator and description information of the Smithsonian source _National Museum of Natural History_ (NMNH). This is the largest museum (source) under the Smithsonian provider.
   - PR #[476][pr_476]: Improve the _creator_ and _description_ information of other sources coming under the Smithsonian provider.
4. **Expiration of outdated images**: The final task I completed as part of my GSoC project was implementing a strategy for expiring outdated images in the CC database. CC has a mechanism for keeping the images retrieved from providers up-to-date, based on how old an image is. This is called the [re-ingestion strategy][reingest_blog_post], where newer images are updated more frequently than older images. However, this re-ingestion strategy does not detect images which have been deleted at the upstream. Thus, it is possible that some of the images stored in the CC database are obsolete, which could result in broken links being presented via the [CC Search][cc_search] tool. As a solution, I have implemented a mechanism for identifying whether images in the CC database are obsolete by looking at the *updated_on* column value of the CC image table. Depending on the re-ingestion strategy per provider, we know the oldest *updated_on* value an image can assume; if the *updated_on* value is older than this oldest valid value, we flag the corresponding image record as obsolete (a simplified sketch of this check follows the list). The PR related to this task is #[483][pr_483].
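To illustrate the idea behind the obsolescence check: the real logic lives in PR #[483][pr_483] and follows the actual per-provider re-ingestion schedule, whereas the function name, the column handling, and the 180-day window below are assumptions made purely for this sketch.

```python
from datetime import datetime, timedelta, timezone


def is_obsolete(updated_on, max_reingestion_gap_days):
    """Return True if an image row should be flagged as obsolete.

    `max_reingestion_gap_days` stands in for the longest gap the
    provider's re-ingestion strategy allows between refreshes of any
    image; the exact per-provider value is an assumption here.
    """
    oldest_valid = datetime.now(timezone.utc) - timedelta(days=max_reingestion_gap_days)
    return updated_on < oldest_valid


# Hypothetical image last refreshed two years ago, checked against an
# assumed maximum re-ingestion gap of 180 days.
last_refresh = datetime.now(timezone.utc) - timedelta(days=730)
print(is_obsolete(last_refresh, max_reingestion_gap_days=180))   # True
```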
I will continue to take responsibility for maintaining my code in the CC Catalog repository, and I hope to continue contributing to the CC codebase. It has been a wonderful GSoC journey for me, and special thanks go to my supervisor Brent for his guidance.

[cc_catalog_series]: ./#series
[cc_search]: https://ccsearch.creativecommons.org/
[cc_api]: https://api.creativecommons.engineering/v1/
[flickr_blog_post]: ../flickr-sub-provider-retrieval/
[unit_code_blog_post]: ../smithsonian-unit-code-update/
[reingest_blog_post]: ../date-partitioned-data-reingestion/
[pr_420]: https://github.com/creativecommons/cccatalog/pull/420
[pr_442]: https://github.com/creativecommons/cccatalog/pull/442
[pr_455]: https://github.com/creativecommons/cccatalog/pull/455
[pr_461]: https://github.com/creativecommons/cccatalog/pull/461
[pr_465]: https://github.com/creativecommons/cccatalog/pull/465
[pr_474]: https://github.com/creativecommons/cccatalog/pull/474
[pr_476]: https://github.com/creativecommons/cccatalog/pull/476
[pr_483]: https://github.com/creativecommons/cccatalog/pull/483
[issue_397]: https://github.com/creativecommons/cccatalog/issues/397

themes/vocabulary_theme/templates/layout.html

Lines changed: 0 additions & 1 deletion

@@ -78,7 +78,6 @@
      ['/contributing-code/projects', 'Project List'],
      ['/contributing-code/pr-guidelines', 'Pull Request Guidelines'],
      ['/contributing-code/github-repo-guidelines', 'GitHub Repo Guidelines'],
-     ['/contributing-code/cc-search', 'CC Search'],
      ['/contributing-code/usability', 'Usability'],
  ] %}
  <a class="navbar-item" href="{{ href|url }}">{{ title }}</a>
