Commit 4d4f70f

Merge branch 'master' into srinidhi_gsoc_final_blog
2 parents 16f4bcb + 68beab9

6 files changed: +180 −4 lines

content/blog/entries/linked-commons-autocomplete-feature/contents.lr

Lines changed: 1 addition & 3 deletions
@@ -32,7 +32,7 @@ The following blog intends to explain the very recent feature integrated to the
  ## Motivation

  One of the newest features integrated last month into the Linked Commons is filtering by node name. Here a user can search for his/her favourite node and explore all its neighbours. Since the list is very big, it was self-evident for us to have a text box (and not a drop-down) where the user is supposed to type the node name.

- Some of the reasons why to have a text box or filtering by node option.
+ Some of the reasons to have an "autocomplete" feature in the filtering-by-node-name option:

  - Some of the node names are very uncommon and lengthy. There is a high probability of misspelling them.
  - Submitting the form and getting a response of “Node doesn’t exist” isn’t a very good user flow, and we want to minimise such incidents.

@@ -73,8 +73,6 @@ Here are some aggregated result statistics.
  | Max Requests/s | **214** |
  | Failures/s     | **0**   |

- Since SQLite has a serverless design, disk I/O usually has a significant impact on performance. The above results were obtained on a server with HDD storage. The Linked Commons server is equipped with faster disk I/O; that will certainly improve performance, but the gain will be countered by network latency and other factors such as the number of nodes in the DB. So the above results resemble the actual performance to some degree.

  ## Next steps

  In the next blog, we will be covering the long-awaited data update and the new architecture.
Lines changed: 91 additions & 0 deletions

@@ -0,0 +1,91 @@
title: Linked Commons: Data Update
---
categories:
announcements
cc-catalog
product
gsoc
gsoc-2020
open-source
cc-dataviz
---
author: subhamX
---
series: gsoc-2020-dataviz
---
pub_date: 2020-08-25
---
body:
In this blog, I will be explaining the task we were working on for the last 3-4 weeks. It will take you on a journey of optimizations, from millions of graph traversals while building the database down to just a few traversals in the end. We will also cover the new architecture for the upcoming version of the Linked Commons and the reasons behind the change.

## Where does it fit?

So far the Linked Commons has been using a tiny subset of the data available in the CC Catalog. One of the primary targets of our team was to update the data. If you observe closely, all tasks so far, starting from adding "Graph Filtering Methods" to the "Autocomplete Feature", were actually bringing us closer to this task, i.e. the much-awaited **"Scale the Data of Linked Commons"**. We aim to grow the Linked Commons project from around **400 nodes and 500 links** in the current version to around **235k nodes and 4.14 million links**. This drastic addition of new data is one of its kind, which makes this task very challenging and exciting.

## Pilot

The raw CC Catalog data cannot be used directly in the Linked Commons. Our first task involves processing it, which includes removing isolated nodes, etc. You can read more about it in the data processing series [blog](https://opensource.creativecommons.org/blog/entries/cc-datacatalog-data-processing/) written by my mentor Maria. After this, we need to build a database which stores the **"distance list"** of all the nodes.
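Maria's blog covers the processing pipeline in detail; as a tiny, hypothetical illustration of just the isolated-node step (the function name and data shapes below are assumptions made for this sketch, not the actual pipeline code), nodes that appear in no link can be filtered out like this:

```python
def drop_isolated_nodes(nodes, links):
    """Keep only nodes that appear as the source or target of at least one link.

    `nodes` is an iterable of node ids and `links` an iterable of
    (source, target) pairs; both shapes are assumptions for this sketch.
    """
    connected = {s for s, _ in links} | {t for _, t in links}
    return [node for node in nodes if node in connected]


# Hypothetical toy data: 'lonely-node' has no links, so it is dropped.
nodes = ["icij", "nodeA", "lonely-node"]
links = [("icij", "nodeA")]
print(drop_isolated_nodes(nodes, links))   # ['icij', 'nodeA']
```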
### What is a "distance list"?

<div style="text-align: center; width: 90%; margin-left: 5%;">
<figure>
<img src="distance-list.png" alt="Distance List" style="border: 1px solid black">
<figcaption>Distance list representation* of the node 'icij', part of a hypothetical graph</figcaption>
</figure>
</div>

***

A **distance list** is a method of graph representation. It is similar to the [Adjacency List](https://en.wikipedia.org/wiki/Adjacency_list) representation of graphs, but instead of storing data about just the immediate neighbouring nodes, the distance list groups all vertices by their distance from the root node and stores this grouped data for every vertex in the graph. In short, the distance list is a more general form of the adjacency list representation.

To build this distance list, we created a script, let's call it **build-dB-script.py**, which runs the [Breadth-First Search (BFS)](https://en.wikipedia.org/wiki/Breadth-first_search) algorithm from every node to traverse the graph and gradually build the distance list. The node-filtering feature of our web page connects to the server, which uses the aforementioned database and serves a smaller chunk of data.
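To make the idea concrete, here is a minimal sketch of how the distance list for a single root can be computed with BFS over a plain adjacency-list dictionary. This is not the actual *build-dB-script.py*; the toy graph and node names are hypothetical.

```python
from collections import deque


def distance_list(graph, root):
    """Group every node reachable from `root` by its BFS distance.

    `graph` is a plain adjacency list: {node: [neighbour, ...]}.
    Returns {distance: [nodes at that distance], ...}.
    """
    distances = {root: 0}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for neighbour in graph.get(current, []):
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)

    grouped = {}
    for node, dist in distances.items():
        if dist > 0:                          # skip the root itself
            grouped.setdefault(dist, []).append(node)
    return grouped


# Hypothetical graph, loosely mirroring the 'icij' illustration above.
graph = {
    "icij": ["nodeA", "nodeB"],
    "nodeA": ["nodeC"],
    "nodeB": ["nodeC", "nodeD"],
}
print(distance_list(graph, "icij"))
# {1: ['nodeA', 'nodeB'], 2: ['nodeC', 'nodeD']}
```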
## Problem

Now that we know where the *build-dB-script* is used, let's discuss the problems with it. The new graph data we are going to use is enormous, with nodes in the millions. A full traversal of a graph with a million nodes, a million times over, is very slow. Just to give some helpful numbers, the script was taking around 10 minutes to process a hundred nodes. Assuming the growth is linear (in the best case), processing roughly 235k nodes would take about 23,500 minutes, i.e. more than **15 days** of computation. **That is scary, and thus optimizations in the *build-dB-script* are the need of the hour!!**

## Optimizations

In this section, we will talk about the different versions of the build database script, starting from the brute force BFS method.

The brute force BFS was the simplest and technically correct solution, but as the name suggests it was slow. In the next iteration, I stored the details of the last n nodes (10, to be precise) and performed the same old BFS. It was faster, but it had a logic error: say there is a link from a node to an already visited/traversed node; the script was not adding all the nodes which could have been explored from that path. After a few more leaps, from Depth-First Search to Breadth-First Search and other methods, eventually, with the help of my mentors, we built a new approach - **"Sequential dB Build"**.

To keep this blog short, I won't be going too much into implementation details, but here are some of the critical points.

### Key points of the Sequential dB Build:

- It was the fastest of all its predecessors and reduced the script timing significantly.
- In this approach, we aimed to build all the distance lists for distances [1, 2, 3, ..., k-1] before building the k-th distance list (see the sketch below).
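The actual implementation is not reproduced in this post, but a minimal sketch of the level-by-level idea, assuming each level k is derived only from the level k−1 frontier, might look roughly like this (the function names and toy graph are made up for the example):

```python
def next_level(graph, seen, frontier):
    """Nodes at distance exactly k, given everything already placed at
    distance <= k-1 (`seen`) and the distance k-1 nodes (`frontier`)."""
    level = set()
    for node in frontier:
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                level.add(neighbour)
    return level


def sequential_build(graph, max_distance):
    """Build distance lists 1..max_distance for every root, one level at a time."""
    db = {root: {} for root in graph}
    state = {root: ({root}, {root}) for root in graph}   # (seen, frontier) per root
    for k in range(1, max_distance + 1):                 # finish level k for all roots
        for root in graph:                               # before starting level k+1
            seen, frontier = state[root]
            level = next_level(graph, seen, frontier)
            db[root][k] = sorted(level)
            state[root] = (seen | level, level)
    return db


graph = {"icij": ["nodeA"], "nodeA": ["nodeB"], "nodeB": []}
print(sequential_build(graph, max_distance=2)["icij"])   # {1: ['nodeA'], 2: ['nodeB']}
```

The point of ordering the work this way is that the level k−1 results are already in hand when level k is computed, so no root ever needs a full re-traversal of the graph.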
Unfortunately, it was still not enough for our current requirements. Just to give you some insight, the distance-two list computation was taking around **4 hours**, and the **distance-three list** computation was taking **20+ hours**. It shows that all these optimizations were not enough and were incapable of handling this big dataset.

## New Architecture

As the optimizations in "build-dB-scripts" weren't enough, we started looking to simplify the current architecture. In the end, we want to have a viable product which is scalable to this massive data. Although we are still not dropping the multi-distance filtering, we will continue our research on it and hopefully will have it in **Linked Commons 3.0**. 😎

For any node, it is most likely that a person would wish to know the immediate neighbours linking to it. Nodes at a distance greater than one exhibit very little information on the reach and connectivity of the root node. Because of this, we decided to change our current logic of having the distance list up to 10; instead, we reduced it to 1 and also stored the list of immediate incoming nodes (nodes which are at distance 1 in the [transpose graph](https://en.wikipedia.org/wiki/Transpose_graph)).

This small change in the design simplified a lot of things, and the new graph build now takes around 2 minutes. By the time I am writing this blog, we have upgraded our database from **shelve to MongoDB**, where the build time is further reduced. 🔥🔥

<div style="text-align: center; width: 90%; margin-left: 5%;">
<figure>
<img src="graph.png" alt="Light Theme" style="border: 1px solid black">
<figcaption>Graph showing neighbouring nodes. Incoming links are coloured turquoise and outgoing links are coloured red.</figcaption>
</figure>
</div>
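As a rough illustration of what the simplified build boils down to (the real Linked Commons schema is more complex, as the footnote at the end of this post notes, and the database and collection names below are invented for the example), the distance-1 outgoing and incoming lists can be produced in one pass over the edge list and written to MongoDB:

```python
from collections import defaultdict

from pymongo import MongoClient   # assumes a MongoDB instance is running locally


def build_neighbour_lists(edges):
    """One pass over (source, target) edges yields, for every node, its
    distance-1 outgoing list and its incoming list (i.e. its distance-1
    list in the transpose graph)."""
    outgoing, incoming = defaultdict(list), defaultdict(list)
    for source, target in edges:
        outgoing[source].append(target)
        incoming[target].append(source)
    nodes = set(outgoing) | set(incoming)
    return [
        {"node": node, "outgoing": outgoing[node], "incoming": incoming[node]}
        for node in nodes
    ]


edges = [("icij", "nodeA"), ("nodeB", "icij")]            # hypothetical edges
docs = build_neighbour_lists(edges)

client = MongoClient("mongodb://localhost:27017")
client["linked_commons"]["neighbours"].insert_many(docs)  # invented db/collection names
```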
## Conclusion

This task was really challenging, and I learnt a lot. It was really mesmerizing to see the **Linked Commons grow and evolve**. I hope you enjoyed reading this blog. You can follow the project development [here](https://github.com/creativecommons/cccatalog-dataviz/), and access the stable version of the Linked Commons [here](http://dataviz.creativecommons.engineering/).

Feel free to report bugs and suggest features; it will help us improve this project. If you wish to join our team, consider joining our [slack](https://creativecommons.slack.com/channels/cc-dev-cc-catalog-viz) channel. Read more about our community teams [here](https://opensource.creativecommons.org/community/). See you in my next blog! 🚀
___

**The Linked Commons uses a more complex schema. The picture is just for illustration.*
Lines changed: 88 additions & 0 deletions

@@ -0,0 +1,88 @@
title: Overview of the GSoC 2020 Project
---
categories:

cc-catalog
gsoc
gsoc-2020
---
author: charini
---
series: gsoc-2020-cccatalog
---
pub_date: 2020-08-26
---
body:
This is my final blog post under the [GSoC 2020: CC catalog][cc_catalog_series] series, where I will highlight and summarize my contributions to Creative Commons (CC) as part of my GSoC project. The CC Catalog project collects and stores CC licensed images scattered across the internet, such that they can be made accessible to the general public via the [CC Search][cc_search] and [CC Catalog API][cc_api] tools. I got the opportunity to work on different aspects of the CC Catalog repository which ultimately enhance the user experience of the CC Search and CC Catalog API tools. My primary contributions in the duration of GSoC, and the related pull requests (PR), are as follows.
1. **Sub-provider retrieval**: The first task I completed as part of my GSoC project was the retrieval of sub-providers (also known as _source_), such that images could be categorised under these sources, ensuring an enhanced search experience for the users. I completed the implementation of sub-provider retrieval for three providers: Flickr, Europeana, and Smithsonian. If you are interested in learning how the retrieval logic works, please check my [initial blog post][flickr_blog_post] of this series. The PRs related to this task are as follows.
   - PR #[420][pr_420]: Retrieve sub-providers within Flickr
   - PR #[442][pr_442]: Retrieve sub-providers within Europeana
   - PR #[455][pr_455]: Retrieve sub-providers within Smithsonian
   - PR #[461][pr_461]: Add new source as a sub-provider of Flickr
2. **Alert updates to Smithsonian unit codes**: For the Smithsonian provider, we rely on the field known as _unit code_ to determine the sub-provider (for Smithsonian it is often a museum) each image belongs to. However, it is possible for the _unit code_ values to change over time at the upstream, and if CC is unaware of these changes, it could hinder the successful categorisation of Smithsonian images under unique sub-provider values. I have therefore introduced a mechanism of alerting the CC code maintainers of potential changes to _unit code_ values at the upstream. More information is provided in my [second blog post][unit_code_blog_post] of this series. The PR related to this task is #[465][pr_465].
3. **Improvements to the Smithsonian provider API script**: Smithsonian is an important provider which aggregates images from 19 museums. However, because the different museums have different data models, and because of the resultant incompatibility of the JSON responses returned from requests to the Smithsonian API, it is difficult to know which fields to rely on to obtain the information necessary for CC. This results in CC missing out on certain important information. As part of my GSoC project, I improved the completeness of _creator_ and _description_ information by identifying previously unknown fields from which these details could be retrieved. Even though my improvements did not result in the identification of a comprehensive list of fields, the completeness of data was considerably improved for some Smithsonian museums compared to how it was before. For more context about this issue, please refer to ticket #[397][issue_397]. Apart from improving the Smithsonian data, I was also able to identify issues with certain Smithsonian API responses which did not contain mandatory information for some of the museums. We have informed the Smithsonian technical team of these issues, and they are highlighted in ticket #[397][issue_397] as well. The PRs related to this task are as follows.
   - PR #[474][pr_474]: Improve the creator and description information of the Smithsonian source _National Museum of Natural History_ (NMNH). This is the largest museum (source) under the Smithsonian provider.
   - PR #[476][pr_476]: Improve the _creator_ and _description_ information of other sources coming under the Smithsonian provider.
4. **Expiration of outdated images**: The final task I completed as part of my GSoC project was implementing a strategy for expiring outdated images in the CC database. CC has a mechanism for keeping the images retrieved from providers up-to-date, based on how old an image is. This is called the [re-ingestion strategy][reingest_blog_post], where newer images are updated more frequently than older images. However, this re-ingestion strategy does not detect images which have been deleted at the upstream. Thus, it is possible that some of the images stored in the CC database are obsolete, which could result in broken links being presented via the [CC Search][cc_search] tool. As a solution, I have implemented a mechanism for identifying whether images in the CC database are obsolete by looking at the *updated_on* column value of the CC image table. Depending on the re-ingestion strategy per provider, we know the oldest *updated_on* value an image can assume; if the *updated_on* value is older than this oldest valid value, we flag the corresponding image record as obsolete (a simplified sketch of this check follows the list). The PR related to this task is #[483][pr_483].
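To illustrate the idea behind the obsolescence check: the real logic lives in PR #[483][pr_483] and follows the actual per-provider re-ingestion schedule, whereas the function name, the column handling, and the 180-day window below are assumptions made purely for this sketch.

```python
from datetime import datetime, timedelta, timezone


def is_obsolete(updated_on, max_reingestion_gap_days):
    """Return True if an image row should be flagged as obsolete.

    `max_reingestion_gap_days` stands in for the longest gap the
    provider's re-ingestion strategy allows between refreshes of any
    image; the exact per-provider value is an assumption here.
    """
    oldest_valid = datetime.now(timezone.utc) - timedelta(days=max_reingestion_gap_days)
    return updated_on < oldest_valid


# Hypothetical image last refreshed two years ago, checked against an
# assumed maximum re-ingestion gap of 180 days.
last_refresh = datetime.now(timezone.utc) - timedelta(days=730)
print(is_obsolete(last_refresh, max_reingestion_gap_days=180))   # True
```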
I will continue to take responsibility for maintaining my code in the CC Catalog repository, and I hope to continue contributing to the CC codebase. It has been a wonderful GSoC journey for me, and special thanks go to my supervisor Brent for his guidance.

[cc_catalog_series]: ./#series
[cc_search]: https://ccsearch.creativecommons.org/
[cc_api]: https://api.creativecommons.engineering/v1/
[flickr_blog_post]: ../flickr-sub-provider-retrieval/
[unit_code_blog_post]: ../smithsonian-unit-code-update/
[reingest_blog_post]: ../date-partitioned-data-reingestion/
[pr_420]: https://github.com/creativecommons/cccatalog/pull/420
[pr_442]: https://github.com/creativecommons/cccatalog/pull/442
[pr_455]: https://github.com/creativecommons/cccatalog/pull/455
[pr_461]: https://github.com/creativecommons/cccatalog/pull/461
[pr_465]: https://github.com/creativecommons/cccatalog/pull/465
[pr_474]: https://github.com/creativecommons/cccatalog/pull/474
[pr_476]: https://github.com/creativecommons/cccatalog/pull/476
[pr_483]: https://github.com/creativecommons/cccatalog/pull/483
[issue_397]: https://github.com/creativecommons/cccatalog/issues/397

themes/vocabulary_theme/templates/layout.html

Lines changed: 0 additions & 1 deletion

@@ -78,7 +78,6 @@
      ['/contributing-code/projects', 'Project List'],
      ['/contributing-code/pr-guidelines', 'Pull Request Guidelines'],
      ['/contributing-code/github-repo-guidelines', 'GitHub Repo Guidelines'],
-     ['/contributing-code/cc-search', 'CC Search'],
      ['/contributing-code/usability', 'Usability'],
  ] %}
  <a class="navbar-item" href="{{ href|url }}">{{ title }}</a>
