Commit 39dec14

Merge pull request creativecommons#380 from creativecommons/update_master
Update master
2 parents a3773a3 + fba7325 commit 39dec14

8 files changed

+152
-88
lines changed

.gitignore

+1
@@ -13,3 +13,4 @@ node_modules
1313
gui/build
1414
gui/node_modules
1515
gui/static/gen
16+
*.log
@@ -0,0 +1,83 @@
1+
title: Linked Commons: Autocomplete Feature
2+
---
3+
categories:
4+
announcements
5+
cc-catalog
6+
product
7+
gsoc
8+
gsoc-2020
9+
open-source
10+
cc-dataviz
11+
---
12+
author: subhamX
13+
---
14+
series: gsoc-2020-dataviz
15+
---
16+
pub_date: 2020-07-31
17+
---
18+
19+
body:
20+
## Introduction
21+
This post explains the most recent feature integrated into the Linked Commons: autocomplete. Whether it is the giant Google Search or a small website with a single form field, everyone wants to predict what’s on the user’s mind. With every keystroke, a good search bar offers some options the user could be looking for. The core idea behind this feature is simple: *do as much work as possible for the user!*
22+
23+
24+
25+
<div style="text-align: center; width: 100%;">
26+
<figure>
27+
<img src="autocomplete-feat-in-action.gif" alt="autocomplete-feature" style="border: 1px solid black">
28+
<figcaption style="font-weight: 500;">Autocomplete feature in action</figcaption>
29+
</figure>
30+
</div>
31+
32+
## Motivation
33+
One of the newest features, integrated into Linked Commons last month, is filtering by node name. A user can search for their favourite node and explore all its neighbours. Since the list of nodes is very large, it was clear that we needed a text box (and not a drop-down) where the user types the node name.
34+
35+
Here are some of the reasons a bare text box for filtering by node name is not enough:
36+
- Some of the node names are very uncommon and lengthy, so there is a high probability of misspelling them.
37+
- Submitting the form and getting a response of “Node doesn’t exist” isn’t a very good user flow, and we want to minimise such incidents.
38+
39+
Also, on a side note, giving the user a search bar with no hints is ruthless. We all need recommendations, and guess what: Linked Commons has got you covered! Now, with every keystroke, we load a bunch of node names which you might be looking for. ;)
40+
41+
42+
## Problem
43+
At a very basic level, the autocomplete feature aims to predict the rest of the word the user is typing. One possible implementation is a linear traversal of all the nodes in the list, which has **linear time complexity** in the number of nodes. That is not very good, so it is natural to look for a faster and more efficient way. Also, even if we ignore the **time complexity** for a moment, searching for the best 10 nodes out of millions on the client's machine is not a good idea at all; it will cause throttling and result in performance drops.
44+
On the other hand, a **trie-based solution** is certainly more efficient, but we still cannot build this index on the client machine, for the same reasons stated above.
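
For readers who have not met a trie before, here is a minimal Python sketch of prefix completion. It is only an illustration of the idea and is not part of Linked Commons; the class names, the `complete` method and the sample domains are all made up for this example.

```python
class TrieNode:
    """One character position in the trie (hypothetical helper class)."""

    def __init__(self):
        self.children = {}   # maps a character to the next TrieNode
        self.is_end = False  # True if a stored name ends at this node


class Trie:
    """Illustrative prefix index over node names."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def complete(self, prefix, limit=10):
        # Walk down to the node that represents the prefix.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Pre-order depth-first walk, visiting children in alphabetical
        # order, so results come out sorted and we can stop at `limit`.
        results = []
        stack = [(node, prefix)]
        while stack and len(results) < limit:
            current, word = stack.pop()
            if current.is_end:
                results.append(word)
            for ch in sorted(current.children, reverse=True):
                stack.append((current.children[ch], word + ch))
        return results


trie = Trie()
for name in ["creativecommons.org", "creativetime.org", "wikipedia.org"]:
    trie.insert(name)
```

Reaching the prefix node costs time proportional to the length of the prefix rather than the number of stored names, which is where the advantage over a linear scan comes from. The memory needed to hold millions of names this way is the reason it does not belong in the browser.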
45+
So it is now apparent that we should implement this feature on the server, and also aim for something better than linear time complexity.
46+
47+
48+
49+
## A non-conventional solution
50+
We could have used Elasticsearch, which is very powerful and has a ton of functionality, but since our needs are very small, we wanted to look for simpler alternatives. Moreover, we didn't want to complicate our current architecture by adding another framework and more libraries.
51+
52+
Taking the above points into consideration, we went ahead with the following solution. We store all node data in an SQL database and search for all the nodes whose domain name matches the query string. After slicing the query set and applying some randomization, we send the payload to the client. To make it more robust, we cache the results in the frontend to avoid multiple calls for the same query. This reduces the load on the server and also gives a faster response.
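
To make the idea concrete, here is a rough Python sketch using the standard `sqlite3` module. It is not the actual Linked Commons code: the `node` table, the `domain` column, the sample rows and the `suggest` function are assumptions made for illustration. It also assumes prefix matching, orders results alphabetically instead of randomizing them, and uses `lru_cache` as a server-side stand-in for the frontend cache described above.

```python
import sqlite3
from functools import lru_cache

# Hypothetical schema and data; the real tables are not shown in this post.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (domain TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO node (domain) VALUES (?)",
    [
        ("creativecommons.org",),
        ("creativetime.org",),
        ("wikipedia.org",),
        ("wikimedia.org",),
    ],
)


def escape_like(text):
    """Escape LIKE wildcards so user input is matched literally."""
    return text.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


@lru_cache(maxsize=1024)  # stand-in for the frontend cache: repeat queries skip the database
def suggest(query, limit=10):
    """Return up to `limit` domains that start with `query`."""
    pattern = escape_like(query) + "%"
    rows = conn.execute(
        "SELECT domain FROM node "
        "WHERE domain LIKE ? ESCAPE '\\' "
        "ORDER BY domain LIMIT ?",
        (pattern, limit),
    ).fetchall()
    return tuple(row[0] for row in rows)
```

The `LIMIT` clause plays the role of slicing the query set, so only a handful of names ever travel to the client no matter how many nodes match.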
53+
54+
55+
## Results
56+
To make sure our solution works well, we performed load tests, checking whether response times stay below 1000 ms. We used Locust, a user load testing tool, and simulated **1000 users** with a **hatch rate of 10**.
57+
The test was performed on a local machine to ensure that the server location doesn’t affect the results.
58+
59+
Here are some aggregated result statistics.
60+
61+
62+
63+
| Field Name                | Value            |
64+
|---------------------------|------------------|
65+
| Request Count             | **23323**        |
66+
| Failure Count             | **0**            |
67+
| Median Response Time      | **360 ms**       |
68+
| Average Response Time     | **586.289 ms**   |
69+
| Min Response Time         | **4.03094 ms**   |
70+
| Max Response Time         | **4216 ms**      |
71+
| Average Content Size      | **528.667 bytes**|
72+
| Requests/s                | **171.754**      |
73+
| Max Requests/s            | **214**          |
74+
| Failures/s                | **0**            |
75+
76+
Since SQLite has a serverless design, disk I/O usually has a significant impact on performance. The tests above were run on a machine with HDD storage. The Linked Commons server is equipped with faster disk I/O, which will certainly improve performance, but the gain will be countered by network latency and other factors like the number of nodes in the database. So the above results resemble the actual performance to some degree.
77+
78+
79+
## Next steps
80+
In the next blog post, we will cover the long-awaited data update and the new architecture.
81+
82+
## Conclusion
83+
Overall, I enjoyed working on this feature, and it was a great learning experience. This feature has been successfully integrated into the development version, so do check it out. Now that you have read this post to the end, I hope you enjoyed it. For more information, please visit our [GitHub repo](https://github.com/creativecommons/cccatalog-dataviz/). We look forward to hearing from you on Linked Commons. Our [Slack](https://creativecommons.slack.com/channels/cc-dev-cc-catalog-viz) doors are always open to you, see you there. :)

content/vocabulary/contents.lr

+5
@@ -0,0 +1,5 @@
1+
_model: redirect
2+
---
3+
target: https://cc-vocabulary.netlify.app/
4+
---
5+
_discoverable: no

databags/community_team_members.json

+5-5
@@ -26,7 +26,7 @@
2626
},
2727
{
2828
"name": "Aksh Gupta",
29-
"role": "Project Member"
29+
"role": "Project Collaborator"
3030
},
3131
{
3232
"name": "Shubham Pandey",
@@ -66,7 +66,7 @@
6666
},
6767
{
6868
"name": "Abhishek Naidu",
69-
"role": "Project Member"
69+
"role": "Project Collaborator"
7070
}
7171
],
7272
"name": "CC Search",
@@ -204,15 +204,15 @@
204204
},
205205
{
206206
"name": "Chidiebere Onyegbuchulem",
207-
"role": "Project Member"
207+
"role": "Project Collaborator"
208208
},
209209
{
210210
"name": "Ayan Choudhary",
211-
"role": "Project Member"
211+
"role": "Project Collaborator"
212212
},
213213
{
214214
"name": "Tanuj Agarwal",
215-
"role": "Project Member"
215+
"role": "Project Collaborator"
216216
},
217217
{
218218
"name": "Jahnvi Gupta",

databags/search_roadmap.json

+53-78
@@ -14,44 +14,34 @@
1414
"name": "Move data cleaning pipeline from API to Catalog"
1515
},
1616
{
17-
"description": "Manage Catalog deployment and provisioning entirely through infrastructure as code.",
18-
"gid": "1167425798148811",
19-
"name": "Improve Catalog Deployment and Provisioning"
20-
},
21-
{
22-
"description": "Create better documentation for community contributors by consolidating internal and public documentation and making it available for everyone.",
23-
"gid": "1167425798148815",
24-
"name": "Improve Documentation for Community Contributors"
17+
"description": "Update Catalog schema to include new metadata generated through AWS Rekognition.",
18+
"gid": "1154270978154717",
19+
"name": "Implement architecture for schema for new metadata [AWS Grant]"
2520
},
2621
{
2722
"description": "Plan out search algorithm changes to incorporate image metadata generated via AWS Rekognition.",
2823
"gid": "1154270978154720",
2924
"name": "Plan search algorithm changes for new metadata [AWS Grant]"
3025
},
31-
{
32-
"description": "Update Catalog schema to include new metadata generated through AWS Rekognition.",
33-
"gid": "1154270978154717",
34-
"name": "Implement architecture for schema for new metadata [AWS Grant]"
35-
},
3626
{
3727
"description": "Improve how and where we explain licenses, and consider ways to make it easier for reusers to understand and comply with license requirements.",
3828
"gid": "1147666754358269",
3929
"name": "License Explanation/Compliance Improvements"
4030
},
4131
{
42-
"description": "Improve the support pages on CC Search, which includes the Collections page, for a better experience. Add explanation text for collections, improve flow.",
43-
"gid": "1149385618454685",
44-
"name": "Improved Support Pages"
32+
"description": "Offline Old Search (oldsearch.creativecommons.org) and redirect traffic to CC Search. Prior to this, build in messaging on Old Search, and support similar functionality on CC Search. See \"Meta Search Integration\" for related work.",
33+
"gid": "1149456632174214",
34+
"name": "Offline old CC Search"
4535
},
4636
{
47-
"description": "Integrating meta search functionality into CC Search for sources that are not currently indexed, and content types we do not currently support.",
48-
"gid": "1174575887784290",
49-
"name": "Design Sprint: Meta Search Integration"
37+
"description": "Research and test potential integrations for Web Monetization into CC Search and other CC web properties.",
38+
"gid": "1153114910798067",
39+
"name": "Web Monetization: Phase 1"
5040
},
5141
{
52-
"description": "Offline Old Search (oldsearch.creativecommons.org) and redirect traffic to CC Search. Prior to this, build in messaging on Old Search, and support similar functionality on CC Search. See \"Meta Search Integration\" for related work.",
53-
"gid": "1149456632174214",
54-
"name": "Offline old CC Search"
42+
"description": "Improve the support pages on CC Search, which includes the Collections page, for a better experience. Add explanation text for collections, improve flow.",
43+
"gid": "1149385618454685",
44+
"name": "Improved Support Pages"
5545
},
5646
{
5747
"description": "Make accessibility improvements to the UI.",
@@ -63,6 +53,11 @@
6353
"gid": "1149456632174198",
6454
"name": "Internationalization Infrastructure"
6555
},
56+
{
57+
"description": "Update our Common Crawl provider infrastructure to:\n(1) use Apache Airflow instead of AWS tools like Data Pipeline & Glue for processing data\n(2) unify provider processing to use the same base classes as API providers",
58+
"gid": "1167425798148813",
59+
"name": "Improve Common Crawl Infrastructure"
60+
},
6661
{
6762
"description": "Designing and prototyping an upcoming user interface for searching for audio on CC Search.",
6863
"gid": "1163392248010945",
@@ -73,86 +68,66 @@
7368
"gid": "1171015130050099",
7469
"name": "Audio Support and Integration"
7570
},
76-
{
77-
"description": "Update our Common Crawl provider infrastructure to:\n(1) use Apache Airflow instead of AWS tools like Data Pipeline & Glue for processing data\n(2) unify provider processing to use the same base classes as API providers",
78-
"gid": "1167425798148813",
79-
"name": "Improve Common Crawl Infrastructure"
80-
},
81-
{
82-
"description": "Switch our Catalog data ingestion for Wikimedia Commons to use the data dumps provided by Wikimedia instead of the MediaWiki API.",
83-
"gid": "1167425798148807",
84-
"name": "Use Data Dumps for Wikimedia Ingestion"
85-
},
86-
{
87-
"description": "Research and test potential integrations for Web Monetization into CC Search and other CC web properties.",
88-
"gid": "1153114910798067",
89-
"name": "Web Monetization: Phase 1"
90-
},
91-
{
92-
"description": "Build a UI for the Catalog API, where users can sign up, manage access, see usage metrics and statistics.",
93-
"gid": "1149478266493761",
94-
"name": "API UI with Usage Dashboard"
95-
},
96-
{
97-
"description": "Make CC Catalog API documentation more accessible to CC Search users, and improve user experience.",
98-
"gid": "1164969092703369",
99-
"name": "API documentation improvements"
100-
},
10171
{
10272
"description": "Store a private copy of all the images in the CC Catalog to analyze via machine learning.",
10373
"gid": "1154270978154722",
104-
"name": "Scraping & Resizing Work [AWS Grant]"
74+
"name": "Scraping & Resizing Work for Rekognition [AWS Grant]"
10575
},
10676
{
107-
"description": "Collect and use structured data from Wikidata to enhance our search algorithm with semantic search.",
108-
"gid": "1167425798148823",
109-
"name": "Wikidata integration with Catalog & Search Algorithm"
110-
},
111-
{
112-
"description": "Build an analytics UI that is fed by Google Analytics and our internal analytics database.",
113-
"gid": "1149385618454692",
114-
"name": "Usage/Reuse Metrics Dashboard"
77+
"description": "Generate metadata via machine learning (using AWS Rekognition) on a set of ~100 million high quality images from the CC Catalog.",
78+
"gid": "1154270978154727",
79+
"name": "Run Rekognition on 100m images [AWS Grant]"
11580
},
11681
{
11782
"description": "For all possible providers, use their APIs to ingest data into the CC Catalog instead of scraping websites via Common Crawl data.",
11883
"gid": "1149385618454708",
11984
"name": "Switch from Common Crawl to API"
120-
},
85+
}
86+
]
87+
},
88+
{
89+
"name": "Q4 2020",
90+
"tasks": [
12191
{
122-
"description": "Generate metadata via machine learning (using AWS Rekognition) on a set of ~100 million high quality images from the CC Catalog.",
123-
"gid": "1154270978154727",
124-
"name": "Run Rekognition on 100m images [AWS Grant]"
92+
"description": null,
93+
"gid": "1186693612765822",
94+
"name": "Search Relevance Improvements: Language Analysis, Quality Metrics, Minimums"
12595
},
12696
{
127-
"description": "Upgrade the CC Catalog database to use a schema-less database instead of the relational database (Postgres) that we currently use.",
128-
"gid": "1167425798148817",
129-
"name": "Upgrade Catalog: Data Lake"
97+
"description": "Design updates to the CC Search UI in response to new metadata available as a result of applying machine learning to selected images in the Catalog. At a minimum, we expect new filters will be an option. Integration of design will take place subsequently.",
98+
"gid": "1154270978154729",
99+
"name": "Plan UI Updates in Response to Metadata [AWS Grant]"
130100
},
131101
{
132102
"description": "Automate the process of finding new providers of CC-licensed content to index into the CC Catalog.",
133103
"gid": "1167425798148819",
134104
"name": "Provider Review Automation"
135105
},
136106
{
137-
"description": "Implement changes to CC Search (frontend) and Catalog to make use of thumbnails, as they become available.",
138-
"gid": "1154270978154725",
139-
"name": "Implement Use of Thumbnails in Search & Catalog [AWS Grant]"
107+
"description": "Build an analytics UI that is fed by Google Analytics and our internal analytics database.",
108+
"gid": "1149385618454692",
109+
"name": "Usage/Reuse Metrics Dashboard"
140110
},
141111
{
142-
"description": "Prepare partnership guidelines for CC Search. Create a page on CC Search publishing these guidelines.",
143-
"gid": "1146971105237802",
144-
"name": "Partnership guidelines for all integration types"
112+
"description": "Once the Rekognition crawl finishes, we want to crawl the rest of the catalog (but not feed them to rekognition). This will give us useful metadata like dimensions and quality.",
113+
"gid": "1186693612765814",
114+
"name": "Scrape all images and set up feed for new ones"
145115
},
146116
{
147-
"description": "Design updates to the CC Search UI in response to new metadata available as a result of applying machine learning to selected images in the Catalog. At a minimum, we expect new filters will be an option. Integration of design will take place subsequently.",
148-
"gid": "1154270978154729",
149-
"name": "Plan UI Updates in Response to Metadata [AWS Grant]"
150-
}
151-
]
152-
},
153-
{
154-
"name": "Q4 2020",
155-
"tasks": [
117+
"description": "Create better documentation for community contributors by consolidating internal and public documentation and making it available for everyone.",
118+
"gid": "1167425798148815",
119+
"name": "Improve Documentation for Community Contributors"
120+
},
121+
{
122+
"description": "Manage Catalog deployment and provisioning entirely through infrastructure as code.",
123+
"gid": "1167425798148811",
124+
"name": "Improve Catalog Deployment and Provisioning"
125+
},
126+
{
127+
"description": "Make CC Catalog API documentation more accessible to CC Search users, and improve user experience.",
128+
"gid": "1164969092703369",
129+
"name": "API documentation improvements"
130+
},
156131
{
157132
"description": "Design and build an embed of CC Search that can be placed on any website, as a starting point to discover objects in CC Search. Components from Design Library must be used, with the goal of simplicity.",
158133
"gid": "1168725971351188",
