
Commit ecf6777

Merge branch 'master' into update_recent_posts
2 parents 24b197b + d018c6e commit ecf6777

File tree: 64 files changed, +620 -154 lines

Some content is hidden: large commits collapse some file headers by default, so a few file paths below are not shown.

content/blog/authors/aldenpage/contents.lr (+2)

@@ -2,5 +2,7 @@ username: aldenpage
 ---
 name: Alden Page
 ---
+md5_hashed_email: 32853a2ab283e0093bf088d8af5d3cdc
+---
 about:
 [Alden](https://creativecommons.org/author/aldencreativecommons-org/) is Backend Software Engineer at Creative Commons. He is `@aldenpage` on the CC Slack.

(new file; path hidden)

@@ -0,0 +1,8 @@
username: annatuma
---
name: Anna Tumadóttir
---
md5_hashed_email: c6b98be141f57ce877a7ae10595b0ae0
---
about:
[Anna](https://creativecommons.org/author/annacreativecommons-org/) is Director of Product at Creative Commons. She's `@Anna` on the CC Slack.

(new files; paths hidden)

@@ -0,0 +1 @@
name: collaboration

@@ -0,0 +1 @@
name: gsod-2020

@@ -0,0 +1 @@
name: gsod

@@ -0,0 +1 @@
name: outreachy-2019-20

@@ -0,0 +1 @@
name: outreachy-2020

content/blog/entries/2020-03-05-involucrate-gsoc-outreachy-es/contents.lr (+2)

@@ -2,7 +2,9 @@ title: Involúcrate con nuestra comunidad de código abierto a través del Goog
 ---
 categories:
 gsoc
+gsoc-2020
 outreachy
+outreachy-2020
 ---
 author: hugosolar
 ---

content/blog/entries/2020-03-05-participe-gsoc/contents.lr (+2)

@@ -2,7 +2,9 @@ title: Participe do Google Summer of Code / Outreachy
 ---
 categories:
 gsoc
+gsoc-2020
 outreachy
+outreachy-2020
 ---
 author: brenoferreira
 ---

(new file; path hidden)

@@ -0,0 +1,25 @@
title: X5GON Using CC Catalog API for Image Results
---
categories:
community
cc-catalog
cc-search
announcements
collaboration
---
author: annatuma
---
pub_date: 2020-08-24
---
body:
A few months ago, the Open Education team at Creative Commons made an introduction between the folks working on X5GON and CC Search.

Over the course of a few conversations, we quickly discovered that there are many parallels in how we're approaching our work, and some important differences that would allow each of us to benefit from cooperation.

[X5GON](https://www.x5gon.org/) is building an AI-driven platform focused on the delivery of open education resources (OER). At its core, it is building a catalog of OER upon which other [services](https://www.x5gon.org/platforms/services/) are based, such as analytics for personalized recommendations and a discovery engine. By aggregating relevant content, curating it with artificial intelligence and machine learning, and personalizing the experience for each learner, they're making OER more accessible and relevant.

CC Search is not yet ready to ingest content types beyond images, but when we are able to do so, we plan to integrate via API with X5GON in order to serve OER that is made available in formats we will support in the future, starting with audio.

The [X5GON Discovery search engine](https://discovery.x5gon.org/) allows users to find OER in video, audio, and text formats - and now, with the integration of results powered by the CC Catalog API, which also powers CC Search, users can also find openly licensed images for relevant educational queries. This is a great resource for educators and learners from all over the world.

Try it for yourself, or look at these results for making [geometry](https://discovery.x5gon.org/search?q=geometry&type=Image) visual and fun!

(new file; path hidden)

@@ -0,0 +1,101 @@
title: CC Catalog: wrapping up GSoC20
---
categories:
cc-catalog
gsoc
gsoc-2020
---
author: srinidhi
---
series: gsoc-2020-cccatalog
---
pub_date: 2020-08-25
---
body:
With the Summer of Code coming to an end, this blog post summarises the work done during the last three months. The project I have been working on adds more provider API scripts to CC Catalog, the project responsible for collecting CC-licensed images hosted across the web.

The internship journey has been great, and I was glad to get the opportunity to understand more about how the data pipeline works. My work mainly involved researching new API providers and checking whether they meet the necessary conditions; we then decided on a strategy to crawl each API. The strategy varies between APIs: some can be partitioned by date, while others have to be paginated, and a script is written for each API according to its strategy.
During the later phase of the internship, I worked on the reingestion strategy for Europeana and on a script that merges Common Crawl tags and metadata into the corresponding images in the image table.

Provider APIs implemented:
- Science Museum: the Science Museum collection has around 60,000 images; it was initially crawled through Common Crawl and has been shifted to an API-based crawl.
- Issue: [Science Museum ticket][science_museum_issue]
- Related PRs: [Science Museum script][science_museum_script], [Science Museum workflow][science_museum_workflow]

- Statens Museum: Statens Museum for Kunst is Denmark's leading museum for artwork. This is a new integration, and 39,115 images have been collected.
- Issue: [Statens Museum ticket][statens_museum_issue]
- Related PRs: [Statens Museum implementation][statens_museum_implementation]

- Museums Victoria: initially ingested from Common Crawl and later shifted to an API-based crawl. It has around 140,000 images.
- Issue: [Museums Victoria ticket][museums_victoria_issue]
- Related PRs: [Museums Victoria implementation][museums_victoria_implementation]

- NYPL: the New York Public Library is a new integration; as of now it has around 1,296 images.
- Issue: [NYPL ticket][nypl_issue]
- Related PRs: [NYPL implementation][nypl_implementation]

- Brooklyn Museum: this was an existing integration; changes were made to follow the new ```ImageStore``` and ```DelayedRequester``` classes. It has 61,503 images. (A sketch of the general pattern these provider scripts follow appears after the link list below.)
- Issue: [Brooklyn Museum ticket][brooklyn_museum_issue]
- Related PRs: [Brooklyn Museum implementation][brooklyn_museum_implementation]

Iconfinder is a provider of icons that could not be integrated yet, as the current ingestion strategy is very slow and a better one is needed.
- Issue: [Iconfinder ticket][iconfinder_issue]

[science_museum_issue]: https://github.com/creativecommons/cccatalog/issues/302
[science_museum_script]: https://github.com/creativecommons/cccatalog/pull/400
[science_museum_workflow]: https://github.com/creativecommons/cccatalog/pull/411
[statens_museum_issue]: https://github.com/creativecommons/cccatalog/issues/393
[statens_museum_implementation]: https://github.com/creativecommons/cccatalog/pull/428
[museums_victoria_issue]: https://github.com/creativecommons/cccatalog/issues/291
[museums_victoria_implementation]: https://github.com/creativecommons/cccatalog/pull/447
[nypl_issue]: https://github.com/creativecommons/cccatalog/issues/147
[nypl_implementation]: https://github.com/creativecommons/cccatalog/pull/462
[brooklyn_museum_issue]: https://github.com/creativecommons/cccatalog/issues/348
[brooklyn_museum_implementation]: https://github.com/creativecommons/cccatalog/pull/355
[iconfinder_issue]: https://github.com/creativecommons/cccatalog/issues/396

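As context for the provider work described above, here is a rough sketch of the pattern a paginated provider script follows: a rate-limited requester walks the API page by page and hands each record to an image store that buffers rows for the catalog. The `DelayedRequester` and `ImageStore` classes below are simplified stand-ins, not the actual cccatalog classes, and the endpoint and JSON field names are hypothetical.

```python
import time
import requests

API_ENDPOINT = "https://api.example-museum.org/objects"  # hypothetical endpoint


class DelayedRequester:
    """Stand-in: enforce a minimum delay between requests to respect rate limits."""

    def __init__(self, delay=1.0):
        self.delay = delay
        self._last_request = 0.0

    def get(self, url, params=None):
        wait = self.delay - (time.monotonic() - self._last_request)
        if wait > 0:
            time.sleep(wait)
        self._last_request = time.monotonic()
        return requests.get(url, params=params, timeout=30)


class ImageStore:
    """Stand-in: buffer image records before writing them out in batches."""

    def __init__(self, provider):
        self.provider = provider
        self._buffer = []

    def add_item(self, foreign_landing_url, image_url, license_url, title=None):
        self._buffer.append({
            "provider": self.provider,
            "foreign_landing_url": foreign_landing_url,
            "image_url": image_url,
            "license_url": license_url,
            "title": title,
        })

    def commit(self):
        # The real pipeline writes rows destined for the image table; here we just report.
        print(f"would write {len(self._buffer)} rows for {self.provider}")
        self._buffer.clear()


def ingest(provider="example_museum", page_size=100):
    requester = DelayedRequester(delay=1.0)
    store = ImageStore(provider=provider)
    page = 1
    while True:  # paginated crawl strategy; date-partitioned providers loop over dates instead
        resp = requester.get(API_ENDPOINT, params={"page": page, "limit": page_size})
        records = resp.json().get("results", [])
        if not records:
            break
        for rec in records:
            store.add_item(
                foreign_landing_url=rec.get("web_url"),
                image_url=rec.get("image_url"),
                license_url=rec.get("license_url"),
                title=rec.get("title"),
            )
        page += 1
    store.commit()
```
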

## Europeana reingestion strategy
Data from Europeana had been collected on a daily basis, and it needed to be refreshed: newer data should be refreshed more frequently, and as data gets older it should be refreshed less often. While developing the strategy, the API key's rate limit and the maximum expected size of the collection have to be kept in mind. Considering these factors, a workflow was set up that crawls 59 days' worth of data each day.
The 59 days are split into layers: the DAG crawls data up to one week old daily, data older than a week but less than a year old monthly, and anything older than a year every three months. A small sketch of this layering follows the links below.
- Issue: [Europeana reingestion ticket][europeana_reingestion_issue]
- Related PR: [Europeana reingestion strategy][europeana_reingestion_strategy]

More details regarding the math of reingestion: [Data reingestion][data_reingestion_blog]
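
To make the layering concrete, here is a small sketch (not the actual DAG code) that computes which dates a given day's run would reingest. The layer sizes shown, 7 daily + 12 monthly + 40 quarterly = 59 dates per run, are one possible parameterization consistent with the description above, not the values from the PR.

```python
from datetime import date, timedelta

# (interval in days, number of dates at that interval); 7 + 12 + 40 = 59 dates per run
REINGESTION_LAYERS = [(1, 7), (30, 12), (90, 40)]


def reingestion_dates(today: date) -> list:
    """Return the list of dates whose data should be (re)collected on `today`'s run."""
    dates = []
    offset = 0
    for interval, count in REINGESTION_LAYERS:
        for _ in range(count):
            dates.append(today - timedelta(days=offset))
            offset += interval
    return dates


if __name__ == "__main__":
    schedule = reingestion_dates(date(2020, 8, 25))
    print(len(schedule), "dates; oldest:", schedule[-1])
```

Recent dates dominate the schedule, while old dates still come around every few months, which keeps the daily request volume within the API key's limit.
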

<div style="text-align:center;">
<img src="dag_image_1.png" width="1000px"/>
<img src="dag_image_2.png" width="1000px"/>
<img src="dag_image_3.png" width="1000px"/>
<p>Europeana reingestion workflow</p>
</div>

[europeana_reingestion_issue]: https://github.com/creativecommons/cccatalog/issues/412
[europeana_reingestion_strategy]: https://github.com/creativecommons/cccatalog/pull/473
[data_reingestion_blog]: https://opensource.creativecommons.org/blog/entries/date-partitioned-data-reingestion/

## Merging Common Crawl tags
When a provider is shifted from Common Crawl to an API-based crawl, the new data from the API doesn't have the tags and metadata that were generated using Clarifai, so the new data needs to be associated with the tags of the corresponding image from the Common Crawl data. A direct URL match is not possible because the Common Crawl URLs and the API image URLs differ, so we try to match on the number or identifier embedded in the URL.

Currently the merging logic is applied to Science Museum, Museums Victoria and the Met Museum.

For Science Museum, an API URL in the image table looks like https://coimages.sciencemuseumgroup.org.uk/images/240/862/large_BAB_S_1_02_0017.jpg while the Common Crawl URL looks like https://s3-eu-west-1.amazonaws.com/smgco-images/images/369/541/medium_SMG00096855.jpg. The idea is to reduce each URL down to its trailing identifier, so after modification by the modify_urls function the keys look like ```gpj.1700_20_1_S_BAB_``` (API URL) and ```gpj.55869000GMS_``` (Common Crawl URL). Similar logic has been applied to the Met Museum and Museums Victoria; a rough sketch of the approach follows the links below.
- Issue: https://github.com/creativecommons/cccatalog/issues/468
- Related PR: https://github.com/creativecommons/cccatalog/pull/478
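
As a rough illustration of the idea (an approximation, not the exact modify_urls logic in the PR linked above): strip the size prefix from the filename, reverse the remainder, and use the result as the join key between Common Crawl rows and API rows. The helper name and the set of prefixes handled here are assumptions, and reversing the first example this way yields `gpj.7100_20_1_S_BAB_`, which differs slightly from the string quoted above.

```python
from urllib.parse import urlparse

SIZE_PREFIXES = ("large", "medium", "small")  # assumed thumbnail-size prefixes


def url_join_key(url: str) -> str:
    """Reduce an image URL to a reversed identifier string usable as a match key."""
    filename = urlparse(url).path.rsplit("/", 1)[-1]
    for prefix in SIZE_PREFIXES:
        if filename.startswith(prefix):
            filename = filename[len(prefix):]  # drop e.g. "large" / "medium"
            break
    return filename[::-1]  # reverse so the trailing identifier leads the key


api_url = "https://coimages.sciencemuseumgroup.org.uk/images/240/862/large_BAB_S_1_02_0017.jpg"
cc_url = "https://s3-eu-west-1.amazonaws.com/smgco-images/images/369/541/medium_SMG00096855.jpg"
print(url_join_key(api_url))  # gpj.7100_20_1_S_BAB_
print(url_join_key(cc_url))   # gpj.55869000GMS_
```

Rows whose keys share the same leading identifier can then be joined so the Common Crawl tags carry over to the API-sourced image.
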

## Acknowledgement
I would like to thank my mentors Brent and Anna for their guidance throughout the internship.

content/blog/entries/cc-platform-toolkit-revamp-2/contents.lr (+1)

@@ -4,6 +4,7 @@ categories:
 community
 platform-toolkit
 outreachy
+outreachy-2019-20
 ---
 author: apdsrocha
 ---

content/blog/entries/cc-platform-toolkit-revamp-3/contents.lr (+1)

@@ -4,6 +4,7 @@ categories:
 community
 platform-toolkit
 outreachy
+outreachy-2019-20
 ---
 author: apdsrocha
 ---

content/blog/entries/cc-platform-toolkit-revamp-4/contents.lr (+1)

@@ -4,6 +4,7 @@ categories:
 community
 platform-toolkit
 outreachy
+outreachy-2019-20
 ---
 author: apdsrocha
 ---

content/blog/entries/cc-platform-toolkit-revamp/contents.lr (+1)

@@ -4,6 +4,7 @@ categories:
 community
 platform-toolkit
 outreachy
+outreachy-2019-20
 ---
 author: apdsrocha
 ---

content/blog/entries/cc-vocabulary-my-first-four-weeks/contents.lr (+1)

@@ -5,6 +5,7 @@ categories:
 cc-vocabulary
 product
 outreachy
+outreachy-2019-20
 ---
 author: conye
 ---

content/blog/entries/cc-vocabulary-week5-8/contents.lr (+1)

@@ -5,6 +5,7 @@ categories:
 cc-vocabulary
 product
 outreachy
+outreachy-2019-20
 ---
 author: conye
 ---

content/blog/entries/cc-vocabulary-week9-13/contents.lr (+1)

@@ -5,6 +5,7 @@ categories:
 cc-vocabulary
 product
 outreachy
+outreachy-2019-20
 ---
 author: conye
 ---

content/blog/entries/crawling-500-million/contents.lr (+1 -1)

@@ -45,7 +45,7 @@ We know we're not going to be able to crawl 500 million images with one virtual
 
 The worker processes do the actual analysis of the images, which essentially entails downloading the image, extracting interesting properties, and sticking the resulting metadata back into a Kafka topic for later downstream processing. The worker will also have to include some instrumentation for conforming to rate limits and error reporting.
 
-We also know that we will need to share some information about crawl progress between worker processes, such as whether we've exceeded our proscribed rate limit for a website, the number of times we've seen a status code in the last minute, how many images we've processed so far, and so on. Since we're only interested in sharing application state and aggregate statistics, a lightweight key/value store like Redis seems like a good fit.
+We also know that we will need to share some information about crawl progress between worker processes, such as whether we've exceeded our prescribed rate limit for a website, the number of times we've seen a status code in the last minute, how many images we've processed so far, and so on. Since we're only interested in sharing application state and aggregate statistics, a lightweight key/value store like Redis seems like a good fit.
 
 Finally, we need a supervising process that centrally controls the crawl. This key governing process will be responsible for making sure our crawler workers are behaving properly by moderating crawl rates for each source, taking action in the face of errors, and reporting statistics to the operators of the crawler. We'll call this process the crawl monitor.
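
Since the context above describes sharing rate-limit state and aggregate counters between workers via Redis, here is a minimal sketch of that pattern using the redis-py client. The key names and window size are illustrative assumptions, not the crawler's actual schema.

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
WINDOW_SECONDS = 60  # rolling window for per-source counters (illustrative)


def record_request(source: str, status_code: int) -> None:
    """Track per-source request volume and status codes so all workers see shared totals."""
    pipe = r.pipeline()
    for key in (f"crawl:{source}:requests", f"crawl:{source}:status:{status_code}"):
        pipe.incr(key)
        pipe.expire(key, WINDOW_SECONDS)  # windowed counters age out automatically
    pipe.incr(f"crawl:{source}:images_processed")  # running total, no expiry
    pipe.execute()


def over_rate_limit(source: str, limit_per_window: int) -> bool:
    """Check whether this source has exceeded its prescribed request budget for the window."""
    count = r.get(f"crawl:{source}:requests")
    return int(count or 0) >= limit_per_window
```
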

content/blog/entries/improving-cc-license-chooser-coding/contents.lr (+1)

@@ -5,6 +5,7 @@ author: obulat
 categories:
 
 outreachy
+outreachy-2019-20
 cc-chooser
 ---
 series: outreachy-dec-2019-chooser

content/blog/entries/improving-cc-license-chooser-outcomes/contents.lr (+1)

@@ -5,6 +5,7 @@ author: obulat
 categories:
 
 outreachy
+outreachy-2019-20
 cc-chooser
 ---
 series: outreachy-dec-2019-chooser

content/blog/entries/improving-cc-license-chooser-weeks-1-2-design/contents.lr (+1)

@@ -5,6 +5,7 @@ author: obulat
 categories:
 
 outreachy
+outreachy-2019-20
 cc-chooser
 ---
 series: outreachy-dec-2019-chooser

content/blog/entries/integration-vocabulary-ccos/contents.lr (+1)

@@ -2,6 +2,7 @@ title: Integration of Vocabulary with CCOS.
 ---
 categories:
 outreachy
+outreachy-2020
 tech
 open-source
 ---

content/blog/entries/legal-database-a-new-beginning/contents.lr (+1)

@@ -4,6 +4,7 @@ categories:
 cc-legal-database
 product
 outreachy
+outreachy-2020
 ---
 author: krysal
 ---

content/blog/entries/legal-database-coding-mid-term/contents.lr (+1)

@@ -4,6 +4,7 @@ categories:
 cc-legal-database
 product
 outreachy
+outreachy-2020
 ---
 author: krysal
 ---

content/blog/entries/legal-database-design/contents.lr (+1)

@@ -4,6 +4,7 @@ categories:
 cc-legal-database
 product
 outreachy
+outreachy-2020
 ---
 author: krysal
 ---

content/blog/entries/legal-database-features/contents.lr (+1)

@@ -4,6 +4,7 @@ categories:
 cc-legal-database
 product
 outreachy
+outreachy-2020
 ---
 author: krysal
 ---

content/blog/entries/linked-commons-autocomplete-feature/contents.lr (+1 -3)

@@ -32,7 +32,7 @@ The following blog intends to explain the very recent feature integrated to the
 ## Motivation
 One of the newest features integrated last month into Linked Commons is Filtering by node name. Here a user can search for his/her favourite node and explore all its neighbours. Since the list is very big, it was self-evident for us to have a text box (and not a drop-down) where the user is supposed to type the node name.
 
-Some of the reasons why to have a text box or filtering by node option.
+Some of the reasons why to have "autocomplete feature" in the filtering by node name -
 - Some of the node names are very uncommon and lengthy. There is a high probability of misspelling it.
 - Submitting the form and getting a response of “Node doesn’t exist” isn’t a very good user flow, and we want to minimise such incidents.
 

@@ -73,8 +73,6 @@ Here are some aggregated result statistics.
 | Max Requests/s |** 214 **|
 | Failures/s |** 0 **|
 
-Since SQLlite has a serverless design, disk io usually has a significant impact on the performance. The above results were executed on a server with HDD storage. Linked Commons server is equipped with faster disk io. It will certainly improve the performance but will be countered by the network latency and other factors like the number of nodes in the dB. So the above results to some degree resemble the actual performance.
-
 
 ## Next steps
 In the next blog, we will be covering the long awaited data update and the new architecture.
