Commit ccab20c

Merge pull request creativecommons#477 from kss682/srinidhi_gsoc_final_blog
gsoc final blog
2 parents 68beab9 + 4d4f70f commit ccab20c

4 files changed

+101
-0
lines changed

@@ -0,0 +1,101 @@
title: CC Catalog: wrapping up GSoC20
---
categories:

cc-catalog
gsoc
gsoc-2020
---
author: srinidhi
---
series: gsoc-2020-cccatalog
---
pub_date: 2020-08-25
---
body:
With the Summer of Code coming to an end, this blog post summarises the work done during the last three months. The project I have been working on adds more provider API scripts to the CC Catalog. The CC Catalog project is responsible for collecting CC-licensed images hosted across the web.

The internship journey has been great, and I was glad to get the opportunity to understand more about how the data pipeline works. My work during the internship mainly involved researching new API providers and checking whether they meet the necessary conditions; we then decided on a strategy to crawl each API. The strategy varies from API to API: some can be partitioned by date, while others have to be paginated. A script is then written for the API according to the chosen strategy.
During the later phase of the internship, I worked on the reingestion strategy for Europeana and on a script to merge Common Crawl tags and metadata into the corresponding images in the image table.
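
The two crawl strategies mentioned above, date partitioning and pagination, can be sketched roughly as follows. This is only an illustration and not the provider scripts from the PRs: `date_partitions`, `paginate` and `fake_fetch_page` are hypothetical names, and the in-memory `FAKE_RECORDS` list stands in for a real provider API.

```python
from datetime import date, timedelta


def date_partitions(start, end, step_days=1):
    """Yield (window_start, window_end) pairs covering [start, end)."""
    current = start
    while current < end:
        window_end = min(current + timedelta(days=step_days), end)
        yield current, window_end
        current = window_end


def paginate(fetch_page, page_size=100):
    """Yield records page by page until the provider returns an empty page."""
    page = 0
    while True:
        records = fetch_page(page=page, page_size=page_size)
        if not records:
            break
        yield from records
        page += 1


# Stand-in for a provider API (assumption): 250 fake records served in pages.
FAKE_RECORDS = [{"id": i} for i in range(250)]


def fake_fetch_page(page, page_size):
    start = page * page_size
    return FAKE_RECORDS[start:start + page_size]


windows = list(date_partitions(date(2020, 8, 1), date(2020, 8, 4)))
images = list(paginate(fake_fetch_page))
print(len(windows), len(images))  # 3 250
```

Date partitioning suits APIs that can filter by a date field, since each window can be crawled (and later recrawled) independently; pagination is the fallback when the API only exposes page or offset parameters.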

Provider APIs implemented:
- Science Museum: The Science Museum collection has around 60,000 images. It was initially crawled through Common Crawl and has been shifted to an API-based crawl.
    - Issue: [Science Museum ticket][science_museum_issue]
    - Related PRs: [Science Museum script][science_museum_script], [Science Museum workflow][science_museum_workflow]


- Statens Museum: Statens Museum for Kunst is Denmark’s leading museum for artwork. This is a new integration, and 39,115 images have been collected.
    - Issue: [Statens Museum ticket][statens_museum_issue]
    - Related PRs: [Statens Museum implementation][statens_museum_implementation]


- Museums Victoria: It was initially ingested from Common Crawl and later shifted to an API-based crawl. It has around 140,000 images.
    - Issue: [Museums Victoria ticket][museums_victoria_issue]
    - Related PRs: [Museums Victoria implementation][museums_victoria_implementation]


- NYPL: The New York Public Library is a new integration; as of now it has around 1,296 images.
    - Issue: [NYPL ticket][nypl_issue]
    - Related PRs: [NYPL implementation][nypl_implementation]


- Brooklyn Museum: This was an existing integration; changes were made to follow the new ```ImageStore``` and ```DelayedRequestor``` classes. It has 61,503 images.
    - Issue: [Brooklyn Museum ticket][brooklyn_museum_issue]
    - Related PRs: [Brooklyn Museum implementation][brooklyn_museum_implementation]


Iconfinder is a provider of icons that could not be integrated, as the current ingestion strategy is very slow and a better strategy is needed.
- Issue: [Iconfinder ticket][iconfinder_issue]

[science_museum_issue]: https://github.com/creativecommons/cccatalog/issues/302
[science_museum_script]: https://github.com/creativecommons/cccatalog/pull/400
[science_museum_workflow]: https://github.com/creativecommons/cccatalog/pull/411
[statens_museum_issue]: https://github.com/creativecommons/cccatalog/issues/393
[statens_museum_implementation]: https://github.com/creativecommons/cccatalog/pull/428
[museums_victoria_issue]: https://github.com/creativecommons/cccatalog/issues/291
[museums_victoria_implementation]: https://github.com/creativecommons/cccatalog/pull/447
[nypl_issue]: https://github.com/creativecommons/cccatalog/issues/147
[nypl_implementation]: https://github.com/creativecommons/cccatalog/pull/462
[brooklyn_museum_issue]: https://github.com/creativecommons/cccatalog/issues/348
[brooklyn_museum_implementation]: https://github.com/creativecommons/cccatalog/pull/355
[iconfinder_issue]: https://github.com/creativecommons/cccatalog/issues/396


## Europeana reingestion strategy
Data from Europeana was collected on a daily basis, and there was a need to refresh it. The idea is that new data should be refreshed more frequently, and as the data gets older, refreshing should become less frequent. While developing the strategy, the API key limit and the maximum expected collection size had to be kept in mind. Considering these factors, a workflow was set up such that each day it crawls 59 days of data.
The 59 days are split up into layers. The DAG crawls data up to 1 week old daily, data more than 1 week old but less than a year old monthly, and anything older than a year every 3 months.
- Issue: [Europeana reingestion ticket][europeana_reingestion_issue]
- Related PR: [Europeana reingestion strategy][europeana_reingestion_strategy]

More details regarding the math of reingestion: [Data reingestion][data_reingestion_blog]
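
To make the layering concrete, here is a small sketch of how the list of dates for one daily run could be computed. The layer counts in `ASSUMED_LAYERS` are an assumption chosen so that the total comes to 59 (today plus 7 daily, 11 monthly and 40 quarterly steps); the exact split used in the PR is not stated in this post, and `reingestion_offsets` and `reingestion_dates` are hypothetical helper names.

```python
from datetime import date, timedelta


def reingestion_offsets(layers):
    """Return day offsets (0 = run date) for layers given as (interval_days, count)."""
    offsets = [0]
    for interval_days, count in layers:
        for _ in range(count):
            offsets.append(offsets[-1] + interval_days)
    return offsets


# Assumed split, not taken from the PR: 7 daily steps, 11 steps of 30 days,
# 40 steps of 90 days. With the run date itself: 1 + 7 + 11 + 40 = 59 dates.
ASSUMED_LAYERS = [(1, 7), (30, 11), (90, 40)]


def reingestion_dates(run_date, layers=ASSUMED_LAYERS):
    """Return the dates whose data would be (re)ingested on run_date."""
    return [run_date - timedelta(days=offset) for offset in reingestion_offsets(layers)]


offsets = reingestion_offsets(ASSUMED_LAYERS)
print(len(offsets))   # 59
print(offsets[:9])    # [0, 1, 2, 3, 4, 5, 6, 7, 37]
```

Seen from a single image's point of view, data for a given date is picked up again whenever a later run's offset lands on it, so the gap between refreshes grows from 1 day to 30 days to 90 days as the data ages.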

<div style="text-align:center;">
<img src="dag_image_1.png" width="1000px"/>
<img src="dag_image_2.png" width="1000px"/>
<img src="dag_image_3.png" width="1000px"/>
<p>Europeana reingestion workflow</p>
</div>

[europeana_reingestion_issue]: https://github.com/creativecommons/cccatalog/issues/412
[europeana_reingestion_strategy]: https://github.com/creativecommons/cccatalog/pull/473
[data_reingestion_blog]: https://opensource.creativecommons.org/blog/entries/date-partitioned-data-reingestion/


## Merging Common Crawl tags
When a provider is shifted from Common Crawl to an API-based crawl, the new data from the API doesn’t have the tags and metadata that were generated using Clarifai, so the new data needs to be associated with the tags for the same image from the Common Crawl data. A direct URL match is not possible, as the Common Crawl URLs and the API image URLs are different, so we try to match on the number or identifier that is associated with the URL.

Currently the merging logic is applied to Science Museum, Museums Victoria and Met Museum.

In Science Museum, an API URL in the image table looks like https://coimages.sciencemuseumgroup.org.uk/images/240/862/large_BAB_S_1_02_0017.jpg and a Common Crawl (CC) URL looks like https://s3-eu-west-1.amazonaws.com/smgco-images/images/369/541/medium_SMG00096855.jpg. The idea is to reduce the URL to its last identifier-like part (in these examples, the file name with its size prefix dropped, written in reverse), so after modification by the modify_urls function the URLs look like ```gpj.7100_20_1_S_BAB_``` (API URL) and ```gpj.55869000GMS_``` (CC URL).
Similar logic has been applied to Met Museum and Museums Victoria.
- Issue: https://github.com/creativecommons/cccatalog/issues/468
- Related PR: https://github.com/creativecommons/cccatalog/pull/478
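
The matching idea for Science Museum can be sketched in Python as below. This is an illustration under assumptions, not the `modify_urls` function from the PR, which may be implemented differently: `url_key` and `merge_tags` are hypothetical names, the rows are plain dictionaries rather than rows of the image table, and `API_URL` is an invented URL that shares its file identifier with the Common Crawl example above.

```python
from urllib.parse import urlparse


def url_key(url):
    """Reduce an image URL to its reversed file identifier."""
    filename = urlparse(url).path.rsplit("/", 1)[-1]
    # Drop the size prefix ("large", "medium", ...) but keep the underscore.
    _size, sep, rest = filename.partition("_")
    identifier = sep + rest if sep else filename
    return identifier[::-1]


def merge_tags(api_rows, cc_rows):
    """Attach Common Crawl tags to API rows that share the same URL key."""
    cc_by_key = {url_key(row["url"]): row for row in cc_rows}
    merged = []
    for row in api_rows:
        match = cc_by_key.get(url_key(row["url"]))
        merged.append({**row, "tags": match["tags"] if match else None})
    return merged


CC_URL = "https://s3-eu-west-1.amazonaws.com/smgco-images/images/369/541/medium_SMG00096855.jpg"
# Hypothetical API URL for the same image, invented for this example.
API_URL = "https://coimages.sciencemuseumgroup.org.uk/images/369/541/large_SMG00096855.jpg"

cc_rows = [{"url": CC_URL, "tags": ["example-tag"]}]
api_rows = [{"url": API_URL}]

print(url_key(CC_URL))  # gpj.55869000GMS_
print(merge_tags(api_rows, cc_rows)[0]["tags"])  # ['example-tag']
```

Because the host, the directory numbers and the size prefix all differ between the two sources, only the identifier part of the file name is used as the join key.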


## Acknowledgement
I would like to thank my mentors Brent and Anna for their guidance throughout the internship.