Commit 30670c9

Merge branch 'master' into project_team_updates
2 parents 81fd800 + 4514982 commit 30670c9

2 files changed: +74 -0 lines changed

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
title: Smithsonian Unit Code Update
---
categories:

cc-catalog
gsoc
gsoc-2020
---
author: charini
---
series: gsoc-2020-cccatalog
---
pub_date: 2020-08-03
---
body:
## Introduction
The Creative Commons (CC) Catalog project collects and stores CC-licensed images scattered across the internet so
that they can be made accessible to the general public via the [CC Search][cc_search] and [CC Catalog API][cc_api]
tools. CC Catalog also stores, in the CC database, various pieces of information associated with each image that help
in the image search and categorisation process.

In my [previous blog post][flickr_blog_post] of this series, entitled 'Flickr Sub-provider Retrieval', I discussed how
the images from a certain provider (such as Flickr) can be categorised based on sub-provider values (which reflect
the underlying organisation or entity that published the images through the provider). We have similarly implemented
the sub-provider retrieval logic for the Europeana and Smithsonian providers. Unlike in Flickr and Europeana, every
single image from Smithsonian is categorised under some sub-provider value, where the sub-providers are identified
based on a unit code value contained in the API response (for more information, please refer to pull request
[#455][pr_455]). The unit code values and the corresponding sub-provider values are maintained in the dictionary
*SMITHSONIAN_SUB_PROVIDERS*. However, the *unit code* values may be updated at the Smithsonian API level, and it is
important that we have a mechanism for reflecting those updates in the *SMITHSONIAN_SUB_PROVIDERS* dictionary as
well. In this blog post, we discuss how we detect potential changes to the *unit code* values and keep the
*SMITHSONIAN_SUB_PROVIDERS* dictionary up to date.

[cc_search]: https://ccsearch.creativecommons.org/
[cc_api]: https://api.creativecommons.engineering/v1/
[flickr_blog_post]: ../flickr-sub-provider-retrieval/
[pr_455]: https://github.com/creativecommons/cccatalog/pull/455

## Implementation
### Retrieving the latest unit codes
To achieve this, we need to obtain the latest *unit codes* supported by the Smithsonian API. Furthermore, since we
are only interested in image data, only the *unit codes* associated with images need to be retrieved. The latest
Smithsonian *unit codes* corresponding to images can be retrieved by calling the endpoint
https://api.si.edu/openaccess/api/v1.0/terms/unit_code?q=online_media_type:Images&api_key=REDACTED
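
As a minimal sketch of this step, the snippet below builds the request URL and pulls the unit codes out of a parsed
JSON payload. The function names are hypothetical, and the payload shape (codes listed under
`response` and then `terms`) is an assumption that should be checked against a real response; no network call is made
here.

```python
from urllib.parse import urlencode

# Endpoint quoted in this post; the API key is supplied by the caller.
UNIT_CODE_ENDPOINT = "https://api.si.edu/openaccess/api/v1.0/terms/unit_code"


def build_unit_code_url(api_key):
    """Build the request URL for unit codes associated with images."""
    query = urlencode(
        {"q": "online_media_type:Images", "api_key": api_key},
        safe=":",  # keep the colon readable, as in the URL quoted above
    )
    return f"{UNIT_CODE_ENDPOINT}?{query}"


def extract_unit_codes(payload):
    """Return the set of unit codes found in a parsed JSON payload.

    ASSUMPTION: the codes are listed under payload["response"]["terms"].
    A missing key yields an empty set instead of raising an error.
    """
    terms = payload.get("response", {}).get("terms", [])
    return set(terms)


# Illustrative payload with made-up codes, not real API output.
sample_payload = {"response": {"terms": ["CODE_A", "CODE_B"]}}
latest_codes = extract_unit_codes(sample_payload)
```

In practice the payload would come from an HTTP GET on the URL returned by `build_unit_code_url`, for example with
the `requests` library.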

### Check for unit code updates
To identify whether the collection of *unit codes* supported by the Smithsonian API has changed (through additions
and/or deletions), we compare the values retrieved from the endpoint mentioned above with the values contained in the
*SMITHSONIAN_SUB_PROVIDERS* dictionary. All changes are recorded in a table named *smithsonian_new_unit_codes*, which
contains two fields, 'new_unit_code' and 'action'. If a new *unit code* is introduced at the API level, we store that
*unit code* value in the table with the corresponding action value 'add'. This indicates that the given *unit code*
value needs to be added to the *SMITHSONIAN_SUB_PROVIDERS* dictionary. If a *unit code* that appears in the
*SMITHSONIAN_SUB_PROVIDERS* dictionary does not appear at the API level, we store the *unit code* value in the table
with the corresponding action value 'delete', indicating that it needs to be deleted from the dictionary.
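
The comparison described above amounts to two set differences. The sketch below is one way to express it; the
function name is hypothetical, and it takes the known codes as a plain collection so that it makes no assumption
about how *SMITHSONIAN_SUB_PROVIDERS* is laid out internally.

```python
def diff_unit_codes(api_unit_codes, known_unit_codes):
    """Return (new_unit_code, action) rows describing the required changes.

    api_unit_codes: codes currently reported by the API.
    known_unit_codes: codes currently present in the sub-provider dictionary.
    """
    api_codes = set(api_unit_codes)
    known_codes = set(known_unit_codes)

    # Codes the API reports but the dictionary lacks must be added.
    rows = [(code, "add") for code in sorted(api_codes - known_codes)]
    # Codes the dictionary holds but the API no longer reports must be deleted.
    rows += [(code, "delete") for code in sorted(known_codes - api_codes)]
    return rows


# Made-up codes for illustration only.
changes = diff_unit_codes({"A", "B", "C"}, {"B", "C", "D"})
# changes == [("A", "add"), ("D", "delete")]
```

Sorting is not required by the logic; it only makes the resulting rows deterministic.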

### Triggering the unit code update workflow
A separate workflow named *check_new_smithsonian_unit_codes_workflow* allows the logic discussed above to be executed
via the Airflow UI. On each execution, the table *smithsonian_new_unit_codes* is completely cleared of previous data,
and the latest updates to be reflected in the *SMITHSONIAN_SUB_PROVIDERS* dictionary are stored. Note that the actual
updates to the dictionary (as recorded in the table) need to be carried out by a person, since editing the dictionary
is not automated. Furthermore, this workflow is expected to be executed at least once a week, preferably before
running the Smithsonian image retrieval script, so that the Smithsonian sub-provider retrieval task can run without
issues.
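
The clear-then-store behaviour of each run can be sketched as follows. This post does not say which database engine
backs the table, so `sqlite3` is used here purely as a self-contained stand-in, and the function name and column
types are assumptions; only the table name and its two fields come from the description above.

```python
import sqlite3

TABLE = "smithsonian_new_unit_codes"


def refresh_unit_code_table(conn, rows):
    """Replace the table contents with the latest (new_unit_code, action) rows."""
    with conn:  # commit on success, roll back if any statement fails
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {TABLE} "
            "(new_unit_code TEXT, action TEXT)"  # column types are assumed
        )
        # Clear whatever the previous run stored.
        conn.execute(f"DELETE FROM {TABLE}")
        # Store the changes detected in this run.
        conn.executemany(
            f"INSERT INTO {TABLE} (new_unit_code, action) VALUES (?, ?)",
            rows,
        )


# Two consecutive runs against an in-memory database, with made-up codes.
conn = sqlite3.connect(":memory:")
refresh_unit_code_table(conn, [("CODE_A", "add")])
refresh_unit_code_table(conn, [("CODE_B", "delete")])
stored = conn.execute(f"SELECT new_unit_code, action FROM {TABLE}").fetchall()
```

After the second run only the second set of rows remains, which is what lets a person read the table as the current
list of pending dictionary edits.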

## Acknowledgement
I express my gratitude to my GSoC supervisor Brent Moran for assisting me with this task.

databags/community_team_members.json

Lines changed: 8 additions & 0 deletions
@@ -45,6 +45,10 @@
 {
 "name": "Onyenanu Princewill",
 "role": "Project Contributor"
+},
+{
+"name": "Dhruv Bhanushali",
+"role": "Project Collaborator"
 }
 ],
 "name": "CC Catalog API",
@@ -67,6 +71,10 @@
 {
 "name": "Abhishek Naidu",
 "role": "Project Collaborator"
+},
+{
+"name": "Dhruv Bhanushali",
+"role": "Project Collaborator"
 }
 ],
 "name": "CC Search",
