| 1 | +title: Smithsonian Unit Code Update |
| 2 | +--- |
| 3 | +categories: |
| 4 | + |
| 5 | +cc-catalog |
| 6 | +gsoc |
| 7 | +gsoc-2020 |
| 8 | +--- |
| 9 | +author: charini |
| 10 | +--- |
| 11 | +series: gsoc-2020-cccatalog |
| 12 | +--- |
| 13 | +pub_date: 2020-08-03 |
| 14 | +--- |
| 15 | +body: |
| 16 | +## Introduction |
| 17 | +The Creative Commons (CC) Catalog project collects and stores CC-licensed images scattered across the internet so |
| 18 | +that they can be made accessible to the general public via the [CC Search][cc_search] and [CC Catalog API][cc_api] |
| 19 | +tools. Various pieces of information associated with each image, which help in the image search and categorisation |
| 20 | +process, are stored via CC Catalog in the CC database. |
| 21 | + |
| 22 | +In my [previous blog post][flickr_blog_post] of this series entitled 'Flickr Sub-provider Retrieval', I discussed how |
| 23 | +the images from a certain provider (such as Flickr) can be categorised based on the sub-provider values (which reflect |
| 24 | +the underlying organisation or entity that published the images through the provider). We have similarly implemented |
| 25 | +the sub-provider retrieval logic for the Europeana and Smithsonian providers. Unlike with Flickr and Europeana, every |
| 26 | +image from Smithsonian is categorised under some sub-provider value, where the sub-providers are identified based on a |
| 27 | +unit code value contained in the API response (for more information please refer to the pull request [#455][pr_455]). |
| 28 | +The unit code values and the corresponding sub-provider values are maintained in the dictionary |
| 29 | +*SMITHSONIAN_SUB_PROVIDERS*. However, the *unit code* values may be updated at the |
| 30 | +Smithsonian API level, and it is important that we have a mechanism for reflecting those updates in the |
| 31 | +*SMITHSONIAN_SUB_PROVIDERS* dictionary as well. In this blog post, we discuss how we detect potential |
| 32 | +changes to the *unit code* values and keep the *SMITHSONIAN_SUB_PROVIDERS* dictionary up to date. |
| 33 | + |
| 34 | +[cc_search]: https://ccsearch.creativecommons.org/ |
| 35 | +[cc_api]: https://api.creativecommons.engineering/v1/ |
| 36 | +[flickr_blog_post]: ../flickr-sub-provider-retrieval/ |
| 37 | +[pr_455]: https://github.com/creativecommons/cccatalog/pull/455 |
| 38 | + |
| 39 | +## Implementation |
| 40 | +### Retrieving the latest unit codes |
| 41 | +To achieve this, we need to obtain the latest *unit codes* supported by the Smithsonian API. Furthermore, |
| 42 | +since we are only interested in image data, only the *unit codes* associated with images need to be |
| 43 | +retrieved. The latest Smithsonian *unit codes* corresponding to images can be retrieved by calling the endpoint |
| 44 | +https://api.si.edu/openaccess/api/v1.0/terms/unit_code?q=online_media_type:Images&api_key=REDACTED |
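As a rough illustration of the request described above, the following minimal sketch builds the endpoint URL and fetches the JSON payload using only the Python standard library. The function names `build_unit_code_url` and `fetch_unit_code_payload` are hypothetical and are not taken from the CC Catalog code base, and the layout of the returned JSON is deliberately not assumed here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the terms endpoint quoted in this post.
UNIT_CODE_ENDPOINT = "https://api.si.edu/openaccess/api/v1.0/terms/unit_code"


def build_unit_code_url(api_key):
    """Build the request URL that restricts unit codes to image media.

    Hypothetical helper: the name and signature are illustrative only.
    """
    query = urlencode({"q": "online_media_type:Images", "api_key": api_key})
    return f"{UNIT_CODE_ENDPOINT}?{query}"


def fetch_unit_code_payload(api_key, timeout=30):
    """Call the endpoint and return the parsed JSON body as-is.

    The structure of the payload is not assumed here; the caller is
    expected to inspect it and extract the list of unit codes.
    """
    with urlopen(build_unit_code_url(api_key), timeout=timeout) as response:
        return json.load(response)


# Building the URL needs no network access; "EXAMPLE_KEY" is a placeholder.
example_url = build_unit_code_url("EXAMPLE_KEY")
```

Note that `urlencode` percent-encodes the colon in the query value, which is equivalent to the literal form shown in the URL above.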
| 45 | + |
| 46 | +### Checking for unit code updates |
| 47 | +In order to identify whether changes have occurred to the collection of *unit codes* supported by the Smithsonian API |
| 48 | +(in the form of additions and/or deletions), we compare the values retrieved by calling the previously mentioned |
| 49 | +endpoint with the values contained in the *SMITHSONIAN_SUB_PROVIDERS* dictionary. All changes are recorded in a table |
| 50 | +named *smithsonian_new_unit_codes*, which contains two fields, 'new_unit_code' and 'action'. If a new *unit code* is |
| 51 | +introduced at the API level, we store that *unit code* value with the corresponding action value 'add' in the table. |
| 52 | +This indicates that the given *unit code* value needs to be added to the *SMITHSONIAN_SUB_PROVIDERS* dictionary. If a |
| 53 | +*unit code* that appears in the *SMITHSONIAN_SUB_PROVIDERS* dictionary does not appear at the API level, we store |
| 54 | +the *unit code* value with the corresponding action value 'delete' in the table, indicating that it needs to be deleted |
| 55 | +from the dictionary. |
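The comparison described above amounts to two set differences. The sketch below is one possible way to express it, not the project's actual implementation. It assumes that the dictionary maps each sub-provider name to a set of unit codes; the function name `find_unit_code_changes` and all example values are hypothetical.

```python
def find_unit_code_changes(api_unit_codes, sub_providers):
    """Map each changed unit code to the action needed in the dictionary.

    Assumption: `sub_providers` maps a sub-provider name to a set of
    unit codes, mirroring the role of SMITHSONIAN_SUB_PROVIDERS.
    """
    api_codes = set(api_unit_codes)

    # Flatten every unit code currently known to the dictionary.
    known_codes = set()
    for codes in sub_providers.values():
        known_codes.update(codes)

    # Codes only at the API level must be added to the dictionary.
    changes = {code: "add" for code in api_codes - known_codes}
    # Codes only in the dictionary must be deleted from it.
    changes.update({code: "delete" for code in known_codes - api_codes})
    return changes


# Illustrative data only; these are not real Smithsonian unit codes.
example_sub_providers = {
    "museum_a": {"UNIT_A1", "UNIT_A2"},
    "museum_b": {"UNIT_B1"},
}
example_api_codes = ["UNIT_A1", "UNIT_B1", "UNIT_C1"]

changes = find_unit_code_changes(example_api_codes, example_sub_providers)
```

Here `UNIT_C1` is reported with the action 'add' and `UNIT_A2` with the action 'delete', while codes present on both sides produce no entry.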
| 56 | + |
| 57 | +### Triggering the unit code update workflow |
| 58 | +A separate workflow named *check_new_smithsonian_unit_codes_workflow* allows executing the logic we discussed via the |
| 59 | +Airflow UI. For each execution, the table *smithsonian_new_unit_codes* is completely cleared of previous data, and the |
| 60 | +latest updates to be reflected in the *SMITHSONIAN_SUB_PROVIDERS* dictionary are stored. Note that the actual updates to |
| 61 | +the dictionary (as recorded in the table) need to be carried out by a person, since editing the dictionary is not |
| 62 | +automated. Furthermore, this workflow is expected to be executed at least once a week, preferably prior to running |
| 63 | +the Smithsonian image retrieval script, so that the Smithsonian sub-provider retrieval task can run without issues. |
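The clear-then-store behaviour of each run can be sketched as follows. This is only an illustration under stated assumptions: it uses an in-memory SQLite database from the Python standard library as a stand-in for the catalog's database, the column types are guesses, and the function name `refresh_unit_code_table` is hypothetical.

```python
import sqlite3


def refresh_unit_code_table(connection, changes):
    """Replace the table contents with the latest unit code changes.

    Assumption: SQLite stands in for the real database, and the column
    types are illustrative. Only the table and field names come from
    the description above.
    """
    # The connection context manager commits on success and rolls back
    # on error, so a failed run does not leave a half-cleared table.
    with connection:
        connection.execute(
            "CREATE TABLE IF NOT EXISTS smithsonian_new_unit_codes ("
            "new_unit_code TEXT PRIMARY KEY, action TEXT NOT NULL)"
        )
        # Clear the results of any previous execution.
        connection.execute("DELETE FROM smithsonian_new_unit_codes")
        connection.executemany(
            "INSERT INTO smithsonian_new_unit_codes (new_unit_code, action) "
            "VALUES (?, ?)",
            sorted(changes.items()),
        )


connection = sqlite3.connect(":memory:")
# The first run stores one stale entry; the second run replaces it entirely.
refresh_unit_code_table(connection, {"UNIT_OLD": "delete"})
refresh_unit_code_table(connection, {"UNIT_C1": "add", "UNIT_A2": "delete"})

rows = connection.execute(
    "SELECT new_unit_code, action FROM smithsonian_new_unit_codes "
    "ORDER BY new_unit_code"
).fetchall()
```

After the second call, the table holds only the rows from the latest run, which is what a person would consult before editing the dictionary by hand.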
| 64 | + |
| 65 | +## Acknowledgement |
| 66 | +I express my gratitude to my GSoC supervisor Brent Moran for assisting me with this task. |