Commit 85bab7a

Merge pull request creativecommons#479 from ChariniNana/master
Final GSoC blog post
2 parents f725f4c + 48a7dcd commit 85bab7a

1 file changed, +88 -0 lines changed
  • content/blog/entries/overview-of-the-gsoc-2020-project

@@ -0,0 +1,88 @@
title: Overview of the GSoC 2020 Project
---
categories:

cc-catalog
gsoc
gsoc-2020
---
author: charini
---
series: gsoc-2020-cccatalog
---
pub_date: 2020-08-26
---
body:
This is my final blog post in the [GSoC 2020: CC catalog][cc_catalog_series] series, in which I highlight and
summarize my contributions to Creative Commons (CC) as part of my GSoC project. The CC Catalog project collects and
stores CC-licensed images scattered across the internet so that they can be made accessible to the general public via
the [CC Search][cc_search] and [CC Catalog API][cc_api] tools. I had the opportunity to work on different aspects of
the CC Catalog repository, all of which ultimately enhance the user experience of the CC Search and CC Catalog API
tools. My primary contributions during GSoC, and the related pull requests (PRs), are as follows.

1. **Sub-provider retrieval**: The first task I completed as part of my GSoC project was the retrieval of sub-providers
    (also known as _source_) so that images can be categorised under these sources, giving users a better search
    experience. I implemented sub-provider retrieval for three providers: Flickr, Europeana, and Smithsonian. If you
    are interested in learning how the retrieval logic works, please check my
    [initial blog post][flickr_blog_post] of this series. The PRs related to this task are as follows.
    - PR #[420][pr_420]: Retrieve sub-providers within Flickr
    - PR #[442][pr_442]: Retrieve sub-providers within Europeana
    - PR #[455][pr_455]: Retrieve sub-providers within Smithsonian
    - PR #[461][pr_461]: Add new source as a sub-provider of Flickr

2. **Alerts for updates to Smithsonian unit codes**: For the Smithsonian provider, we rely on a field known as
    _unit code_ to determine the sub-provider (for Smithsonian, often a museum) each image belongs to. However, the
    _unit code_ values can change upstream over time, and if CC is unaware of these changes, Smithsonian images may no
    longer be categorised correctly under unique sub-provider values. I have therefore introduced a mechanism that
    alerts the CC code maintainers to potential upstream changes in _unit code_ values. More information is provided
    in my [second blog post][unit_code_blog_post] of this series. The PR related to this task is #[465][pr_465].

3. **Improvements to the Smithsonian provider API script**: Smithsonian is an important provider which aggregates
    images from 19 museums. However, the museums use different data models, so the JSON responses returned by the
    Smithsonian API are inconsistent, and it is difficult to know which fields to rely on to obtain the information
    CC needs. As a result, CC misses out on certain important information. As part of my GSoC project, I improved the
    completeness of the _creator_ and _description_ information by identifying previously unknown fields from which
    these details could be retrieved. Even though my improvements did not produce a comprehensive list of fields, the
    completeness of the data was considerably improved for some Smithsonian museums compared to before. For more
    context about this issue, please refer to ticket #[397][issue_397]. Apart from improving the Smithsonian data, I
    was also able to identify Smithsonian API responses which lack mandatory information for some of the museums. We
    have informed the Smithsonian technical team of these issues, and they are highlighted in ticket
    #[397][issue_397] as well. The PRs related to this task are as follows.
    - PR #[474][pr_474]: Improve the _creator_ and _description_ information of the Smithsonian source _National
      Museum of Natural History_ (NMNH). This is the largest museum (source) under the Smithsonian provider.
    - PR #[476][pr_476]: Improve the _creator_ and _description_ information of the other sources under the
      Smithsonian provider.

4. **Expiration of outdated images**: The final task I completed as part of my GSoC project was implementing a strategy
    for expiring outdated images in the CC database. CC has a mechanism for keeping the images it has retrieved from
    providers up to date, based on how old an image is. This is called the
    [re-ingestion strategy][reingest_blog_post], where newer images are updated more frequently than older images.
    However, this re-ingestion strategy does not detect images which have been deleted upstream. Thus, some of the
    images stored in the CC database may be obsolete, which could result in broken links being presented via the
    [CC Search][cc_search] tool. As a solution, I have implemented a mechanism for identifying obsolete images in the
    CC database by looking at the *updated_on* column value of the CC image table. From the re-ingestion strategy of
    each provider, we know the oldest *updated_on* value an image can have. If the *updated_on* value is older than
    this oldest valid value, we flag the corresponding image record as obsolete. The PR related to this task
    is #[483][pr_483].
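
The expiration check described in the last item can be illustrated with a minimal sketch. This is not the code from the
PR: the provider gap values, the record layout, and the function names below are hypothetical and chosen only for
illustration. The only detail taken from the text is that the *updated_on* value is compared against the oldest value
allowed by the provider's re-ingestion strategy.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical values: the longest gap (in days) between two re-ingestions of
# the same image under each provider's strategy. These are NOT the real settings.
MAX_REINGESTION_GAP_DAYS = {"flickr": 180, "europeana": 365}


def oldest_valid_updated_on(provider, now):
    """Return the oldest updated_on value a still-valid image of this provider can have."""
    return now - timedelta(days=MAX_REINGESTION_GAP_DAYS[provider])


def find_obsolete(records, now):
    """Return the identifiers of records whose updated_on is older than the cutoff."""
    obsolete = []
    for record in records:
        cutoff = oldest_valid_updated_on(record["provider"], now)
        if record["updated_on"] < cutoff:
            obsolete.append(record["identifier"])
    return obsolete


# Illustrative in-memory records standing in for rows of the image table.
now = datetime(2020, 8, 26, tzinfo=timezone.utc)
records = [
    {"identifier": "a", "provider": "flickr",
     "updated_on": datetime(2020, 8, 1, tzinfo=timezone.utc)},
    {"identifier": "b", "provider": "flickr",
     "updated_on": datetime(2019, 1, 1, tzinfo=timezone.utc)},
    {"identifier": "c", "provider": "europeana",
     "updated_on": datetime(2019, 9, 1, tzinfo=timezone.utc)},
]
print(find_obsolete(records, now))  # prints ['b']
```

In this sketch only record `b` is flagged, because it was last updated before the cutoff for its provider. In a real
database the same comparison would more likely be expressed as a query over the *updated_on* column than as a loop
over in-memory records.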

I will continue to take responsibility for maintaining my code in the CC Catalog repository, and I hope to continue
contributing to the CC codebase. It has been a wonderful GSoC journey for me, and special thanks go to my supervisor
Brent for his guidance.


[cc_catalog_series]: ./#series
[cc_search]: https://ccsearch.creativecommons.org/
[cc_api]: https://api.creativecommons.engineering/v1/
[flickr_blog_post]: ../flickr-sub-provider-retrieval/
[unit_code_blog_post]: ../smithsonian-unit-code-update/
[reingest_blog_post]: ../date-partitioned-data-reingestion/
[pr_420]: https://github.com/creativecommons/cccatalog/pull/420
[pr_442]: https://github.com/creativecommons/cccatalog/pull/442
[pr_455]: https://github.com/creativecommons/cccatalog/pull/455
[pr_461]: https://github.com/creativecommons/cccatalog/pull/461
[pr_465]: https://github.com/creativecommons/cccatalog/pull/465
[pr_474]: https://github.com/creativecommons/cccatalog/pull/474
[pr_476]: https://github.com/creativecommons/cccatalog/pull/476
[pr_483]: https://github.com/creativecommons/cccatalog/pull/483
[issue_397]: https://github.com/creativecommons/cccatalog/issues/397
