
Commit c421801

Merge branch 'master' into table_to_markdown_table_change
2 parents 39f6550 + 0eb1a83 commit c421801

11 files changed

+318
-7
lines changed

.github/CODEOWNERS

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
11
# These owners will be the default owners for everything in
22
# the repo. Unless a later match takes precedence, they will
33
# be requested for review when someone opens a pull request.
4-
* @creativecommons/internal-tech @creativecommons/ct-cc-open-source-core-committers
4+
* @creativecommons/engineering @creativecommons/ct-cc-open-source-core-committers
55

66
# These users own any files in the specified directory and
77
# any of its subdirectories.

content/blog/authors/brenoferreira/contents.lr

Lines changed: 1 addition & 1 deletion
@@ -5,4 +5,4 @@ name: Breno Ferreira
55
md5_hashed_email: c23cf5bf54e6df322a3b17d176b76320
66
---
77
about:
8-
[Breno](https://creativecommons.org/author/brenoferreira/) is Front End Engineer at Creative Commons. He's `@brenoferreira` on the CC Slack.
8+
[Breno](https://creativecommons.org/author/brenoferreira/) was previously Front End Engineer at Creative Commons.
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
1+
username: ChariniNana
2+
---
3+
name: Charini Nanayakkara
4+
---
5+
md5_hashed_email: e9440a4f3196442dd46ba8eca21041c8
6+
---
7+
about:
8+
Charini is a PhD student from the Australian National University, who is working on
9+
the [cccatalog](https://github.com/creativecommons/cccatalog) repository as a part of GSoC 2020.
10+
She is `@Charini Nanayakkara` on Slack.
Lines changed: 225 additions & 0 deletions
@@ -0,0 +1,225 @@
1+
title: Flickr Sub-provider Retrieval
2+
---
3+
categories:
4+
5+
cc-catalog
6+
gsoc
7+
gsoc-2020
8+
---
9+
author: charini
10+
---
11+
series: gsoc-2020-cccatalog
12+
---
13+
pub_date: 2020-06-24
14+
---
15+
body:
16+
## Introduction
17+
The Creative Commons (CC) licensed images made available via CC Search and CC Catalog API tools are retrieved from
18+
numerous sources (which we refer to as providers) such as Flickr and different museum collections. While the existing
19+
implementation of the CC Catalog tools enables filtering images in various manners such as based on image tags, the
20+
provider, and the license type, it does not facilitate searching for images from particularly valuable internal sources
21+
(referred to as sub-providers). For example, images related to 'NASA' have significant value in the Flickr collection,
22+
since 'NASA' related pictures are extensively used by a large audience especially for educational purposes. The aim of
23+
my first task in the GSoC project is to implement required changes in the API script level and in the existing data in
24+
the database, such that filtering by certain important sub-providers is made possible.
25+
26+
While there are several providers such as Flickr, Europeana, and Smithsonian from which we need to extract
27+
sub-providers, the consensus was to focus initially on Flickr, since it is currently in production and a
28+
substantial number of the images made available via CC Search come from Flickr. Thus, in this initial blog post, I will
29+
discuss how I addressed the requirement of sub-provider retrieval in Flickr by making the necessary changes in the
30+
[Creative Commons Catalog](https://github.com/creativecommons/cccatalog) repository.
31+
32+
## Research
33+
The primary research involved in the Flickr sub-provider retrieval task was defining which entities to identify as
34+
sub-providers, and identifying how those sub-providers can be retrieved based on the image related information we
35+
retain.
36+
37+
### Definition of a sub-provider
38+
It was decided that a sub-provider should be a collection of user accounts in Flickr, where this collection corresponded
39+
to a common entity, and the common entity would reflect the sub-provider. For example, since both Flickr user accounts
40+
*NASA HQ PHOTO* and *NASA Johnson* provide images related to NASA, we would represent the NASA sub-provider by those
41+
two (and other related) user accounts.
42+
43+
The next challenge was to determine how to identify which collections of user accounts were important to a wider
44+
audience. The number of views per user account was an intuitive measure to rely on for this requirement. My supervisor
45+
Brent Moran executed a query on the existing CC database to obtain the 50 most popular user accounts in Flickr. A
46+
snippet of the query response is as follows:
47+
48+
user_account_name | total_views
49+
--- | ---
50+
Apollo Image Gallery | 1216297208
51+
BioDivLibrary | 625528813
52+
manhhai | 445714729
53+
Thomas Hawk | 300554527
54+
Sangudo | 258177509
55+
NASA Goddard Photo and Video | 225143949
56+
57+
Despite having a significant number of views, some of these user accounts did not appear to be worth being identified
58+
as belonging to a sub-provider due to their lack of educational importance. Thus, we manually curated this list to
59+
retain what we believed to be important to a wider audience.
60+
61+
### Sub-provider identification
62+
In order to identify the sub-provider each image from Flickr belonged to, it was necessary to determine which field in
63+
the stored image data referred to the user account. Of the information contained in an API response, only a
64+
selected set of fields is stored on the CC end, and it was important to use such stored data for the identification of
65+
sub-providers. We initially decided to rely on the user account name which was reflected by the *ownername* field in
66+
the JSON response and stored as the *creator* in the CC database. However, we later realised that the names of accounts
67+
could change over time, so the account name was not a reliable field for extracting the sub-provider. Another
68+
field from the JSON response that helped to uniquely identify a user account was the *owner* field, which acted like a
69+
unique user ID. Even though the *owner* value was not directly stored in the CC database, it was stored as part of the
70+
*creator URL* field, and fortunately, all creator URLs from Flickr consisted of a common prefix plus the *owner* value
71+
(the user id). Thus, we decided to use the *creator URL* value retained in the CC database for identifying sub-providers
72+
in Flickr.
73+
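As a rough illustration of this last point, the user ID can be recovered from a stored creator URL by stripping the common prefix. This is a minimal sketch rather than code from the cccatalog repository: the helper name `owner_from_creator_url` and the constant name are hypothetical, and the prefix is the one shared by Flickr creator URLs.

```python
from typing import Optional

# Common prefix shared by Flickr creator URLs (constant name is hypothetical).
FLICKR_CREATOR_URL_PREFIX = 'https://www.flickr.com/photos/'


def owner_from_creator_url(creator_url: Optional[str]) -> Optional[str]:
    """Return the Flickr user ID embedded in a creator URL, or None.

    Hypothetical helper, for illustration only.
    """
    if not creator_url or not creator_url.startswith(FLICKR_CREATOR_URL_PREFIX):
        return None
    # Drop the prefix and any trailing slash left after the user ID.
    owner = creator_url[len(FLICKR_CREATOR_URL_PREFIX):].strip('/')
    return owner or None
```

A creator URL that does not start with the expected prefix yields `None`, so such rows would simply keep the default source.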
74+
75+
## Implementation
76+
There are two levels at which sub-provider retrieval needs to be supported, where the first concerns the API scripts
77+
from which we initially pull the data from different providers to keep the CC collections up to date. The second is the
78+
CC database level, where the existing data needs to be updated so that it reflects the sub-providers in the same way as
79+
the newly added image information.
80+
81+
The following sections explain how we represented the sub-provider information in the implementation, the changes made
82+
at Flickr API script level and the database update logic to support sub-provider retrieval.
83+
84+
### Representing the sub-provider information
85+
As previously explained, we define a sub-provider as a collection of user accounts, and it was identified that the
86+
unique user ID returned in the Flickr JSON response (referred to as the *owner*) was a reliable field for uniquely
87+
identifying each user account. For the time being, we focused on sub-providers NASA, SpaceX, and the Biodiversity
88+
Heritage Library (BioDivLibrary) based on their considerable importance to the community. Using the top six NASA related
89+
user accounts, the 'Official SpaceX Photos' user account, and the 'BioDivLibrary' user account as filtered by Brent's
90+
query, we identified the corresponding user IDs (content of the *owner* field) using the
91+
**flickr.people.findByUsername** method made available in the Flickr API. The mapping between the sub-provider and the
92+
corresponding user IDs was stored in a dictionary as follows.
93+
94+
```python
95+
FLICKR_SUB_PROVIDERS = {
96+
'nasa': {
97+
'24662369@N07', # NASA Goddard Photo and Video
98+
'35067687@N04', # NASA HQ PHOTO
99+
'29988733@N04', # NASA Johnson
100+
'28634332@N05', # NASA's Marshall Space Flight Center
101+
'108488366@N07', # NASAKennedy
102+
'136485307@N06' # Apollo Image Gallery
103+
},
104+
'bio_diversity': {
105+
'61021753@N02' # BioDivLibrary
106+
},
107+
'spacex': {
108+
'130608600@N05' # Official SpaceX Photos
109+
}
110+
}
111+
```
112+
113+
Since this information was required both at the API script level and the database level to retrieve sub-providers, we
114+
stored it in a common file accessible from both levels.
115+
116+
The next challenge was to identify how to reflect the sub-provider of each image using the existing database schema.
117+
There are two relevant fields in the database, *provider* and *source*. The *provider* reflects the main source
118+
from which the images are retrieved, which happens to be 'Flickr' in this scenario. The *source* field reflects an
119+
organisation or entity that has published the photos using 'Flickr' in this instance (or some other site that we
120+
recognise as a *provider*). The *source* field was previously not utilised and was simply set to the value of the
121+
*provider* in the Flickr API script. Based on internal discussions, it was decided that the *source* field was to be
122+
used for reflecting the sub-provider, if the corresponding image belonged to any of the user accounts contained in our
123+
dictionary *FLICKR_SUB_PROVIDERS*. Otherwise the *source* was set to the default *provider* value 'Flickr'.
124+
125+
### Sub-provider retrieval at API script level
126+
Retrieving the sub-provider from the Flickr API script was fairly straightforward. Since the complete JSON response was
127+
available at the API script level, we did not have to worry about retrieving the user ID (*owner* value) from the
128+
*creator URL* field in our data. Rather, we simply get the owner value from the API response, and try to search for it
129+
in the *FLICKR_SUB_PROVIDERS* dictionary as follows.
130+
131+
```python
132+
owner = image_data.get('owner').strip()
133+
source = next((s for s in FLICKR_SUB_PROVIDERS if owner in FLICKR_SUB_PROVIDERS[s]), 'Flickr')
134+
```
135+
136+
Since the collection of user IDs corresponding to each sub-provider is represented as a set, the membership check for
137+
each sub-provider is O(1) and therefore the total time complexity is linear in the number of sub-providers (that is O(n)
138+
for n sub-providers). Due to the number of sub-providers of interest being minimal (currently it is 3), this search
139+
logic is quite efficient.
140+
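If the number of sub-providers were to grow, one possible alternative (not part of the implementation described here) would be to invert the mapping once, so that each lookup becomes a single dictionary access. The sketch below uses a shortened copy of the dictionary; the names `OWNER_TO_SUB_PROVIDER` and `get_source` are hypothetical.

```python
# Shortened copy of the mapping shown earlier, for illustration only.
FLICKR_SUB_PROVIDERS = {
    'nasa': {'24662369@N07', '35067687@N04'},
    'bio_diversity': {'61021753@N02'},
    'spacex': {'130608600@N05'},
}

# Inverted once at import time: user ID -> sub-provider name.
OWNER_TO_SUB_PROVIDER = {
    owner: sub_provider
    for sub_provider, owners in FLICKR_SUB_PROVIDERS.items()
    for owner in owners
}


def get_source(owner, default='Flickr'):
    """Return the sub-provider for a user ID, or the default provider."""
    if owner is None:
        # A response without an owner falls back to the default provider.
        return default
    return OWNER_TO_SUB_PROVIDER.get(owner.strip(), default)
```

With only three sub-providers the difference is negligible, so this is a design option rather than a needed optimisation.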
141+
Once we determine whether the *source* field should be set to a sub-provider value or the default ‘Flickr’ value with
142+
the given logic, we set the *source* value in the image store accordingly.
143+
144+
### Sub-provider update at the database level
145+
When updating sub-providers at the database level, we need to rely on the creator URL field to obtain the user ID of
146+
each image. The creator URL is of the following form.
147+
148+
'https://www.flickr.com/photos/' + *User ID*
149+
150+
For the purpose of automating the process of updating the database to reflect sub-providers, I added the necessary SQL
151+
queries and made them accessible via the Apache Airflow UI. The database update logic is as follows.
152+
153+
As the first step, I create a temporary table and populate it with the sub-provider values and the corresponding
154+
creator URLs. This is done by iterating through the sub-provider, user ID value pairs in the *FLICKR_SUB_PROVIDERS*
155+
dictionary, and concatenating the user ID with the prefix 'https://www.flickr.com/photos/' to obtain the creator URL.
156+
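The iteration described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `sub_provider_rows` is hypothetical, the dictionary is a shortened copy of the real one, and the actual insertion into the temporary table is left out.

```python
FLICKR_CREATOR_URL_PREFIX = 'https://www.flickr.com/photos/'

# Shortened copy of the mapping, for illustration only.
FLICKR_SUB_PROVIDERS = {
    'bio_diversity': {'61021753@N02'},
    'spacex': {'130608600@N05'},
}


def sub_provider_rows(sub_providers):
    """Yield (creator_url, sub_provider) pairs, one per user ID."""
    for sub_provider, user_ids in sub_providers.items():
        # Sort the set so the generated rows come out in a stable order.
        for user_id in sorted(user_ids):
            yield FLICKR_CREATOR_URL_PREFIX + user_id, sub_provider


rows = list(sub_provider_rows(FLICKR_SUB_PROVIDERS))
```

Each resulting pair corresponds to one row of the temporary table.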
157+
The initial plan was to then perform a join on the CC image table (where all the image related information is stored)
158+
with the temporary table on the condition that the creator URL from the image table matches that of the temporary table.
159+
This query, which updates every matching row in the image table with its sub-provider value, looks as
160+
follows.
161+
162+
```sql
163+
UPDATE {image_table}
164+
SET {col.SOURCE} = public.{temp_table}.{col.SUB_PROVIDER}
165+
FROM public.{temp_table}
166+
WHERE
167+
{image_table}.{col.CREATOR_URL} = public.{temp_table}.{col.CREATOR_URL};
168+
```
169+
170+
However, a major concern with this query, as my supervisor Brent Moran pointed out, was that it locked all the rows
171+
which matched the 'WHERE' clause at once. Given the volume of the Flickr data available in the CC image
172+
table, this meant that the above query would lock millions of rows, thus hindering the execution of other queries on
173+
the image table. To mitigate this issue, we decided to update the SQL query as follows, such that we perform a 'SELECT'
174+
query on the rows to be updated by joining the previously created temporary table with the CC image table (a 'SELECT'
175+
query does not lock the table), and iterate row by row over the query result to set the *source* value in the image
176+
table to the sub-provider value.
177+
178+
```python
179+
SELECT
180+
{col.FOREIGN_ID} AS foreign_id,
181+
public.{temp_table}.{col.PROVIDER} AS sub_provider
182+
FROM {image_table}
183+
INNER JOIN public.{temp_table}
184+
ON
185+
{image_table}.{col.CREATOR_URL} = public.{temp_table}.{col.CREATOR_URL}
186+
AND
187+
{image_table}.{col.PROVIDER} = 'Flickr';
188+
189+
# Let us refer to the result produced from the SELECT query as 'selected_records'
190+
191+
for (foreign_id, sub_provider) in selected_records:
192+
UPDATE {image_table}
193+
SET {col.SOURCE} = '{sub_provider}'
194+
WHERE
195+
{image_table}.{col.PROVIDER} = 'Flickr'
196+
AND
197+
MD5({image_table}.{col.FOREIGN_ID}) = MD5('{foreign_id}');
198+
```
199+
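To make the select-then-update pattern concrete, here is a minimal, self-contained sketch using Python's built-in `sqlite3` module with an in-memory database. The table and column names (`image`, `temp_sub_prov`, and so on) are placeholders chosen for this example, and SQLite merely stands in for the production database, so the locking behaviour is not reproduced. The MD5 comparison is also omitted because SQLite has no built-in `MD5` function. The sketch only shows the control flow, with parameterised queries in place of string formatting.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE image (
        foreign_id TEXT, provider TEXT, source TEXT, creator_url TEXT
    );
    CREATE TABLE temp_sub_prov (creator_url TEXT, sub_provider TEXT);
    INSERT INTO image VALUES
        ('a1', 'flickr', 'flickr', 'https://www.flickr.com/photos/61021753@N02'),
        ('a2', 'flickr', 'flickr', 'https://www.flickr.com/photos/unknown');
    INSERT INTO temp_sub_prov VALUES
        ('https://www.flickr.com/photos/61021753@N02', 'bio_diversity');
""")

# Step 1: read-only query that finds the rows needing an update.
selected_records = conn.execute("""
    SELECT image.foreign_id, temp_sub_prov.sub_provider
    FROM image
    INNER JOIN temp_sub_prov
        ON image.creator_url = temp_sub_prov.creator_url
        AND image.provider = 'flickr'
""").fetchall()

# Step 2: update one row at a time, committing after each statement
# so that no single transaction touches many rows.
for foreign_id, sub_provider in selected_records:
    conn.execute(
        "UPDATE image SET source = ? "
        "WHERE provider = 'flickr' AND foreign_id = ?",
        (sub_provider, foreign_id),
    )
    conn.commit()

# Collect the final state as {foreign_id: source} for inspection.
updated = dict(conn.execute("SELECT foreign_id, source FROM image"))
```

Only the row whose creator URL appears in the temporary table changes its source; the other row keeps the default value.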
200+
To make this functionality available from the Airflow UI, I have added the Airflow DAG
201+
*flickr_sub_provider_update_workflow*.
202+
The changes in the *source* field after updating the image table in the database look as follows.
203+
204+
205+
id | provider | source before the update | source after the update
206+
:---: | :---: | :---: | :---:
207+
14369 | flickr | flickr | bio_diversity
208+
14372 | flickr | flickr | bio_diversity
209+
14375 | flickr | flickr | bio_diversity
210+
14378 | flickr | flickr | bio_diversity
211+
14382 | flickr | flickr | bio_diversity
212+
40784 | flickr | flickr | nasa
213+
47237 | flickr | flickr | nasa
214+
47242 | flickr | flickr | nasa
215+
47244 | flickr | flickr | nasa
216+
47245 | flickr | flickr | nasa
217+
218+
219+
For more information regarding the implementation, please refer to the following PR:
220+
https://github.com/creativecommons/cccatalog/pull/420
221+
222+
## Acknowledgement
223+
224+
I express my gratitude to Brent Moran and Anna Tumadóttir for their assistance with my first task in GSoC 2020 by
225+
helping me to filter the sub-providers of interest and conducting the necessary research.

content/contents.lr

Lines changed: 1 addition & 1 deletion
@@ -64,7 +64,7 @@ So if you're looking to integrate CC licenses or CC licensed works into your app
6464
</div>
6565
<div class="card-body" align="center">
6666
<p class="card-text text-center">Participate in one of our regular usability tests for CC tools that we're actively working on (we'll give you a gift card!)</p>
67-
<a href="/community/#mailing-lists" class="btn btn-sm btn-outline-primary">Sign up to test</a>
67+
<a href="/contributing-code/usability" class="btn btn-sm btn-outline-primary">Learn more</a>
6868
<a href="https://github.com/creativecommons/creativecommons.org/issues/new/choose" class="btn btn-sm btn-outline-primary">File an issue</a>
6969
</div>
7070
</div>
