| 1 | +title: Flickr Sub-provider Retrieval |
| 2 | +--- |
| 3 | +categories: |
| 4 | + |
| 5 | +cc-catalog |
| 6 | +gsoc |
| 7 | +gsoc-2020 |
| 8 | +--- |
| 9 | +author: charini |
| 10 | +--- |
| 11 | +series: gsoc-2020-cccatalog |
| 12 | +--- |
| 13 | +pub_date: 2020-06-24 |
| 14 | +--- |
| 15 | +body: |
| 16 | +## Introduction |
| 17 | +The Creative Commons (CC) licensed images made available via CC Search and CC Catalog API tools are retrieved from |
| 18 | +numerous sources (which we refer to as providers) such as Flickr and various museum collections. While the existing |
| 19 | +implementation of the CC Catalog tools supports filtering images by criteria such as image tags, the provider, and |
| 20 | +the license type, it does not support searching for images from particularly valuable sources within a provider |
| 21 | +(referred to as sub-providers). For example, NASA-related images are a valuable part of the Flickr collection, since |
| 22 | +they are used extensively by a large audience, especially for educational purposes. The aim of my first task in the |
| 23 | +GSoC project is to implement the required changes at the API script level and to the existing data in the database, so |
| 24 | +that filtering by certain important sub-providers becomes possible. |
| 25 | + |
| 26 | +While there are several providers, such as Flickr, Europeana, and Smithsonian, from which we need to extract |
| 27 | +sub-providers, the consensus was to focus on Flickr first, because it is already in production and because a |
| 28 | +substantial number of the images made available via CC Search come from Flickr. Thus, in this initial blog post, I will |
| 29 | +discuss how I addressed the requirement of sub-provider retrieval in Flickr by making the necessary changes in the |
| 30 | +[Creative Commons Catalog](https://github.com/creativecommons/cccatalog) repository. |
| 31 | + |
| 32 | +## Research |
| 33 | +The primary research involved in the Flickr sub-provider retrieval task was defining which entities to identify as |
| 34 | +sub-providers, and identifying how those sub-providers can be retrieved based on the image related information we |
| 35 | +retain. |
| 36 | + |
| 37 | +### Definition of a sub-provider |
| 38 | +It was decided that a sub-provider should be a collection of user accounts in Flickr, where this collection corresponded |
| 39 | +to a common entity, and the common entity would reflect the sub-provider. For example, since both Flickr user accounts |
| 40 | +*NASA HQ PHOTO* and *NASA Johnson* provide images related to NASA, we would represent the NASA sub-provider by those |
| 41 | +two (and other related) user accounts. |
| 42 | + |
| 43 | +The next challenge was to determine how to identify which collections of user accounts were important to a wider |
| 44 | +audience. The number of views per user account was an intuitive measure to rely on for this requirement. My supervisor |
| 45 | +Brent Moran executed a query on the existing CC database to obtain the 50 most popular user accounts in Flickr. A |
| 46 | +snippet of the query response is as follows: |
| 47 | + |
| 48 | +user_account_name | total_views |
| 49 | +--- | --- |
| 50 | +Apollo Image Gallery | 1216297208 |
| 51 | +BioDivLibrary | 625528813 |
| 52 | +manhhai | 445714729 |
| 53 | +Thomas Hawk | 300554527 |
| 54 | +Sangudo | 258177509 |
| 55 | +NASA Goddard Photo and Video | 225143949 |
| 56 | + |
| 57 | +Despite having a significant number of views, some of these user accounts did not appear to be worth being identified |
| 58 | +as belonging to a sub-provider due to their lack of educational importance. Thus, we manually curated this list to |
| 59 | +retain what we believed to be important to a wider audience. |
| 60 | + |
| 61 | +### Sub-provider identification |
| 62 | +In order to identify the sub-provider each image from Flickr belonged to, it was necessary to determine which field in |
| 63 | +the stored image data referred to the user account. Among the various information contained in an API response, only a |
| 64 | +selected set of fields is stored on the CC end, and it was important to use such stored data for the identification of |
| 65 | +sub-providers. We initially decided to rely on the user account name, which is given by the *ownername* field in |
| 66 | +the JSON response and stored as the *creator* in the CC database. However, we later realised that account names can |
| 67 | +change over time, so the name was not a reliable field for extracting the sub-provider. Another |
| 68 | +field from the JSON response that helped to uniquely identify a user account was the *owner* field, which acted like a |
| 69 | +unique user ID. Even though the *owner* value was not directly stored in the CC database, it was stored as part of the |
| 70 | +*creator URL* field, and fortunately, all creator URLs from Flickr consisted of a common prefix plus the *owner* value |
| 71 | +(the user id). Thus, we decided to use the *creator URL* value retained in the CC database for identifying sub-providers |
| 72 | +in Flickr. |
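
As a minimal sketch of this idea, the user ID can be recovered from a stored creator URL by stripping the common prefix. The helper name `extract_user_id` and the constant name are hypothetical and not taken from the cccatalog code; the prefix is the one Flickr creator URLs share.

```python
# Common prefix shared by Flickr creator URLs stored in the CC database.
FLICKR_CREATOR_URL_PREFIX = 'https://www.flickr.com/photos/'


def extract_user_id(creator_url):
    """Return the Flickr user ID embedded in a creator URL, or None."""
    if not creator_url or not creator_url.startswith(FLICKR_CREATOR_URL_PREFIX):
        return None
    # Everything after the prefix is the user ID; drop a trailing slash if present.
    user_id = creator_url[len(FLICKR_CREATOR_URL_PREFIX):].strip('/')
    return user_id or None


print(extract_user_id('https://www.flickr.com/photos/61021753@N02'))
```

URLs that do not start with the expected prefix yield `None`, so a caller can fall back to the default provider.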
| 73 | + |
| 74 | + |
| 75 | +## Implementation |
| 76 | +There are two levels at which sub-provider retrieval needs to be supported. The first is the API scripts, with which
| 77 | +we pull data from the different providers to keep the CC collections up to date. The second is the CC database level, |
| 78 | +where the existing data needs to be updated so that it reflects the sub-providers in the same way as newly added image |
| 79 | +information. |
| 80 | + |
| 81 | +The following sections explain how we represented the sub-provider information in the implementation, the changes made |
| 82 | +at Flickr API script level and the database update logic to support sub-provider retrieval. |
| 83 | + |
| 84 | +### Representing the sub-provider information |
| 85 | +As previously explained, we define a sub-provider as a collection of user accounts, and it was identified that the |
| 86 | +unique user ID returned in the Flickr JSON response (referred to as the *owner*) was a reliable field for uniquely |
| 87 | +identifying each user account. For the time being, we focused on sub-providers NASA, SpaceX, and the Biodiversity |
| 88 | +Heritage Library (BioDivLibrary) based on their considerable importance to the community. Using the top six NASA related |
| 89 | +user accounts, the 'Official SpaceX Photos' user account, and the 'BioDivLibrary' user account as filtered by Brent's |
| 90 | +query, we identified the corresponding user IDs (content of the *owner* field) using the |
| 91 | +**flickr.people.findByUsername** method made available in the Flickr API. The mapping between the sub-provider and the |
| 92 | +corresponding user IDs was stored in a dictionary as follows. |
| 93 | + |
| 94 | +```python |
| 95 | +FLICKR_SUB_PROVIDERS = { |
| 96 | + 'nasa': { |
| 97 | + '24662369@N07', # NASA Goddard Photo and Video |
| 98 | + '35067687@N04', # NASA HQ PHOTO |
| 99 | + '29988733@N04', # NASA Johnson |
| 100 | + '28634332@N05', # NASA's Marshall Space Flight Center |
| 101 | + '108488366@N07', # NASAKennedy |
| 102 | + '136485307@N06' # Apollo Image Gallery |
| 103 | + }, |
| 104 | + 'bio_diversity': { |
| 105 | + '61021753@N02' # BioDivLibrary |
| 106 | + }, |
| 107 | + 'spacex': { |
| 108 | + '130608600@N05' # Official SpaceX Photos |
| 109 | + } |
| 110 | +} |
| 111 | +``` |
| 112 | + |
| 113 | +Since this information was required both at the API script level and the database level to retrieve sub-providers, we |
| 114 | +stored it in a common file accessible from both levels. |
| 115 | + |
| 116 | +The next challenge was to identify how to reflect the sub-provider of each image using the existing database schema. |
| 117 | +The database has two relevant fields, *provider* and *source*. The *provider* reflects the main source |
| 118 | +from which the images are retrieved, which happens to be 'Flickr' in this scenario. The *source* field reflects an |
| 119 | +organisation or entity that has published the photos using 'Flickr' in this instance (or some other site that we |
| 120 | +recognise as a *provider*). The *source* field was previously not utilised and was simply set to the value of the |
| 121 | +*provider* in the Flickr API script. Based on internal discussions, it was decided that the *source* field was to be |
| 122 | +used for reflecting the sub-provider, if the corresponding image belonged to any of the user accounts contained in our |
| 123 | +dictionary *FLICKR_SUB_PROVIDERS*. Otherwise the *source* was set to the default *provider* value 'Flickr'. |
| 124 | + |
| 125 | +### Sub-provider retrieval at API script level |
| 126 | +Retrieving the sub-provider from the Flickr API script was fairly straightforward. Since the complete JSON response was |
| 127 | +available at the API script level, we did not have to worry about retrieving the user ID (*owner* value) from the |
| 128 | +*creator URL* field in our data. Rather, we simply get the owner value from the API response, and try to search for it |
| 129 | +in the *FLICKR_SUB_PROVIDERS* dictionary as follows. |
| 130 | + |
| 131 | +```python |
| 132 | +owner = image_data.get('owner').strip() |
| 133 | +source = next((s for s in FLICKR_SUB_PROVIDERS if owner in FLICKR_SUB_PROVIDERS[s]), 'Flickr') |
| 134 | +``` |
| 135 | + |
| 136 | +Since the collection of user IDs corresponding to each sub-provider is represented as a set, the membership check for |
| 137 | +each sub-provider takes O(1) time on average, and therefore the total time complexity is linear in the number of |
| 138 | +sub-providers (that is, O(n) for n sub-providers). Since the number of sub-providers of interest is small (currently |
| 139 | +3), this search logic is quite efficient. |
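
The lookup can be sketched as a small self-contained function. This is an illustration rather than the code in the repository: the function names `get_source` and `get_source_indexed`, the guard against a missing *owner* value, and the inverted index are my own assumptions, and the dictionary is a subset of the mapping shown earlier.

```python
# Subset of the sub-provider mapping, repeated so the sketch runs on its own.
FLICKR_SUB_PROVIDERS = {
    'nasa': {'24662369@N07', '35067687@N04'},
    'bio_diversity': {'61021753@N02'},
    'spacex': {'130608600@N05'},
}

# Hypothetical inverted index (user ID -> sub-provider) for a single lookup.
USER_ID_TO_SUB_PROVIDER = {
    user_id: sub_provider
    for sub_provider, user_ids in FLICKR_SUB_PROVIDERS.items()
    for user_id in user_ids
}


def get_source(image_data, default_provider='Flickr'):
    """Check each sub-provider's set in turn: O(n) for n sub-providers."""
    # Guard against a missing or empty 'owner' value before stripping.
    owner = (image_data.get('owner') or '').strip()
    for sub_provider, user_ids in FLICKR_SUB_PROVIDERS.items():
        if owner in user_ids:  # average O(1) set membership check
            return sub_provider
    return default_provider


def get_source_indexed(image_data, default_provider='Flickr'):
    """Same result through one dictionary lookup on the inverted index."""
    owner = (image_data.get('owner') or '').strip()
    return USER_ID_TO_SUB_PROVIDER.get(owner, default_provider)
```

With only three sub-providers the two variants are practically equivalent; the inverted index would only start to matter if the list of sub-providers grew considerably.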
| 140 | + |
| 141 | +Once this logic has determined whether the *source* field should be a sub-provider value or the default 'Flickr' |
| 142 | +value, we set the *source* value in the image store accordingly. |
| 143 | + |
| 144 | +### Sub-provider update at the database level |
| 145 | +When updating sub-providers at the database level, we need to rely on the creator URL field to obtain the user ID of |
| 146 | +each image. The creator URL is of the following form. |
| 147 | + |
| 148 | +'https://www.flickr.com/photos/' + *User ID* |
| 149 | + |
| 150 | +To automate the process of updating the database to reflect sub-providers, I added the necessary SQL |
| 151 | +queries and made them accessible via the Apache Airflow UI. The database update logic is as follows. |
| 152 | + |
| 153 | +As the first step, I create a temporary table and populate it with the sub-provider values and the corresponding |
| 154 | +creator URLs. This is done by iterating through the sub-provider, user ID value pairs in the *FLICKR_SUB_PROVIDERS* |
| 155 | +dictionary, and concatenating the user ID with the prefix 'https://www.flickr.com/photos/' to obtain the creator URL. |
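
The rows for the temporary table can be built roughly as follows. This is a sketch under assumptions: the function name `build_temp_table_rows` and the constant name are hypothetical, and the dictionary is a subset of the mapping shown earlier.

```python
FLICKR_CREATOR_URL_PREFIX = 'https://www.flickr.com/photos/'

# Subset of the sub-provider mapping, repeated so the sketch runs on its own.
FLICKR_SUB_PROVIDERS = {
    'nasa': {'24662369@N07', '35067687@N04'},
    'bio_diversity': {'61021753@N02'},
}


def build_temp_table_rows(sub_providers, prefix=FLICKR_CREATOR_URL_PREFIX):
    """Return (creator_url, sub_provider) pairs, one per user account."""
    return [
        (prefix + user_id, sub_provider)
        for sub_provider, user_ids in sub_providers.items()
        for user_id in sorted(user_ids)  # sorted only to make the output stable
    ]


for row in build_temp_table_rows(FLICKR_SUB_PROVIDERS):
    print(row)
```

Each pair corresponds to one row of the temporary table, which is then joined against the image table on the creator URL.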
| 156 | + |
| 157 | +The initial plan was to then join the CC image table (where all the image related information is stored) |
| 158 | +with the temporary table on the condition that the creator URL from the image table matches that of the temporary table. |
| 159 | +This query, which updates the *source* of every matching row in the image table in a single statement, looks as |
| 160 | +follows. |
| 161 | + |
| 162 | +```sql |
| 163 | +UPDATE {image_table} |
| 164 | +SET {col.SOURCE} = public.{temp_table}.{col.SUB_PROVIDER} |
| 165 | +FROM public.{temp_table} |
| 166 | +WHERE |
| 167 | +{image_table}.{col.CREATOR_URL} = public.{temp_table}.{col.CREATOR_URL}; |
| 168 | +``` |
| 169 | + |
| 170 | +However, a major concern with this query, as my supervisor Brent Moran pointed out, was that it locked all the rows |
| 171 | +which matched the 'WHERE' clause at once. Given the volume of Flickr data in the CC image |
| 172 | +table, this meant that the above query would lock millions of rows, thus hindering the execution of other queries on |
| 173 | +the image table. To mitigate this issue, we decided to change the approach as follows: we perform a 'SELECT' |
| 174 | +query on the rows to be updated by joining the previously created temporary table with the CC image table (a 'SELECT' |
| 175 | +query does not lock the table), and then iterate row by row over the query result to set the *source* value in the |
| 176 | +image table to the sub-provider value. |
| 177 | + |
| 178 | +```python |
| 179 | +SELECT |
| 180 | +{col.FOREIGN_ID} AS foreign_id, |
| 181 | +public.{temp_table}.{col.PROVIDER} AS sub_provider |
| 182 | +FROM {image_table} |
| 183 | +INNER JOIN public.{temp_table} |
| 184 | +ON |
| 185 | +{image_table}.{col.CREATOR_URL} = public.{temp_table}.{col.CREATOR_URL} |
| 186 | +AND |
| 187 | +{image_table}.{col.PROVIDER} = 'Flickr'; |
| 188 | + |
| 189 | +# Let us refer to the result produced from the SELECT query as 'selected_records' |
| 190 | + |
| 191 | +for (foreign_id, sub_provider) in selected_records: |
| 192 | + UPDATE {image_table} |
| 193 | + SET {col.SOURCE} = '{sub_provider}' |
| 194 | + WHERE |
| 195 | + {image_table}.{col.PROVIDER} = 'Flickr' |
| 196 | + AND |
| 197 | + MD5({image_table}.{col.FOREIGN_ID}) = MD5('{foreign_id}'); |
| 198 | +``` |
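
To make the select-then-update pattern concrete, here is a runnable sketch that uses an in-memory SQLite database as a stand-in for the PostgreSQL image table. The table names, column names, and sample rows are invented for illustration, and the `MD5` comparison from the query above is left out because SQLite has no built-in `MD5` function.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE image (
        foreign_id TEXT, provider TEXT, source TEXT, creator_url TEXT
    );
    CREATE TABLE temp_sub_prov (creator_url TEXT, sub_provider TEXT);
''')
prefix = 'https://www.flickr.com/photos/'
conn.executemany('INSERT INTO image VALUES (?, ?, ?, ?)', [
    ('1', 'flickr', 'flickr', prefix + '61021753@N02'),
    ('2', 'flickr', 'flickr', prefix + '24662369@N07'),
    ('3', 'flickr', 'flickr', prefix + '00000000@N00'),  # not a sub-provider
])
conn.executemany('INSERT INTO temp_sub_prov VALUES (?, ?)', [
    (prefix + '61021753@N02', 'bio_diversity'),
    (prefix + '24662369@N07', 'nasa'),
])
conn.commit()

# Step 1: read the rows that need updating without modifying anything.
selected_records = conn.execute('''
    SELECT image.foreign_id, temp_sub_prov.sub_provider
    FROM image
    INNER JOIN temp_sub_prov
        ON image.creator_url = temp_sub_prov.creator_url
        AND image.provider = 'flickr'
''').fetchall()

# Step 2: update one row at a time, committing after each statement so that
# no single transaction touches a large number of rows.
for foreign_id, sub_provider in selected_records:
    conn.execute(
        "UPDATE image SET source = ? WHERE provider = 'flickr' AND foreign_id = ?",
        (sub_provider, foreign_id),
    )
    conn.commit()

sources = dict(conn.execute('SELECT foreign_id, source FROM image').fetchall())
print(sources)
```

Passing the values as query parameters, rather than formatting them into the SQL string, is a choice made for this sketch and is not necessarily how the repository builds its queries.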
| 199 | + |
| 200 | +To make this functionality available from the Airflow UI, I have added the Airflow DAG |
| 201 | +*flickr_sub_provider_update_workflow*. |
| 202 | +After updating the image table in the database, the changes in the *source* field look as follows. |
| 203 | + |
| 204 | + |
| 205 | +id | provider | source before the update | source after the update |
| 206 | +:---: | :---: | :---: | :---: |
| 207 | +14369 | flickr | flickr | bio_diversity |
| 208 | +14372 | flickr | flickr | bio_diversity |
| 209 | +14375 | flickr | flickr | bio_diversity |
| 210 | +14378 | flickr | flickr | bio_diversity |
| 211 | +14382 | flickr | flickr | bio_diversity |
| 212 | +40784 | flickr | flickr | nasa |
| 213 | +47237 | flickr | flickr | nasa |
| 214 | +47242 | flickr | flickr | nasa |
| 215 | +47244 | flickr | flickr | nasa |
| 216 | +47245 | flickr | flickr | nasa |
| 217 | + |
| 218 | + |
| 219 | +For more information regarding the implementation, please refer to the following PR: |
| 220 | +https://github.com/creativecommons/cccatalog/pull/420 |
| 221 | + |
| 222 | +## Acknowledgement |
| 223 | + |
| 224 | +I express my gratitude to Brent Moran and Anna Tumadóttir for their assistance with my first task in GSoC 2020 by |
| 225 | +helping me to filter the sub-providers of interest and conducting the necessary research. |