Commit 9040a6d

Merge pull request creativecommons#285 from kss682/blog-post-1
Science Museum implementation blog post
3 files changed, +145 −0 lines changed

@@ -0,0 +1,10 @@
username: kss682
---
name: K S Srinidhi Krishna
---
md5_hashed_email: 7E71C293A442A2CF434BFC244BD5F184
---
about:
Srinidhi Krishna is a computer science undergraduate student from India and will be interning with Creative Commons during the summer.
He is working on [cccatalog](https://github.com/creativecommons/cccatalog) as a part of GSoC 2020.
He is `@K S Srinidhi Krishna` on Slack.
@@ -0,0 +1,134 @@
title: Science Museum provider implementation
---
categories:

cc-catalog
gsoc
gsoc-2020
---
author: srinidhi
---
series: gsoc-2020-cccatalog
---
pub_date: 2020-06-10
---
body:

## Introduction
The CC Catalog project is responsible for collecting CC-licensed images available on the web. These images are hosted by different sources, and the sources that provide the images and their metadata are called providers. Currently, images are collected from providers using two methods: Common Crawl and API-based crawl. Common Crawl is an open repository of web crawl data, and we use that data to get the necessary image metadata for a provider ([more information](https://commoncrawl.org/the-data/get-started/)). The API-based crawl is implemented using the API endpoints maintained by the providers. The main problem with Common Crawl is that we don't have control over the data they crawl, which sometimes results in poor data quality, whereas with an API-based crawl we have access to all the information the provider makes available. An API-based crawl is also better when we need to update image metadata at regular intervals.

As a part of the internship, I will be working on moving providers from Common Crawl to the API-based crawl, as well as integrating new providers into the API crawl. I will be starting with the Science Museum provider.

## Science Museum
The Science Museum is a provider with around 80,000 CC-licensed images; currently, Science Museum data is ingested from Common Crawl. It is one of the providers where our data is of poor quality and needs to be improved. This is done by moving the Science Museum to an API-based crawl.

## API research
We want to index metadata using their open API [endpoint](https://collection.sciencemuseumgroup.org.uk/search/has_image/image_license).
However, before the implementation we have to ensure that the API provides the necessary content and that there is a systematic way to get it.
The first step is to take an object from their collection and check certain criteria.

[sample object](https://collection.sciencemuseumgroup.org.uk/api/objects/co8005638)

The criteria are:
- parameters available for the API
- object landing URL (the frontend link of the object the image is associated with)
- image URL (the URL of the image file)
- CC license associated with the image
- creator, title, and other metadata info

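As a quick manual check, the sample object can be fetched and its fields inspected before writing any real code. Here is a minimal exploratory sketch; the ```data``` wrapper around the JSON payload is an assumption about the response shape, not something confirmed in this post:

```
import requests

# fetch the sample object linked above and inspect its fields
resp = requests.get(
    "https://collection.sciencemuseumgroup.org.uk/api/objects/co8005638",
    headers={"Accept": "application/json"},
)
obj_ = resp.json().get("data", {})

# list the top-level fields, then drill into the ones matching the criteria
print(sorted(obj_.keys()))
```
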
Once the above checks have been made, we need to find a way to get all the objects; this could be by paging through the records, partitioning using the parameters, and so on. Since the API has a ```page[number]``` parameter, paging is an appropriate choice: with a maximum page size of 100, it would take around 800 pages to get all the objects. However, the API doesn't allow paging through a large number of results; the maximum number of pages for the Science Museum is 50. This means we would get only 5,000 objects and around 17,000 images.

[API page-50](https://collection.sciencemuseumgroup.org.uk/search/image_license?page[size]=100&page[number]=50)

[API page-51](https://collection.sciencemuseumgroup.org.uk/search/image_license?page[size]=100&page[number]=51)

So we need to find a way to divide the collection into subsets such that each subset has at most 5,000 objects.
Luckily, the API has another pair of parameters, ```date[from]``` and ```date[to]```, which represent the time period of the object.
Querying the API over different time periods, while ensuring that the records in each period don't exceed 5,000, solves the problem. Starting from year 0 to year 2020, a suitable set of year ranges was chosen by trial and error:

```
YEAR_RANGE = [
    (0, 1500),
    (1500, 1750),
    (1750, 1825),
    (1825, 1850),
    (1850, 1875),
    (1875, 1900),
    (1900, 1915),
    (1915, 1940),
    (1940, 1965),
    (1965, 1990),
    (1990, 2020)
]
```
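To make the partitioning concrete, here is a minimal sketch of how the year ranges combine with the paging parameters. The endpoint and parameter names come from the API links above; ```get_batches```, ```process_batch```, and the ```data``` key in the JSON response are illustrative assumptions, not the actual catalog code:

```
import requests

ENDPOINT = "https://collection.sciencemuseumgroup.org.uk/search/image_license"
PAGE_SIZE = 100
MAX_PAGES = 50  # the API stops serving results past page 50


def get_batches(year_from, year_to):
    # page through a single year range until an empty page or the page cap
    for page in range(1, MAX_PAGES + 1):
        response = requests.get(
            ENDPOINT,
            params={
                "date[from]": year_from,
                "date[to]": year_to,
                "page[size]": PAGE_SIZE,
                "page[number]": page,
            },
            headers={"Accept": "application/json"},
        )
        batch_data = response.json().get("data", [])
        if not batch_data:
            break
        yield batch_data


for year_from, year_to in YEAR_RANGE:
    for batch_data in get_batches(year_from, year_to):
        process_batch(batch_data)  # hand each batch to the object data handler
```
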
With this we have a method to ingest the desired records. But before writing the script, we need to know the different licenses provided by the API, and we need to figure out a consistent way to identify which license and version are attached to each object. To do this, we ran a test script to get counts of objects under different licenses.

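The test script itself isn't shown in the post, but the idea can be sketched by reusing ```get_batches``` from above; ```extract_license``` is a hypothetical helper standing in for whatever JSON path holds the license string:

```
from collections import Counter

license_counts = Counter()
for year_from, year_to in YEAR_RANGE:
    for batch_data in get_batches(year_from, year_to):
        for obj_ in batch_data:
            # extract_license is hypothetical: it would pull the license
            # string (e.g. "CC-BY-NC-SA 4.0") out of the object's JSON
            license_counts[extract_license(obj_)] += 1

for license_version, count in license_counts.most_common():
    print(license_version, count)
```
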
The results are:

```
+-----------------+----------+
| license_version | count(1) |
+-----------------+----------+
| CC-BY-NC-ND 2.0 |      210 |
| CC-BY-NC-ND 4.0 |     2376 |
| CC-BY-NC-SA 2.0 |        1 |
| CC-BY-NC-SA 4.0 |    61694 |
+-----------------+----------+
```

Since the licenses and their versions are confirmed, we can start the implementation.

## Implementation
The implementation is quite simple in nature: we loop through the ```YEAR_RANGE```, get all the records for each period, and pass them on to an object data handler method that extracts the necessary details from each record and stores them in the ```ImageStore``` instance. ImageStore is a class that stores image information from the provider; it holds the information in a buffer and inserts it into a TSV file when the buffer reaches a threshold limit. Due to overlapping date ranges, the metadata for some objects is collected multiple times. So, we keep track of each record/object's id in a global variable ```RECORD_IDS = []```.

Within the object data handler method, before collecting details we check whether the ```id``` already exists in ```RECORD_IDS```. If it does, we move on to the next record.

```
for obj_ in batch_data:
    id_ = obj_.get("id")
    if id_ in RECORD_IDS:
        # already collected under an overlapping year range; skip it
        continue
    RECORD_IDS.append(id_)
```

```id_``` is the object id, and we cannot use this value as the foreign identifier. The reason is that an object can have multiple images associated with it, and the object id cannot identify an image uniquely, so we must use an image id that is unique for each image. Currently the image id is taken from ```multimedia```, a field in the JSON response that lists the object's images and their metadata; for each image entry in multimedia, the foreign id is in ```admin.uid```.

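In outline, the per-object handling could look like the sketch below. The ```multimedia``` and ```admin.uid``` fields are the ones named above; ```handle_object```, ```_get_image_url```, and the exact ```ImageStore``` keyword arguments are simplified stand-ins for the real script:

```
def handle_object(obj_, landing_url, license_, license_version):
    # each image attached to the object is listed under "multimedia",
    # and the image-level foreign identifier lives at admin.uid
    for image_data in obj_.get("multimedia", []):
        foreign_id = image_data.get("admin", {}).get("uid")
        if foreign_id is None:
            continue
        image_store.add_item(
            foreign_identifier=foreign_id,
            foreign_landing_url=landing_url,
            image_url=_get_image_url(image_data),  # hypothetical helper
            license_=license_,
            license_version=license_version,
        )
```
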
The implementation can be found [here](https://github.com/creativecommons/cccatalog/blob/master/src/cc_catalog_airflow/dags/provider_api_scripts/science_museum.py).
### Results:
Running the scripts, we get:
- Number of records received: ```35584```
- Number of images collected: ```62497```

The problem with the current implementation is that records with no date would be missed.

The Science Museum provider is the first provider I worked on as a part of the internship, and I thank my mentor Brent Moran for his help.

### Additional Details:
- [research work](https://github.com/creativecommons/cccatalog/issues/302)
- [implementation](https://github.com/creativecommons/cccatalog/pull/400)
@@ -0,0 +1 @@
name: GSoC 2020: CC catalog
