series: gsoc-2020-cccatalog
pub_date: 2020-05-29
---
body:

## Introduction
The CC Catalog project is responsible for collecting CC licensed images available on the web. CC licensed images are hosted by different
sources; the sources that provide the images and their metadata are called providers. Currently, images are collected from providers using two methods:
Common Crawl and API based crawl. Common Crawl is an open repository of web crawled data, and we use that data to get the necessary image metadata
for a provider ([more information](https://commoncrawl.org/the-data/get-started/)). API crawl is implemented using the API endpoint maintained
by the provider. The main problem with Common Crawl is that we don't have control over the data they crawl, and this sometimes results in poor
data quality, whereas with an API based crawl we have access to all the information available. An API based crawl is also better when we need
to update image metadata at regular intervals.

As a part of the internship, I will be working on moving providers from Common Crawl to API based crawl as well as integrating new providers
into the API crawl. I will be starting with the Science Museum provider.

## Science Museum
Science Museum is a provider with around 80,000 CC licensed images; currently, Science Museum data is ingested from Common Crawl.
It is one of the providers where our data is of poor quality, and there is a need to improve it. This is done by moving
Science Museum to an API based crawl.

## API research
We want to index metadata using their open API [endpoint](https://collection.sciencemuseumgroup.org.uk/search/has_image/image_license).
However, before the implementation we have to ensure that the API provides the necessary content and that there is a systematic way to get it.
The first step is to take an object from their collection and check certain criteria.

[sample object](https://collection.sciencemuseumgroup.org.uk/api/objects/co8005638)

The criteria are:
- parameters available for the API
- Object landing url (frontend link of the object the image is associated with)
- Image url (the url link of the image)
- CC license associated with the image
- creator, title and other metadata info

Once the above checks have been made, we need to find a way to get all the objects; this could be by paging through the records,
partitioning using the parameters, etc. Since their API has a ```page[number]``` parameter, paging would be an appropriate choice. With the
max page size of 100, it would require around 800 pages to get all the objects, but they don't allow paging through a large number of results:
the max number of pages for Science Museum is 50. This would mean we would get only 5000 objects and around 17,000 images.

[API page-50](https://collection.sciencemuseumgroup.org.uk/search/image_license?page[size]=100&page[number]=50)

[API page-51](https://collection.sciencemuseumgroup.org.uk/search/image_license?page[size]=100&page[number]=51)
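
As a rough sketch (not the actual provider script), the paged requests could be built like this; the endpoint and parameter names come from the links above, while the helper itself is made up for illustration:

```python
from urllib.parse import urlencode

# Endpoint and parameters as seen in the links above; the helper is a
# made-up sketch, not the real provider script.
ENDPOINT = "https://collection.sciencemuseumgroup.org.uk/search/image_license"
PAGE_SIZE = 100  # maximum page size accepted by the API
MAX_PAGES = 50   # the API stops serving results after page 50

def page_url(page_number, page_size=PAGE_SIZE):
    # Build the URL for one page of results,
    # e.g. ...?page[size]=100&page[number]=50 (brackets get percent-encoded).
    query = urlencode({"page[size]": page_size, "page[number]": page_number})
    return f"{ENDPOINT}?{query}"

# Paging alone therefore caps the crawl at PAGE_SIZE * MAX_PAGES = 5000 objects.
```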

So we need to find a way to divide the collection into subsets such that each subset has at most 5000 objects.
Luckily, the API has another pair of parameters, ```date[from]``` and ```date[to]```, which represent the time period of the object.
Querying the API over different time periods, while ensuring that the records in each period don't exceed 5000, solves the problem;
starting from year 0 to year 2020, a suitable year range was chosen by trial and error.
```
YEAR_RANGE = [
    ...
]
```
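
With hypothetical year ranges (the real ones were tuned by trial and error so that no range exceeds 5000 objects), the date-partitioned queries could be sketched as:

```python
from urllib.parse import urlencode

ENDPOINT = "https://collection.sciencemuseumgroup.org.uk/search/image_license"

# Hypothetical ranges for illustration only; the real YEAR_RANGE was chosen
# by trial and error so that no range returns more than 5000 objects.
YEAR_RANGE = [(0, 1500), (1500, 1800), (1800, 1900), (1900, 2020)]

def range_query(from_year, to_year, page_number=1, page_size=100):
    # Combine the date filter with paging into a single request URL.
    return ENDPOINT + "?" + urlencode({
        "date[from]": from_year,
        "date[to]": to_year,
        "page[size]": page_size,
        "page[number]": page_number,
    })

# One URL per year range; each range is then paged through independently.
urls = [range_query(start, end) for start, end in YEAR_RANGE]
```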

With this we have a method to ingest the desired records, but before writing the script we need to know the different licenses
provided by the API. We need to figure out a consistent way to identify which license and version are attached to each object.
To do this, we ran a test script to get counts of objects under different licenses.

The results are:

```
+-----------------+----------+
...
```

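
The test script itself is not shown in the post; a minimal sketch of such a tally, assuming each record exposes a flat license string (the exact JSON path in the API response is an assumption), might look like:

```python
from collections import Counter

def tally_licenses(objects, get_license=lambda obj: obj.get("license")):
    # Count objects per license string. The accessor is an assumption: the
    # exact location of the license in the API response is not shown here.
    counts = Counter()
    for obj in objects:
        counts[get_license(obj) or "unknown"] += 1
    return counts
```

Grouping the counts this way makes it easy to spot unexpected license strings before mapping them onto our supported licenses and versions.
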
Since the licenses and their versions are confirmed, we can start the implementation.

## Implementation
The implementation is quite simple in nature: we loop through the ```YEAR_RANGE```, get all the records for each period, and
pass them on to an object data handler method that extracts the necessary details from each record and stores them in the ```ImageStore```
instance. ImageStore is a class that stores image information from the provider; it keeps the information in a buffer and writes it to a TSV file
when the buffer reaches a threshold limit. Because the date ranges overlap, the metadata for some objects is collected multiple times,
so we keep track of each record/object's id in a global variable ```RECORD_IDS = []```.

Within the object data handler method, before collecting details we check if the ```id``` already exists in ```RECORD_IDS```.
If it exists, we move on to the next record.

```
for obj_ in batch_data:
    ...
```
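
Filled in with hedged assumptions (the handler name and everything beyond the ```id``` check are illustrative), the dedup step could look like:

```python
RECORD_IDS = []  # global list of object ids that have already been handled

def handle_object(obj_):
    # Skip objects already seen in an earlier, overlapping year range.
    id_ = obj_.get("id")
    if id_ in RECORD_IDS:
        return None
    RECORD_IDS.append(id_)
    # ... extract metadata here and add it to the ImageStore instance ...
    return id_
```

A list membership check is O(n); a set would make it O(1), though the post describes a list.
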

```id_``` is the object id, and we cannot use this value as the foreign identifier. The reason is that an object can
have multiple images associated with it, and using the object id we cannot identify an image uniquely, so we must use an image id that is unique
for each image. Currently, the image id is taken from ```multimedia```, a field in the json response that lists the multiple
images and their metadata; for each image entry in multimedia, the foreign id is in ```admin.uid```.
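
As a sketch (the surrounding JSON layout is assumed rather than taken from the API docs), extracting the per-image foreign identifiers could look like:

```python
def image_foreign_ids(obj_):
    # Pull the per-image foreign identifier (admin.uid) from each entry in
    # the object's "multimedia" list; the exact nesting is an assumption.
    ids = []
    for media in obj_.get("multimedia", []):
        uid = media.get("admin", {}).get("uid")
        if uid is not None:
            ids.append(uid)
    return ids
```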

The implementation can be found [here](https://github.com/creativecommons/cccatalog/blob/master/src/cc_catalog_airflow/dags/provider_api_scripts/science_museum.py).

### Results:
Running the scripts we get:
