series: gsoc-2020-cccatalog
---
pub_date: 2020-05-29
---
body:
16
+
## Introduction
17
+
CC catalog project is responsible for collecting CC licensed images available in the web, CC licensed images are hosted by different
18
+
sources, these sources that provide the images and its metadata are called providers. Currently, images are collected from providers using two methods
19
+
Common Crawl and API based crawl. Common Crawl data is an open repository of web crawled data and we use that data to get the necessary image metadata
20
+
for that provider [more information](https://commoncrawl.org/the-data/get-started/). API crawl is implemented using the API endpoint maintained
21
+
by the providers. The main problem with Common Crawl is that we don't have control over the data they crawl, and this sometimes results poor
22
+
data quality whereas with API based crawl we have access to the information available. API based crawl is better when we need to update image
23
+
metadata and reqular intervals.

As a part of the internship, I will be working on moving providers from Common Crawl to API based crawl, as well as integrating new providers into the API crawl. I will be starting with the Science Museum provider.

## Science Museum

Science Museum is a provider with around 80,000 CC licensed images. Currently, Science Museum data is ingested from Common Crawl. It is one of the providers whose data is of poor quality, and there is a need to improve it. This is done by moving Science Museum to an API based crawl.

## API research

We want to index metadata using their open API [endpoint](https://collection.sciencemuseumgroup.org.uk/search/has_image/image_license). However, before the implementation we have to ensure that the API provides the necessary content and that there is a systematic way to get it. The first step is to take an object from their collection and check certain criteria.
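
For instance, a quick probe like the sketch below can pull a single object so its fields can be inspected. The ```Accept``` header, the ```page[size]``` parameter, and the ```data``` key in the response are assumptions made for illustration, not confirmed details of the API:

```
import requests

# Fetch one object from the search endpoint and inspect the fields it
# exposes. The query parameters here are illustrative assumptions.
ENDPOINT = (
    "https://collection.sciencemuseumgroup.org.uk/"
    "search/has_image/image_license"
)

resp = requests.get(
    ENDPOINT,
    headers={"Accept": "application/json"},
    params={"page[size]": 1},
)
resp.raise_for_status()

record = resp.json()["data"][0]
print(sorted(record.keys()))  # check for image url, license, and metadata
```
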
So we need to find a way to divide the collection into subsets such that each subset has less than or equal to 5000 objects.
Luckily, the API has another pair of parameters, ```date[from]``` and ```date[to]```, which represent the time period of the object. Querying the API over different time periods, while ensuring that the records in each period don't exceed 5000, solves the problem. Starting from year 0 and going up to year 2020, a suitable set of year ranges was chosen by trial and error.
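
The chosen ranges end up as a list of year pairs, something like the sketch below (the boundaries shown are purely illustrative, not the actual values used in the script):

```
# Illustrative boundaries only: the real ranges were tuned by trial and
# error so that each one stays at or under 5000 records.
YEAR_RANGE = [
    (0, 1500),
    (1500, 1800),
    (1800, 1850),
    (1850, 1900),
    (1900, 1950),
    (1950, 2020),
]
```
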
With this we have a method to ingest the desired records, but before writing the script we need to know the different licenses provided by the API, and we need to figure out a consistent way to identify which license and version are attached to each object. To do this, we ran a test script to get counts of the objects under each license, which gave us the full set of licenses and versions in use.
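
A sketch of what such a test script could look like is below. The path to the license string inside each record (```attributes.legal.rights```) is a guess for illustration and would be confirmed against a real API response:

```
from collections import Counter

import requests

ENDPOINT = (
    "https://collection.sciencemuseumgroup.org.uk/"
    "search/has_image/image_license"
)

def count_licenses(year_ranges):
    """Tally the license string of every record in each year range."""
    counts = Counter()
    for from_year, to_year in year_ranges:
        resp = requests.get(
            ENDPOINT,
            headers={"Accept": "application/json"},
            params={"date[from]": from_year, "date[to]": to_year},
        )
        resp.raise_for_status()
        for record in resp.json().get("data", []):
            # Assumed location of the license string in each record.
            legal = record.get("attributes", {}).get("legal", {})
            counts[legal.get("rights")] += 1
    return counts

print(count_licenses(YEAR_RANGE))
```
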
Since the licenses and their versions are confirmed, we can start the implementation.

## Implementation

The implementation is quite simple in nature: we loop through the ```YEAR_RANGE```, get all the records for each period, and pass them on to an object data handler method that extracts the necessary details from each record and stores them in the ```ImageStore``` instance. ```ImageStore``` is a class that stores image information from the provider; it keeps the information in a buffer and inserts it into a TSV file when the buffer reaches a threshold limit. Due to overlapping date ranges, the metadata for some objects is collected multiple times. So, we keep track of the record/object's id in a global variable ```RECORD_IDS = []```.
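
As a rough illustration of that buffering behavior, a simplified stand-in (not the actual ```ImageStore``` class, whose real interface lives in the catalog codebase) could look like this:

```
import csv

class BufferedTsvStore:
    """Simplified stand-in for ImageStore: buffer rows in memory and
    flush them to a TSV file once a threshold is reached."""

    def __init__(self, path, threshold=100):
        self.path = path
        self.threshold = threshold
        self.buffer = []

    def add_item(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.threshold:
            self.flush()

    def flush(self):
        # Append buffered rows as tab-separated values, then clear.
        with open(self.path, "a", newline="") as f:
            csv.writer(f, delimiter="\t").writerows(self.buffer)
        self.buffer = []
```
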
Within the object data handler method, before collecting details, we check whether the ```id``` already exists in ```RECORD_IDS```. If it exists, we move on to the next record.

```
for obj_ in batch_data:
    id_ = obj_.get("id")
    # skip objects whose id has already been processed
    if id_ in RECORD_IDS:
        continue
    RECORD_IDS.append(id_)
```

```id_``` is the object id, and we cannot use this value as the foreign identifier. The reason is that an object can have multiple images attached to it, and using the object id we cannot identify an image uniquely, so we must use an image id that is unique for each image. Currently, the image id is taken from ```multimedia```, a field in the JSON response that lists the object's images and their metadata; for each image entry in multimedia, the foreign id is in ```admin.uid```.
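
For example, collecting the image identifiers for one object might look like the sketch below; only ```multimedia``` and ```admin.uid``` come from the API as described above, the rest is illustrative:

```
def image_foreign_ids(obj_):
    """Collect the unique image identifier (admin.uid) of every image
    listed under an object's multimedia field."""
    ids = []
    for image_data in obj_.get("multimedia", []):
        foreign_id = image_data.get("admin", {}).get("uid")
        if foreign_id is not None:
            ids.append(foreign_id)
    return ids
```
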
The implementation can be found [here](https://github.com/creativecommons/cccatalog/blob/master/src/cc_catalog_airflow/dags/provider_api_scripts/science_museum.py).