series: gsoc-2020-cccatalog
pub_date: 2020-05-29
---
body:

## Introduction
The CC Catalog project is responsible for collecting CC licensed images available on the web. CC licensed images are hosted by different
sources; the sources that provide the images and their metadata are called providers. Currently, images are collected from providers using two methods:
Common Crawl and API based crawl. Common Crawl is an open repository of web crawled data, and we use that data to get the necessary image metadata
for a provider ([more information](https://commoncrawl.org/the-data/get-started/)). API crawl is implemented using the API endpoint maintained
by the provider. The main problem with Common Crawl is that we don't have control over the data they crawl, and this sometimes results in poor
data quality, whereas with an API based crawl we have access to all the information available. An API based crawl is also better when we need
to update image metadata at regular intervals.

As a part of the internship, I will be working on moving providers from Common Crawl to API based crawl as well as integrating new providers
into the API crawl. I will be starting with the Science Museum provider.

## Science Museum
Science Museum is a provider with around 80,000 CC licensed images; currently, Science Museum data is ingested from Common Crawl.
It is one of the providers where our data is of poor quality, and there is a need to improve it. This is done by moving
Science Museum to an API based crawl.

## API research
We want to index metadata using their open API [endpoint](https://collection.sciencemuseumgroup.org.uk/search/has_image/image_license).
However, before the implementation we have to ensure that the API provides the necessary content and that there is a systematic way to get it.
The first step is to take an object from their collection and check certain criteria.

[sample object](https://collection.sciencemuseumgroup.org.uk/api/objects/co8005638)

The criteria are:
- parameters available for the API
- Object landing url (frontend link of the object the image is associated with)
- Image url (the url link of the image)
- CC license associated with the image
- creator, title and other metadata info

Once the above checks have been made, we need to find a way to get all the objects; this could be by paging through the records,
partitioning using the parameters, etc. Since their API has a ```page[number]``` parameter, paging would be an appropriate choice. With the
max page size of 100, it would require around 800 pages to get all the objects, but they don't allow paging through a large number of results:
the max number of pages for Science Museum is 50. This would mean we would get only 5000 objects and around 17,000 images.

[API page-50](https://collection.sciencemuseumgroup.org.uk/search/image_license?page[size]=100&page[number]=50)

[API page-51](https://collection.sciencemuseumgroup.org.uk/search/image_license?page[size]=100&page[number]=51)
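
As a rough sketch (not the actual provider script), the paged requests could be built like this; the endpoint and parameter names come from the links above, while the helper itself is made up for illustration:

```python
from urllib.parse import urlencode

# Endpoint and parameters as seen in the links above; the helper is a
# made-up sketch, not the real provider script.
ENDPOINT = "https://collection.sciencemuseumgroup.org.uk/search/image_license"
PAGE_SIZE = 100  # maximum page size accepted by the API
MAX_PAGES = 50   # the API stops serving results after page 50

def page_url(page_number, page_size=PAGE_SIZE):
    # Build the URL for one page of results,
    # e.g. ...?page[size]=100&page[number]=50 (brackets get percent-encoded).
    query = urlencode({"page[size]": page_size, "page[number]": page_number})
    return f"{ENDPOINT}?{query}"

# Paging alone therefore caps the crawl at PAGE_SIZE * MAX_PAGES = 5000 objects.
```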

So we need to find a way to divide the collection into subsets such that each subset has at most 5000 objects.
Luckily, the API has another pair of parameters, ```date[from]``` and ```date[to]```, which represent the time period of the object.
Querying the API over different time periods, while ensuring that the records in each period don't exceed 5000, solves the problem;
starting from year 0 to year 2020, a suitable year range was chosen by trial and error.
```
YEAR_RANGE = [
    ...
]
```
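
With hypothetical year ranges (the real ones were tuned by trial and error so that no range exceeds 5000 objects), the date-partitioned queries could be sketched as:

```python
from urllib.parse import urlencode

ENDPOINT = "https://collection.sciencemuseumgroup.org.uk/search/image_license"

# Hypothetical ranges for illustration only; the real YEAR_RANGE was chosen
# by trial and error so that no range returns more than 5000 objects.
YEAR_RANGE = [(0, 1500), (1500, 1800), (1800, 1900), (1900, 2020)]

def range_query(from_year, to_year, page_number=1, page_size=100):
    # Combine the date filter with paging into a single request URL.
    return ENDPOINT + "?" + urlencode({
        "date[from]": from_year,
        "date[to]": to_year,
        "page[size]": page_size,
        "page[number]": page_number,
    })

# One URL per year range; each range is then paged through independently.
urls = [range_query(start, end) for start, end in YEAR_RANGE]
```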

With this we have a method to ingest the desired records, but before writing the script we need to know the different licenses
provided by the API. We need to figure out a consistent way to identify which license and version are attached to each object.
To do this, we ran a test script to get counts of objects under different licenses.

The results are:

```
+-----------------+----------+
...
```

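
The test script itself is not shown in the post; a minimal sketch of such a tally, assuming each record exposes a flat license string (the exact JSON path in the API response is an assumption), might look like:

```python
from collections import Counter

def tally_licenses(objects, get_license=lambda obj: obj.get("license")):
    # Count objects per license string. The accessor is an assumption: the
    # exact location of the license in the API response is not shown here.
    counts = Counter()
    for obj in objects:
        counts[get_license(obj) or "unknown"] += 1
    return counts
```

Grouping the counts this way makes it easy to spot unexpected license strings before mapping them onto our supported licenses and versions.
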
Since the licenses and their versions are confirmed, we can start the implementation.

## Implementation
The implementation is quite simple in nature: we loop through the ```YEAR_RANGE```, get all the records for each period, and
pass them on to an object data handler method that extracts the necessary details from each record and stores them in the ```ImageStore```
instance. ImageStore is a class that stores image information from the provider; it keeps the information in a buffer and writes it to a TSV file
when the buffer reaches a threshold limit. Because the date ranges overlap, the metadata for some objects is collected multiple times,
so we keep track of each record/object's id in a global variable ```RECORD_IDS = []```.

Within the object data handler method, before collecting details we check if the ```id``` already exists in ```RECORD_IDS```.
If it exists, we move on to the next record.

```
for obj_ in batch_data:
    ...
```
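
Filled in with hedged assumptions (the handler name and everything beyond the ```id``` check are illustrative), the dedup step could look like:

```python
RECORD_IDS = []  # global list of object ids that have already been handled

def handle_object(obj_):
    # Skip objects already seen in an earlier, overlapping year range.
    id_ = obj_.get("id")
    if id_ in RECORD_IDS:
        return None
    RECORD_IDS.append(id_)
    # ... extract metadata here and add it to the ImageStore instance ...
    return id_
```

A list membership check is O(n); a set would make it O(1), though the post describes a list.
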

```id_``` is the object id, and we cannot use this value as the foreign identifier. The reason is that an object can
have multiple images associated with it, and using the object id we cannot identify an image uniquely, so we must use an image id that is unique
for each image. Currently, the image id is taken from ```multimedia```, a field in the json response that lists the multiple
images and their metadata; for each image entry in multimedia, the foreign id is in ```admin.uid```.
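
As a sketch (the surrounding JSON layout is assumed rather than taken from the API docs), extracting the per-image foreign identifiers could look like:

```python
def image_foreign_ids(obj_):
    # Pull the per-image foreign identifier (admin.uid) from each entry in
    # the object's "multimedia" list; the exact nesting is an assumption.
    ids = []
    for media in obj_.get("multimedia", []):
        uid = media.get("admin", {}).get("uid")
        if uid is not None:
            ids.append(uid)
    return ids
```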

The implementation can be found [here](https://github.com/creativecommons/cccatalog/blob/master/src/cc_catalog_airflow/dags/provider_api_scripts/science_museum.py).

### Results:
Running the scripts we get:
