Commit 9040a6d

Merge pull request creativecommons#285 from kss682/blog-post-1
Science Museum implementation blog post
3 files changed, +145 −0 lines changed

@@ -0,0 +1,10 @@
username: kss682
---
name: K S Srinidhi Krishna
---
md5_hashed_email: 7E71C293A442A2CF434BFC244BD5F184
---
about:
Srinidhi Krishna is a computer science undergraduate student from India and will be interning with Creative Commons during the summer.
He is working on [cccatalog](https://github.com/creativecommons/cccatalog) as a part of GSoC 2020.
He is `@K S Srinidhi Krishna` on Slack.
@@ -0,0 +1,134 @@
title: Science Museum provider implementation
---
categories:

cc-catalog
gsoc
gsoc-2020
---
author: srinidhi
---
series: gsoc-2020-cccatalog
---
pub_date: 2020-06-10
---
body:

## Introduction
The CC Catalog project is responsible for collecting CC-licensed images available on the web. These images are hosted by different sources, and the sources that provide the images and their metadata are called providers. Currently, images are collected from providers using two methods: Common Crawl and API-based crawl. Common Crawl is an open repository of web crawl data, and we use that data to get the necessary image metadata for a provider ([more information](https://commoncrawl.org/the-data/get-started/)). The API-based crawl is implemented using the API endpoints maintained by the providers. The main problem with Common Crawl is that we don't have control over the data they crawl, which sometimes results in poor data quality, whereas with an API-based crawl we have access to all the information the provider makes available. An API-based crawl is also better when we need to update image metadata at regular intervals.

As a part of the internship, I will be working on moving providers from Common Crawl to the API-based crawl, as well as integrating new providers into the API crawl. I will be starting with the Science Museum provider.

## Science Museum
The Science Museum is a provider with around 80,000 CC-licensed images; currently, Science Museum data is ingested from Common Crawl. It is one of the providers where our data is of poor quality and needs to be improved. This is done by moving the Science Museum to an API-based crawl.

## API research
We want to index metadata using their open API [endpoint](https://collection.sciencemuseumgroup.org.uk/search/has_image/image_license).
However, before the implementation we have to ensure that the API provides the necessary content and that there is a systematic way to get it.
The first step is to take an object from their collection and check certain criteria.

[sample object](https://collection.sciencemuseumgroup.org.uk/api/objects/co8005638)

The criteria are:
- parameters available for the API
- object landing URL (the frontend link of the object the image is associated with)
- image URL (the URL of the image file)
- CC license associated with the image
- creator, title, and other metadata info

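As a quick manual check, the sample object can be fetched and its fields inspected before writing any real code. Here is a minimal exploratory sketch; the ```data``` wrapper around the JSON payload is an assumption about the response shape, not something confirmed in this post:

```
import requests

# fetch the sample object linked above and inspect its fields
resp = requests.get(
    "https://collection.sciencemuseumgroup.org.uk/api/objects/co8005638",
    headers={"Accept": "application/json"},
)
obj_ = resp.json().get("data", {})

# list the top-level fields, then drill into the ones matching the criteria
print(sorted(obj_.keys()))
```
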
Once the above checks have been made, we need to find a way to get all the objects; this could be by paging through the records, partitioning using the parameters, and so on. Since the API has a ```page[number]``` parameter, paging is an appropriate choice: with a maximum page size of 100, it would take around 800 pages to get all the objects. However, the API doesn't allow paging through a large number of results; the maximum number of pages for the Science Museum is 50. This means we would get only 5,000 objects and around 17,000 images.

[API page-50](https://collection.sciencemuseumgroup.org.uk/search/image_license?page[size]=100&page[number]=50)

[API page-51](https://collection.sciencemuseumgroup.org.uk/search/image_license?page[size]=100&page[number]=51)

So we need to find a way to divide the collection into subsets such that each subset has at most 5,000 objects.
Luckily, the API has another pair of parameters, ```date[from]``` and ```date[to]```, which represent the time period of the object.
Querying the API over different time periods, while ensuring that the records in each period don't exceed 5,000, solves the problem. Starting from year 0 to year 2020, a suitable set of year ranges was chosen by trial and error:

```
YEAR_RANGE = [
    (0, 1500),
    (1500, 1750),
    (1750, 1825),
    (1825, 1850),
    (1850, 1875),
    (1875, 1900),
    (1900, 1915),
    (1915, 1940),
    (1940, 1965),
    (1965, 1990),
    (1990, 2020)
]
```
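To make the partitioning concrete, here is a minimal sketch of how the year ranges combine with the paging parameters. The endpoint and parameter names come from the API links above; ```get_batches```, ```process_batch```, and the ```data``` key in the JSON response are illustrative assumptions, not the actual catalog code:

```
import requests

ENDPOINT = "https://collection.sciencemuseumgroup.org.uk/search/image_license"
PAGE_SIZE = 100
MAX_PAGES = 50  # the API stops serving results past page 50


def get_batches(year_from, year_to):
    # page through a single year range until an empty page or the page cap
    for page in range(1, MAX_PAGES + 1):
        response = requests.get(
            ENDPOINT,
            params={
                "date[from]": year_from,
                "date[to]": year_to,
                "page[size]": PAGE_SIZE,
                "page[number]": page,
            },
            headers={"Accept": "application/json"},
        )
        batch_data = response.json().get("data", [])
        if not batch_data:
            break
        yield batch_data


for year_from, year_to in YEAR_RANGE:
    for batch_data in get_batches(year_from, year_to):
        process_batch(batch_data)  # hand each batch to the object data handler
```
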
With this we have a method to ingest the desired records. But before writing the script, we need to know the different licenses provided by the API, and we need to figure out a consistent way to identify which license and version are attached to each object. To do this, we ran a test script to get counts of objects under different licenses.

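The test script itself isn't shown in the post, but the idea can be sketched by reusing ```get_batches``` from above; ```extract_license``` is a hypothetical helper standing in for whatever JSON path holds the license string:

```
from collections import Counter

license_counts = Counter()
for year_from, year_to in YEAR_RANGE:
    for batch_data in get_batches(year_from, year_to):
        for obj_ in batch_data:
            # extract_license is hypothetical: it would pull the license
            # string (e.g. "CC-BY-NC-SA 4.0") out of the object's JSON
            license_counts[extract_license(obj_)] += 1

for license_version, count in license_counts.most_common():
    print(license_version, count)
```
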
The results are:

```
+-----------------+----------+
| license_version | count(1) |
+-----------------+----------+
| CC-BY-NC-ND 2.0 |      210 |
| CC-BY-NC-ND 4.0 |     2376 |
| CC-BY-NC-SA 2.0 |        1 |
| CC-BY-NC-SA 4.0 |    61694 |
+-----------------+----------+
```

Since the licenses and their versions are confirmed, we can start the implementation.

## Implementation
The implementation is quite simple in nature: we loop through the ```YEAR_RANGE```, get all the records for each period, and pass them on to an object data handler method that extracts the necessary details from each record and stores them in the ```ImageStore``` instance. ImageStore is a class that stores image information from the provider; it holds the information in a buffer and inserts it into a TSV file when the buffer reaches a threshold limit. Due to overlapping date ranges, the metadata for some objects is collected multiple times. So, we keep track of each record/object's id in a global variable ```RECORD_IDS = []```.

Within the object data handler method, before collecting details we check whether the ```id``` already exists in ```RECORD_IDS```. If it does, we move on to the next record.

```
for obj_ in batch_data:
    id_ = obj_.get("id")
    if id_ in RECORD_IDS:
        # already collected under an overlapping year range; skip it
        continue
    RECORD_IDS.append(id_)
```

```id_``` is the object id, and we cannot use this value as the foreign identifier. The reason is that an object can have multiple images associated with it, and the object id cannot identify an image uniquely, so we must use an image id that is unique for each image. Currently the image id is taken from ```multimedia```, a field in the JSON response that lists the object's images and their metadata; for each image entry in multimedia, the foreign id is in ```admin.uid```.

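In outline, the per-object handling could look like the sketch below. The ```multimedia``` and ```admin.uid``` fields are the ones named above; ```handle_object```, ```_get_image_url```, and the exact ```ImageStore``` keyword arguments are simplified stand-ins for the real script:

```
def handle_object(obj_, landing_url, license_, license_version):
    # each image attached to the object is listed under "multimedia",
    # and the image-level foreign identifier lives at admin.uid
    for image_data in obj_.get("multimedia", []):
        foreign_id = image_data.get("admin", {}).get("uid")
        if foreign_id is None:
            continue
        image_store.add_item(
            foreign_identifier=foreign_id,
            foreign_landing_url=landing_url,
            image_url=_get_image_url(image_data),  # hypothetical helper
            license_=license_,
            license_version=license_version,
        )
```
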
The implementation can be found [here](https://github.com/creativecommons/cccatalog/blob/master/src/cc_catalog_airflow/dags/provider_api_scripts/science_museum.py).
### Results:
Running the scripts, we get:
- Number of records received: ```35584```
- Number of images collected: ```62497```

The problem with the current implementation is that records with no date would be missed.

The Science Museum provider is the first provider I worked on as a part of the internship, and I thank my mentor Brent Moran for his help.

### Additional Details:
- [research work](https://github.com/creativecommons/cccatalog/issues/302)
- [implementation](https://github.com/creativecommons/cccatalog/pull/400)
@@ -0,0 +1 @@
name: GSoC 2020: CC catalog
