|
| 1 | +title: Science Museum provider implementation |
| 2 | +--- |
| 3 | +categories: |
| 4 | + |
| 5 | +cc-catalog |
| 6 | +gsoc |
| 7 | +gsoc-2020 |
| 8 | +--- |
| 9 | +author: srinidhi |
| 10 | +--- |
| 11 | +series: gsoc-2020-cccatalog |
| 12 | +--- |
| 13 | +pub_date: 2020-06-10 |
| 14 | +--- |
| 15 | +body: |
| 16 | +## Introduction |
| 17 | +The CC Catalog project is responsible for collecting CC-licensed images available on the web. These images are hosted by different |
| 18 | +sources; the sources that provide the images and their metadata are called providers. Currently, images are collected from providers using two methods: |
| 19 | +Common Crawl and API-based crawl. Common Crawl is an open repository of web crawl data, and we use that data to get the necessary image metadata |
| 20 | +for a provider ([more information](https://commoncrawl.org/the-data/get-started/)). An API crawl is implemented using the API endpoint maintained |
| 21 | +by the provider. The main problem with Common Crawl is that we don't have control over the data it crawls, and this sometimes results in poor |
| 22 | +data quality, whereas with an API-based crawl we have access to the information the API makes available. An API-based crawl is also better when we need to update image |
| 23 | +metadata at regular intervals. |
| 24 | + |
| 25 | +As a part of the internship, I will be working on moving providers from Common Crawl to API-based crawl as well as integrating new providers |
| 26 | +into the API crawl. I will be starting with the Science Museum provider. |
| 27 | + |
| 28 | +## Science Museum |
| 29 | +Science Museum is a provider with around 80,000 CC-licensed images. Currently, Science Museum data is ingested from Common Crawl. |
| 30 | +It is one of the providers where our data is of poor quality and needs to be improved. We do this by moving |
| 31 | +Science Museum to an API-based crawl. |
| 32 | + |
| 33 | +## API research |
| 34 | +We want to index metadata using their open API [endpoint](https://collection.sciencemuseumgroup.org.uk/search/has_image/image_license). |
| 35 | +However, before the implementation we have to ensure that the API provides the necessary content and that there is a systematic way to get it. |
| 36 | +The first step is to take an object from their collection and check it against certain criteria. |
| 37 | + |
| 38 | +[sample object](https://collection.sciencemuseumgroup.org.uk/api/objects/co8005638) |
| 39 | + |
| 40 | +The criteria are: |
| 41 | +- Parameters available for the API |
| 42 | +- Object landing URL (frontend link of the object the image is associated with) |
| 43 | +- Image URL (the link to the image itself) |
| 44 | +- CC license associated with the image |
| 45 | +- Creator, title, and other metadata |
| 46 | + |
| 47 | +Once the above checks have been made, we need to find a way to get all the objects; this could be by paging through the records |
| 48 | +or by partitioning them using the parameters. Since their API has a ```page[number]``` parameter, paging would be an appropriate choice. With a maximum page size |
| 49 | +of 100, it would require around 800 pages to get all the objects. However, they don't allow paging through a large number of results: |
| 50 | +the maximum number of pages for Science Museum is 50. This would mean we would get only 5000 objects (50 pages of 100) and around 17000 images. |
| 51 | + |
| 52 | +[API page-50](https://collection.sciencemuseumgroup.org.uk/search/image_license?page[size]=100&page[number]=50) |
| 53 | + |
| 54 | +[API page-51](https://collection.sciencemuseumgroup.org.uk/search/image_license?page[size]=100&page[number]=51) |
| 55 | + |
| 56 | +So we need to find a way to divide the collection into subsets such that each subset has at most 5000 objects. |
| 57 | +Luckily, the API has another pair of parameters, ```date[from]``` and ```date[to]```, which represent the time period of the object. |
| 58 | +Querying the API one time period at a time, while ensuring that the records in each period don't exceed 5000, solves the problem. Starting |
| 59 | +from year 0 and going up to year 2020, suitable year ranges were chosen by trial and error. |
| 60 | + |
| 61 | +``` |
| 62 | +YEAR_RANGE = [ |
| 63 | +    (0, 1500), |
| 64 | +    (1500, 1750), |
| 65 | +    (1750, 1825), |
| 66 | +    (1825, 1850), |
| 67 | +    (1850, 1875), |
| 68 | +    (1875, 1900), |
| 69 | +    (1900, 1915), |
| 70 | +    (1915, 1940), |
| 71 | +    (1940, 1965), |
| 72 | +    (1965, 1990), |
| 73 | +    (1990, 2020) |
| 74 | +] |
| 75 | +``` |
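As a rough illustration of how the year ranges and the page limit combine, here is a minimal sketch that only builds query parameters and makes no network calls. The function names and the `first_page` value are assumptions made for this example (the post does not say whether page numbers start at 0 or 1); only the four parameter names and the limits of 100 and 50 come from the text above.

```python
from typing import Dict, Iterator, List, Tuple

# Year ranges copied from the post.
YEAR_RANGE: List[Tuple[int, int]] = [
    (0, 1500), (1500, 1750), (1750, 1825), (1825, 1850),
    (1850, 1875), (1875, 1900), (1900, 1915), (1915, 1940),
    (1940, 1965), (1965, 1990), (1990, 2020),
]

PAGE_SIZE = 100  # maximum page size mentioned in the post
MAX_PAGES = 50   # maximum number of pages the API allows


def build_query_params(
    from_year: int, to_year: int, page_number: int, page_size: int = PAGE_SIZE
) -> Dict[str, int]:
    """Build the query parameters for one page of one year range (hypothetical helper)."""
    return {
        "page[size]": page_size,
        "page[number]": page_number,
        "date[from]": from_year,
        "date[to]": to_year,
    }


def iter_query_params(
    year_ranges: List[Tuple[int, int]] = YEAR_RANGE,
    max_pages: int = MAX_PAGES,
    first_page: int = 0,  # assumption: the starting page number is not stated in the post
) -> Iterator[Dict[str, int]]:
    """Yield parameters for every (year range, page) pair up to the page limit.

    A real crawler would stop paging a range as soon as a page comes back empty.
    """
    for from_year, to_year in year_ranges:
        for page_number in range(first_page, first_page + max_pages):
            yield build_query_params(from_year, to_year, page_number)
```

Each range can return at most `PAGE_SIZE * MAX_PAGES` = 5000 objects, which is why every range has to stay under that count.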
| 76 | + |
| 77 | + |
| 78 | +With this we have a method to ingest the desired records, but before writing the script we need to know the different licenses |
| 79 | +provided by the API. We need to figure out a consistent way to identify which license and version are attached to each object. |
| 80 | +To do this, we ran a test script to get counts of objects under different licenses. |
| 81 | + |
| 82 | +The results are: |
| 83 | + |
| 84 | +``` |
| 85 | ++-----------------+----------+ |
| 86 | +| license_version | count(1) | |
| 87 | ++-----------------+----------+ |
| 88 | +| CC-BY-NC-ND 2.0 | 210 | |
| 89 | +| CC-BY-NC-ND 4.0 | 2376 | |
| 90 | +| CC-BY-NC-SA 2.0 | 1 | |
| 91 | +| CC-BY-NC-SA 4.0 | 61694 | |
| 92 | ++-----------------+----------+ |
| 93 | +``` |
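One possible way to turn such a combined string into a separate license and version is sketched below. It assumes the value has the shape shown in the `license_version` column above (for example `CC-BY-NC-SA 4.0`); the function name is hypothetical and the actual script may extract these fields differently.

```python
from typing import Optional, Tuple


def parse_license_version(raw: str) -> Optional[Tuple[str, str]]:
    """Split a string like 'CC-BY-NC-SA 4.0' into ('by-nc-sa', '4.0').

    Returns None when the string does not have the assumed
    '<CC-license> <version>' shape.
    """
    parts = raw.strip().split()
    if len(parts) != 2:
        return None
    name, version = parts
    name = name.lower()
    if not name.startswith("cc-"):
        return None
    # Drop the leading 'cc-' so only the license elements remain.
    return name[len("cc-"):], version
```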
| 94 | + |
| 95 | +Since the licenses and their versions are confirmed, we can start the implementation. |
| 96 | + |
| 97 | +## Implementation |
| 98 | +The implementation is quite simple in nature: we loop through the ```YEAR_RANGE```, get all the records for each period, and |
| 99 | +pass them on to an object data handler method that extracts the necessary details from each record and stores them in the ```ImageStore``` |
| 100 | +instance. ```ImageStore``` is a class that stores image information from the provider; it keeps the information in a buffer and writes it to a TSV file |
| 101 | +when the buffer reaches its threshold limit. Because the date ranges overlap, the metadata for some objects is collected multiple times. |
| 102 | +So, we keep track of the record/object ids in a global variable ```RECORD_IDS = []```. |
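The buffering idea can be sketched roughly as follows. This is not the real `ImageStore` class: the class name, method names, and default threshold are made up for illustration, and it only shows the "collect rows, write when the buffer is full" pattern described above.

```python
import io
from typing import List, TextIO


class BufferedTsvWriter:
    """Hypothetical stand-in for the buffering behaviour described above."""

    def __init__(self, output: TextIO, buffer_length: int = 100) -> None:
        self._output = output
        self._buffer_length = buffer_length
        self._buffer: List[str] = []

    def add_item(self, *fields: str) -> None:
        """Queue one row; write everything out once the threshold is reached."""
        self._buffer.append("\t".join(fields) + "\n")
        if len(self._buffer) >= self._buffer_length:
            self.flush()

    def flush(self) -> None:
        """Write all buffered rows to the output and empty the buffer."""
        self._output.writelines(self._buffer)
        self._buffer.clear()


# Usage with an in-memory file and a tiny threshold (sample values are made up):
out = io.StringIO()
writer = BufferedTsvWriter(out, buffer_length=2)
writer.add_item("image-1", "https://example.org/1.jpg")
writer.add_item("image-2", "https://example.org/2.jpg")  # threshold reached, rows written
writer.add_item("image-3", "https://example.org/3.jpg")  # stays in the buffer
writer.flush()  # write the remainder at the end of the run
```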
| 103 | + |
| 104 | +Within the object data handler method, before collecting details, we check whether the ```id``` already exists in ```RECORD_IDS```. |
| 105 | +If it does, we move on to the next record. |
| 106 | + |
| 107 | +``` |
| 108 | +for obj_ in batch_data: |
| 109 | +    id_ = obj_.get("id") |
| 110 | +    if id_ in RECORD_IDS: |
| 111 | +        continue |
| 112 | +    RECORD_IDS.append(id_) |
| 113 | +``` |
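A variation on this check, offered only as a sketch: with tens of thousands of records, membership tests on a list get slower as it grows, so a `set` may be a better fit. The function name and the sample batches below are hypothetical; the skip logic itself mirrors the loop above.

```python
from typing import Any, Dict, Iterable, Iterator, Set


def iter_new_objects(
    batch_data: Iterable[Dict[str, Any]], seen_ids: Set[Any]
) -> Iterator[Dict[str, Any]]:
    """Yield only the objects whose id has not been seen in an earlier batch."""
    for obj_ in batch_data:
        id_ = obj_.get("id")
        if id_ in seen_ids:
            continue  # already collected in an overlapping date range
        seen_ids.add(id_)
        yield obj_


# Two overlapping batches with made-up ids:
seen: Set[Any] = set()
first_batch = list(iter_new_objects([{"id": "co1"}, {"id": "co2"}], seen))
second_batch = list(iter_new_objects([{"id": "co2"}, {"id": "co3"}], seen))
```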
| 114 | + |
| 115 | +```id_``` is the object id, and we cannot use this value as the foreign identifier. The reason is that an object could |
| 116 | +have multiple images, and the object id alone cannot identify an image uniquely, so we must use an image id that is unique |
| 117 | +for each image. Currently the image id is taken from ```multimedia```, a field in the JSON response that lists the object's |
| 118 | +images and their metadata. For each image entry in ```multimedia```, the foreign id is in ```admin.uid```. |
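To make the lookup concrete, here is a minimal sketch of pulling the image ids out of a `multimedia` list. Only the `multimedia` field and the `admin.uid` path come from the description above; the function name and the sample data are invented for illustration, and where `multimedia` sits inside the full API response is not shown here.

```python
from typing import Any, Dict, List


def get_image_ids(multimedia: List[Dict[str, Any]]) -> List[str]:
    """Collect the foreign id (admin.uid) of every image entry that has one."""
    image_ids = []
    for image_data in multimedia:
        uid = image_data.get("admin", {}).get("uid")
        if uid is not None:
            image_ids.append(uid)
    return image_ids


# Made-up sample: one object with two images and one entry lacking an id.
sample_multimedia = [
    {"admin": {"uid": "i1"}},
    {"admin": {"uid": "i2"}},
    {"admin": {}},
]
sample_ids = get_image_ids(sample_multimedia)
```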
| 119 | + |
| 120 | +The implementation can be found [here](https://github.com/creativecommons/cccatalog/blob/master/src/cc_catalog_airflow/dags/provider_api_scripts/science_museum.py). |
| 121 | + |
| 122 | +### Results: |
| 123 | +Running the script, we get: |
| 124 | +- Number of records received: ```35584``` |
| 125 | +- Number of images collected: ```62497``` |
| 126 | + |
| 127 | +The problem with the current implementation is that records with no date would be missed. |
| 128 | + |
| 129 | +The Science Museum provider is the first provider I worked on as part of the internship, and I thank my mentor Brent Moran for his help. |
| 130 | + |
| 131 | +### Additional Details: |
| 132 | +- [research work](https://github.com/creativecommons/cccatalog/issues/302) |
| 133 | +- [implementation](https://github.com/creativecommons/cccatalog/pull/400) |
| 134 | + |