title: Data flow: from API to DB
---
categories:

cc-catalog
airflow
gsoc
gsoc-2020
---
author: srinidhi
---
series: gsoc-2020-cccatalog
---
pub_date: 2020-07-13
---
body:
## Introduction
The CC Catalog project handles the flow of image metadata from the source or provider into the database, from which
it is surfaced in the [CC Search](https://ccsearch.creativecommons.org/about) tool. A workflow is set up for each
provider to gather metadata about CC licensed images, and these workflows are orchestrated with Apache Airflow, an
open source tool that helps us schedule and monitor workflows.

## Airflow intro
Apache Airflow is an open source tool that helps us schedule tasks and monitor workflows. It provides an easy-to-use
UI for managing tasks. In Airflow, the tasks we want to schedule are organised into DAGs (Directed Acyclic Graphs).
A DAG is a collection of tasks together with the relationships defined among them, so that they run in an organised manner.
DAG files are standard Python files that are loaded from the defined ```DAG_FOLDER``` on a host.
Airflow selects all the Python files in the ```DAG_FOLDER``` that have a DAG instance defined globally, and executes them to
create the DAG objects.
## CC Catalog Workflow
In the CC Catalog, Airflow is set up inside a Docker container along with other services. The loader and provider workflows are
inside the ```dags``` directory in the [cccatalog repository](https://github.com/creativecommons/cccatalog). Provider workflows
pull metadata about CC licensed images from the respective providers; the data pulled is structured into a standardised format
and written into a TSV (Tab Separated Values) file locally. These TSV files are then loaded into S3 and finally into the
PostgreSQL DB by the loader workflow.
## Provider API workflow
The provider workflows are usually scheduled at one of two frequencies: daily or monthly.

Providers such as Flickr or Wikimedia Commons, whose APIs can be filtered by a date parameter, are usually scheduled as daily jobs.
These providers have a large volume of continuously changing data, so daily updates are required to keep the data in sync.

Providers scheduled for monthly ingestion are those with a relatively low volume of data, or for which filtering by date
is not possible, which means we need to ingest the entire collection at once. Examples are museum providers like the
[Science Museum UK](https://collection.sciencemuseumgroup.org.uk/) or [Statens Museum for Kunst](https://www.smk.dk/).
We don't expect museum providers to change their data on a daily basis.
| 48 | + |
| 49 | +The scheduling of the DAGs by the scheduler daemons depends on a few parameters. |
| 50 | +- ```start_date``` - it denotes the starting date from which the task should begin running. |
| 51 | +- ```schedule_interval``` - it denotes the interval between subsequent runs, it can be |
| 52 | +specified with airflow keyword strings like “@daily”, “@weekly”, “@monthly”, “@yearly” |
| 53 | +other than these we can also schedule the interval using cron expression. |
| 54 | + |
| 55 | + |
Example: the [Cleveland Museum workflow](https://github.com/creativecommons/cccatalog/blob/master/src/cc_catalog_airflow/dags/cleveland_museum_workflow.py)
is currently scheduled for a monthly crawl with a start date of ```2020-01-15```.

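As a rough sketch of how such a schedule is expressed (the callable and task id below are placeholders, not the
actual workflow code), a monthly DAG with that start date could be declared like this:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10-style import


def ingest_cleveland_data():
    # Placeholder for the provider API ingestion logic.
    pass


dag = DAG(
    dag_id="cleveland_museum_workflow",
    start_date=datetime(2020, 1, 15),  # date from which runs begin
    schedule_interval="@monthly",      # one crawl per month
    catchup=False,
)

ingest_task = PythonOperator(
    task_id="pull_cleveland_museum_data",
    python_callable=ingest_cleveland_data,
    dag=dag,
)
```
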
## Loader workflow
The data from the provider scripts is not loaded directly into S3. Instead, it is stored in a TSV file on the local disk, and
the ```tsv_postgres``` workflow handles loading the data to S3, and eventually to PostgreSQL. The DAG starts by calling the task to stage
the oldest TSV file from the output directory of the provider scripts to the staging directory. Next, two tasks run in parallel:
one loads the TSV file in the staging directory to S3, while the other creates the loading table in the PostgreSQL database.
Once the data is in S3 and the loading table has been created, the data from S3 is loaded into the intermediate loading table
and then finally inserted into the image table. If loading from S3 fails, the data is loaded into PostgreSQL from the locally stored
TSV file. When the data has been successfully transferred to the image table, the intermediate loading table is dropped and the
TSV files in the staging directory are deleted. If either the copy to S3 or the load into PostgreSQL fails, the files are moved to the
failure directory for future inspection.
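
A skeleton of that task structure might look roughly like the following (the DAG id, task ids and callables are
placeholders, and the fallback branch for S3 failures is omitted; the real DAG lives in the ```dags``` directory):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10-style import


def stage_oldest_tsv():
    """Move the oldest TSV from the provider output directory to staging."""


def load_tsv_to_s3():
    """Copy the staged TSV file to the S3 bucket."""


def create_loading_table():
    """Create the intermediate loading table in PostgreSQL."""


def load_s3_to_postgres():
    """Copy the data from S3 into the loading table, then into the image table."""


def clean_up():
    """Drop the loading table and delete the staged TSV files."""


dag = DAG(
    dag_id="tsv_to_postgres_loader",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
)

stage = PythonOperator(task_id="stage_oldest_tsv", python_callable=stage_oldest_tsv, dag=dag)
to_s3 = PythonOperator(task_id="load_to_s3", python_callable=load_tsv_to_s3, dag=dag)
create_table = PythonOperator(task_id="create_loading_table", python_callable=create_loading_table, dag=dag)
load_db = PythonOperator(task_id="load_s3_to_postgres", python_callable=load_s3_to_postgres, dag=dag)
cleanup = PythonOperator(task_id="clean_up", python_callable=clean_up, dag=dag)

# Staging runs first; the S3 upload and table creation run in parallel;
# the load into PostgreSQL waits for both; clean-up runs last.
stage >> [to_s3, create_table]
[to_s3, create_table] >> load_db
load_db >> cleanup
```

Running the S3 upload and the table creation in parallel works because neither step depends on the other; the load task
only needs both to have finished.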

<div style="text-align:center;">
    <img src="loader_workflow.png" width="1000px"/>
    <p> Loader workflow </p>
</div>