title: Data flow: from API to DB
---
categories:

cc-catalog
airflow
gsoc
gsoc-2020
---
author: srinidhi
---
series: gsoc-2020-cccatalog
---
pub_date: 2020-07-22
---
body:

## Introduction
The CC Catalog project handles the flow of image metadata from the source or
provider and loads it into the database, which is then surfaced to the
[CC search][CC_search] tool. Workflows are set up for each provider to gather
metadata about CC licensed images. These workflows are handled with the help of
Apache Airflow, an open source tool that helps us schedule and monitor
workflows.

[CC_search]: https://ccsearch.creativecommons.org/about

## Airflow intro
Apache Airflow is an open source tool that helps us schedule tasks and monitor
workflows. It provides an easy-to-use UI that makes managing tasks simple. In
Airflow, the tasks we want to schedule are organised in DAGs (Directed Acyclic
Graphs). A DAG consists of a collection of tasks and the relationships defined
among them, so that they run in an organised manner. DAG files are standard
Python files that are loaded from the defined `DAG_FOLDER` on a host. Airflow
picks up every Python file in the `DAG_FOLDER` that has a DAG instance defined
globally and executes it to create the DAG object.

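To make that concrete, here is a minimal sketch of a DAG file that Airflow
could discover in the `DAG_FOLDER`; the DAG id, schedule, and task are
illustrative placeholders, not code from the catalog repository.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def pull_image_metadata():
    # Placeholder: a real provider task would call the provider API
    # and write the resulting metadata to a local TSV file.
    pass


# Airflow finds this module in DAG_FOLDER and registers the globally
# defined DAG object below.
dag = DAG(
    dag_id="example_provider_workflow",  # illustrative name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
)

pull_metadata = PythonOperator(
    task_id="pull_image_metadata",
    python_callable=pull_image_metadata,
    dag=dag,
)
```
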
## CC Catalog Workflow
In the CC Catalog, Airflow is set up inside a Docker container along with other
services. The loader and provider workflows live in the `dags` directory of the
repo ([dag folder][dags]). Provider workflows pull metadata about CC licensed
images from the respective providers; the pulled data is structured into a
standardised format and written to a TSV (Tab Separated Values) file locally.
These TSV files are then loaded into S3 and finally into the PostgreSQL DB by
the loader workflow.

[dags]: https://github.com/creativecommons/cccatalog/tree/dacb48d24c6ae9b532ff108589b9326bde0d37a3/src/cc_catalog_airflow/dags

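As a rough illustration of that standardised TSV output, a provider script
might write one tab-separated row per image; the column names and values below
are illustrative only, not the catalog's actual schema.

```python
import csv

# Illustrative subset of columns; the real column set is defined by the
# catalog's image store code.
FIELDS = [
    "foreign_identifier", "foreign_landing_url", "image_url",
    "license", "license_version", "creator", "title", "provider",
]

record = {
    "foreign_identifier": "12345",
    "foreign_landing_url": "https://example.org/images/12345",
    "image_url": "https://example.org/media/12345.jpg",
    "license": "by-sa",
    "license_version": "4.0",
    "creator": "Jane Doe",
    "title": "An example image",
    "provider": "exampleprovider",
}

# Append one row per image to the local TSV file that the loader
# workflow will later pick up.
with open("exampleprovider_images.tsv", "a", newline="") as tsv_file:
    writer = csv.DictWriter(tsv_file, fieldnames=FIELDS, delimiter="\t")
    writer.writerow(record)
```
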
## Provider API workflow
The provider workflows are usually scheduled at one of two frequencies: daily
or monthly.

Providers such as Flickr or Wikimedia Commons, which can be filtered using a
date parameter, are usually scheduled as daily jobs. These providers have a
large volume of continuously changing data, so daily updates are required to
keep the data in sync.

Providers scheduled for monthly ingestion are ones with a relatively low volume
of data, or for which filtering by date is not possible, which means we need to
ingest the entire collection at once. Examples are museum providers like the
[Science Museum UK][science_museum] or [Statens Museum for Kunst][smk]. We
don’t expect museum providers to change data on a daily basis.

[science_museum]: https://collection.sciencemuseumgroup.org.uk/
[smk]: https://www.smk.dk/

The scheduling of the DAGs by the scheduler daemons depends on a few
parameters.

- ```start_date``` - the date from which the task should begin running.
- ```schedule_interval``` - the interval between subsequent runs. It can be
specified with Airflow keyword strings like “@daily”, “@weekly”, “@monthly” and
“@yearly”, or with a cron expression.

Example: the Cleveland Museum is currently scheduled for a monthly crawl with a
starting date of ```2020-01-15``` (see the sketch below).
[cleveland_museum_workflow][clm_workflow]

[clm_workflow]: https://github.com/creativecommons/cccatalog/blob/dacb48d24c6ae9b532ff108589b9326bde0d37a3/src/cc_catalog_airflow/dags/cleveland_museum_workflow.py
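
As a hedged sketch (the DAG id is taken from the workflow file name and the
definition is a placeholder, not the actual workflow code), that monthly
schedule with a 2020-01-15 start date would be expressed roughly like this:

```python
from datetime import datetime

from airflow import DAG

# Illustrative DAG definition; the real workflow file defines its own
# tasks and default arguments.
dag = DAG(
    dag_id="cleveland_museum_workflow",
    start_date=datetime(2020, 1, 15),  # the starting date of the crawl
    schedule_interval="@monthly",      # a cron expression would also work here
)
```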

## Loader workflow
The data from the provider scripts is not loaded directly into S3. Instead, it
is stored in a TSV file on the local disk, and the tsv_postgres workflow
handles loading the data to S3 and eventually to PostgreSQL. The DAG starts by
calling a task that stages the oldest TSV file from the output directory of the
provider scripts into the staging directory. Next, two tasks run in parallel:
one loads the TSV file in the staging directory to S3, while the other creates
the loading table in the PostgreSQL database. Once the data is loaded to S3 and
the loading table has been created, the data from S3 is loaded into the
intermediate loading table and then finally inserted into the image table. If
loading from S3 fails, the data is loaded into PostgreSQL from the locally
stored TSV file. When the data has been successfully transferred to the image
table, the intermediate loading table is dropped and the TSV files in the
staging directory are deleted. If copying the TSV files to S3 fails, those
files are moved to the failure directory for future inspection.
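
The task ordering described above can be sketched with Airflow dependency
operators; the task ids, callables, and DAG name below are illustrative
placeholders, and the fallback and failure-handling branches are omitted.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


# Placeholder callables; the real workflow implements these steps.
def stage_oldest_tsv():
    pass  # move the oldest TSV file into the staging directory

def load_tsv_to_s3():
    pass  # copy the staged TSV file to S3

def create_loading_table():
    pass  # create the intermediate loading table in PostgreSQL

def load_s3_to_postgres():
    pass  # load from S3 into the loading table, then into the image table

def clean_up():
    pass  # drop the loading table and delete the staged TSV files


dag = DAG(
    dag_id="tsv_to_postgres_loader",  # illustrative name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
)

stage = PythonOperator(task_id="stage_oldest_tsv",
                       python_callable=stage_oldest_tsv, dag=dag)
to_s3 = PythonOperator(task_id="load_tsv_to_s3",
                       python_callable=load_tsv_to_s3, dag=dag)
create_table = PythonOperator(task_id="create_loading_table",
                              python_callable=create_loading_table, dag=dag)
load_db = PythonOperator(task_id="load_s3_to_postgres",
                         python_callable=load_s3_to_postgres, dag=dag)
cleanup = PythonOperator(task_id="clean_up",
                         python_callable=clean_up, dag=dag)

# Stage first; the S3 copy and table creation run in parallel; both must
# finish before the database load, which is followed by the cleanup step.
stage >> [to_s3, create_table]
[to_s3, create_table] >> load_db
load_db >> cleanup
```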

<div style="text-align:center;">
  <img src="loader_workflow.png" width="1000px"/>
  <p> Loader workflow </p>
</div>

## Acknowledgement

I would like to thank Brent Moran for helping me write this blog post.