title: Data flow: from API to DB
---
categories:

cc-catalog
airflow
gsoc
gsoc-2020
---
author: srinidhi
---
series: gsoc-2020-cccatalog
---
pub_date: 2020-07-15
---
body:

## Introduction
The CC Catalog project handles the flow of image metadata from the source or
provider and loads it into the database, which is then surfaced to the
[CC search][CC_search] tool. The workflows are set up for each provider to
gather metadata about CC licensed images. These workflows are handled with the
help of Apache Airflow. Airflow is an open source tool that helps us schedule
and monitor workflows.

[CC_search]: https://ccsearch.creativecommons.org/about

## Airflow intro
Apache Airflow is an open source tool that helps us schedule tasks and monitor
workflows. It provides an easy to use UI that makes managing tasks simple. In
Airflow, the tasks we want to schedule are organised in DAGs (Directed Acyclic
Graphs). A DAG consists of a collection of tasks and the relationships defined
among them, so that they run in an organised manner. DAG files are standard
Python files that are loaded from the defined `DAG_FOLDER` on a host. Airflow
selects all the Python files in the `DAG_FOLDER` that have a DAG instance
defined globally, and executes them to create the DAG objects.
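
For illustration, here is a minimal sketch of such a DAG file; the DAG id,
schedule and task below are made up for this example and are not taken from
the CC Catalog code:

```python
# example_dag.py - a hypothetical file placed in the DAG_FOLDER
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def say_hello():
    print("Hello from Airflow!")


# The DAG instance is defined at module (global) level, which is how
# Airflow discovers it when scanning the DAG_FOLDER.
dag = DAG(
    dag_id="example_workflow",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
)

hello_task = PythonOperator(
    task_id="say_hello",
    python_callable=say_hello,
    dag=dag,
)
```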

## CC Catalog Workflow
In the CC Catalog, Airflow is set up inside a Docker container along with other
services. The loader and provider workflows live inside the `dags` directory in
the repo ([dag folder][dags]). Provider workflows are set up to pull metadata
about CC licensed images from the respective providers; the data pulled is
structured into a standardised format and written into a TSV (Tab Separated
Values) file locally. These TSV files are then loaded into S3 and finally into
the PostgreSQL DB by the loader workflow.

[dags]: https://github.com/creativecommons/cccatalog/tree/master/src/cc_catalog_airflow/dags
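
As a rough illustration of that intermediate TSV step, a provider script
conceptually does something like the following; the file name and column names
here are invented for the example and are not the project's actual
standardised fields:

```python
import csv

# A hypothetical, simplified record pulled from a provider API.
records = [
    {
        "foreign_landing_url": "https://example.org/image/1",
        "image_url": "https://example.org/image/1.jpg",
        "license": "by",
        "license_version": "4.0",
        "creator": "Jane Doe",
    },
]

# Write the standardised records to a local TSV file; the loader workflow
# later moves this file to S3 and then into PostgreSQL.
with open("example_provider_20200715.tsv", "w", newline="") as tsv_file:
    writer = csv.DictWriter(
        tsv_file, fieldnames=records[0].keys(), delimiter="\t"
    )
    writer.writeheader()
    writer.writerows(records)
```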

## Provider API workflow
The provider workflows are usually scheduled at one of two frequencies: daily
or monthly.

Providers such as Flickr or Wikimedia Commons that can be filtered using a date
parameter are usually scheduled as daily jobs. These providers have a large
volume of continuously changing data, so daily updates are required to keep the
data in sync.

Providers that are scheduled for monthly ingestion are ones with a relatively
low volume of data, or for which filtering by date is not possible, which means
we need to ingest the entire collection at once. Examples are museum providers
like the [Science Museum UK][science_museum] or the [Statens Museum for
Kunst][smk]. We don’t expect museum providers to change data on a daily basis.

[science_museum]: https://collection.sciencemuseumgroup.org.uk/
[smk]: https://www.smk.dk/

The scheduling of the DAGs by the scheduler daemon depends on a few
parameters:

- `start_date` - denotes the starting date from which the task should begin
  running.
- `schedule_interval` - denotes the interval between subsequent runs; it can be
  specified with Airflow keyword strings like `@daily`, `@weekly`, `@monthly`
  and `@yearly`, or with a cron expression.

Example: the Cleveland Museum is currently scheduled for a monthly crawl with a
starting date of `2020-01-15`: [cleveland_museum_workflow][clm_workflow]

[clm_workflow]: https://github.com/creativecommons/cccatalog/blob/master/src/cc_catalog_airflow/dags/cleveland_museum_workflow.py
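
As a sketch of how such a schedule is expressed (the DAG id below is
illustrative and the arguments are simplified, not copied from the actual
workflow file):

```python
from datetime import datetime

from airflow import DAG

# Illustrative only: a DAG scheduled monthly, starting on 2020-01-15,
# similar in spirit to the Cleveland Museum workflow's schedule.
dag = DAG(
    dag_id="cleveland_museum_workflow_example",
    start_date=datetime(2020, 1, 15),
    schedule_interval="@monthly",  # a cron expression would also work here
)
```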

## Loader workflow
The data from the provider scripts is not loaded directly into S3. Instead, it
is stored in a TSV file on the local disk, and the `tsv_postgres` workflow
handles loading the data to S3, and eventually to PostgreSQL. The DAG starts by
calling a task to stage the oldest TSV file from the output directory of the
provider scripts to the staging directory. Next, two tasks run in parallel: one
loads the TSV file in the staging directory to S3, while the other creates the
loading table in the PostgreSQL database. Once the data is loaded to S3 and the
loading table has been created, the data from S3 is loaded into the
intermediate loading table and then finally inserted into the image table. If
loading from S3 fails, the data is loaded into PostgreSQL from the locally
stored TSV file. When the data has been successfully transferred to the image
table, the intermediate loading table is dropped and the TSV files in the
staging directory are deleted. If copying the TSV files to S3 fails, those
files are moved to the failure directory for future inspection.

<div style="text-align:center;">
    <img src="loader_workflow.png" width="1000px"/>
    <p> Loader workflow </p>
</div>
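
To make the shape of that task graph concrete, below is a rough sketch of how
the dependencies could be wired up in Airflow; the task ids and the placeholder
callable are invented for this example and stand in for the project's actual
staging and loading logic:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id="tsv_postgres_workflow_example",  # placeholder name
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,  # scheduling details omitted in this sketch
)


def placeholder(**kwargs):
    # Stands in for the real staging, loading and clean-up logic.
    pass


def make_task(task_id):
    # Helper to build a no-op task for the sketch.
    return PythonOperator(task_id=task_id, python_callable=placeholder, dag=dag)


stage_oldest_tsv = make_task("stage_oldest_tsv")
load_to_s3 = make_task("load_to_s3")
create_loading_table = make_task("create_loading_table")
load_s3_to_postgres = make_task("load_s3_to_postgres")
drop_loading_table = make_task("drop_loading_table")
delete_staged_tsv = make_task("delete_staged_tsv")

# Stage the oldest TSV first, then load it to S3 and create the loading
# table in parallel, then move the data into the image table, and finally
# clean up the loading table and the staged TSV files.
stage_oldest_tsv >> [load_to_s3, create_loading_table]
[load_to_s3, create_loading_table] >> load_s3_to_postgres
load_s3_to_postgres >> [drop_loading_table, delete_staged_tsv]
```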