title: Data flow: from API to DB
---
categories:
cc-catalog
airflow
gsoc
gsoc-2020
---
author: srinidhi
---
series: gsoc-2020-cccatalog
---
pub_date: 2020-07-15
---
body:

## Introduction
The CC Catalog project handles the flow of image metadata from the source or
provider and loads it to the database, which is then surfaced to the
[CC search][CC_search] tool. The workflows are set up for each provider to
gather metadata about CC licensed images. These workflows are handled with the
help of Apache Airflow. Airflow is an open source tool that helps us to
schedule and monitor workflows.

[CC_search]: https://ccsearch.creativecommons.org/about

## Airflow intro
Apache Airflow is an open source tool that helps us to schedule tasks and
monitor workflows. It provides an easy-to-use UI that makes managing tasks
easy. In Airflow, the tasks we want to schedule are organised in DAGs
(Directed Acyclic Graphs). DAGs consist of a collection of tasks, and a
relationship defined among these tasks, so that they run in an organised
manner. DAG files are standard Python files that are loaded from the defined
`DAG_FOLDER` on a host. Airflow selects all the Python files in the
`DAG_FOLDER` that have a DAG instance defined globally, and executes them to
create the DAG objects.
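
To make this concrete, here is a minimal sketch of a DAG file that Airflow
could discover in the `DAG_FOLDER`. The DAG id, task names and callables are
invented for illustration; they are not taken from the CC Catalog code.

```python
# minimal_example_dag.py - a file placed inside the configured DAG_FOLDER
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def pull_metadata():
    # Placeholder for a call to a provider API.
    print("pulling image metadata")


def write_tsv():
    # Placeholder for writing the standardised TSV file.
    print("writing TSV")


# The DAG instance is defined at module (global) level, which is how Airflow
# discovers it when it scans the DAG_FOLDER.
dag = DAG(
    dag_id="minimal_example_dag",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
)

pull_task = PythonOperator(
    task_id="pull_metadata", python_callable=pull_metadata, dag=dag
)
write_task = PythonOperator(
    task_id="write_tsv", python_callable=write_tsv, dag=dag
)

# The relationship between the tasks: pull_task must run before write_task.
pull_task >> write_task
```
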
## CC Catalog Workflow
In the CC Catalog, Airflow is set up inside a docker container along with other
services. The loader and provider workflows are inside the `dags` directory in
the repo ([dag folder][dags]). Provider workflows are set up to pull metadata
about CC licensed images from the respective providers; the data pulled is
structured into a standardised format and written into a TSV (Tab Separated
Values) file locally. These TSV files are then loaded into S3 and finally into
the PostgreSQL DB by the loader workflow.

[dags]: https://github.com/creativecommons/cccatalog/tree/master/src/cc_catalog_airflow/dags
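
As a rough illustration of the "structured into a standardised format" step, a
provider script conceptually does something like the sketch below. The column
names, values and file name are invented for the example and are not the
project's actual TSV schema.

```python
import csv

# Hypothetical, simplified records as a provider script might assemble them;
# the real image TSV has more columns than are shown here.
records = [
    {
        "foreign_identifier": "12345",
        "image_url": "https://example.com/image.jpg",
        "license": "by",
        "license_version": "4.0",
    },
]

# Write the records as tab-separated values, one row per image.
with open("example_provider_20200715.tsv", "w", newline="") as tsv_file:
    writer = csv.DictWriter(
        tsv_file, fieldnames=list(records[0].keys()), delimiter="\t"
    )
    writer.writerows(records)
```
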
## Provider API workflow
The provider workflows are usually scheduled in one of two time frequencies,
daily or monthly.

Providers such as Flickr or Wikimedia Commons that are filtered using the date
parameter are usually scheduled for daily jobs. These providers have a large
volume of continuously changing data, and so daily updates are required to keep
the data in sync.

Providers that are scheduled for monthly ingestion are ones with a relatively
low volume of data, or for which filtering by date is not possible. This means
we need to ingest the entire collection at once. Examples are museum providers
like the [Science Museum UK][science_museum] or [Statens Museum for
Kunst][smk]. We don’t expect museum providers to change data on a daily basis.

[science_museum]: https://collection.sciencemuseumgroup.org.uk/
[smk]: https://www.smk.dk/

The scheduling of the DAGs by the scheduler daemons depends on a few
parameters.

- `start_date` - it denotes the starting date from which the task should begin
  running.
- `schedule_interval` - it denotes the interval between subsequent runs; it can
  be specified with Airflow keyword strings like “@daily”, “@weekly”,
  “@monthly” and “@yearly”, or with a cron expression.

Example: the Cleveland Museum is currently scheduled for a monthly crawl with a
starting date of `2020-01-15`; see [cleveland_museum_workflow][clm_workflow]
and the sketch below.

[clm_workflow]: https://github.com/creativecommons/cccatalog/blob/master/src/cc_catalog_airflow/dags/cleveland_museum_workflow.py
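
The snippet below only illustrates how that monthly schedule and start date
would be expressed in a DAG definition; it is a simplified sketch, not the
contents of the actual `cleveland_museum_workflow.py` linked above.

```python
from datetime import datetime

from airflow import DAG

# A monthly crawl whose first run is computed from 2020-01-15,
# mirroring the scheduling described above (illustrative only).
dag = DAG(
    dag_id="cleveland_museum_workflow",
    start_date=datetime(2020, 1, 15),
    schedule_interval="@monthly",
)
```
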
## Loader workflow
The data from the provider scripts is not directly loaded into S3. Instead, it
is stored in a TSV file on the local disk, and the tsv_postgres workflow
handles loading of data to S3, and eventually PostgreSQL. The DAG starts by
calling the task to stage the oldest TSV file from the output directory of the
provider scripts to the staging directory. Next, two tasks run in parallel: one
loads the TSV file in the staging directory to S3, while the other creates the
loading table in the PostgreSQL database. Once the data is loaded to S3 and the
loading table has been created, the data from S3 is loaded to the intermediate
loading table and then finally inserted into the image table. If loading from
S3 fails, the data is loaded to PostgreSQL from the locally stored TSV file.
When the data has been successfully transferred to the image table, the
intermediate loading table is dropped and the TSV files in the staging
directory are deleted. If copying the TSV files to S3 fails, those files are
moved to the failure directory for future inspection.
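
The ordering described above can be expressed as a task graph. The sketch
below is not the project's actual tsv_postgres DAG; it only illustrates, with
placeholder tasks and invented task ids, how the sequential and parallel
dependencies could be wired together (the S3-failure fallback and the failure
directory handling are omitted).

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    dag_id="tsv_postgres_sketch",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
)

# Placeholder tasks standing in for the real loader steps.
stage_oldest_tsv = DummyOperator(task_id="stage_oldest_tsv_file", dag=dag)
load_to_s3 = DummyOperator(task_id="load_tsv_to_s3", dag=dag)
create_loading_table = DummyOperator(task_id="create_loading_table", dag=dag)
load_s3_to_db = DummyOperator(task_id="load_s3_data_to_postgres", dag=dag)
drop_loading_table = DummyOperator(task_id="drop_loading_table", dag=dag)
delete_staged_tsv = DummyOperator(task_id="delete_staged_tsv", dag=dag)

# Staging runs first, then the S3 upload and the table creation run in
# parallel, then the data moves from S3 into the loading table and on to the
# image table, and finally the intermediate artefacts are cleaned up.
stage_oldest_tsv >> [load_to_s3, create_loading_table]
[load_to_s3, create_loading_table] >> load_s3_to_db
load_s3_to_db >> drop_loading_table >> delete_staged_tsv
```
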
<div style="text-align:center;">
<img src="loader_workflow.png" width="1000px"/>
<p> Loader workflow </p>
</div>