Commit 58bb6cc

formating changes
1 parent 4a315fd commit 58bb6cc

1 file changed: +70 -41 lines changed

content/blog/entries/data-flow-API-to-DB/contents.lr

title: Data flow: from API to DB
---
categories:
cc-catalog
airflow
gsoc
gsoc-2020
---
author: srinidhi
---
series: gsoc-2020-cccatalog
---
pub_date: 2020-07-15
---
body:
## Introduction

The CC Catalog project handles the flow of image metadata from the source or
provider and loads it into the database, which is then surfaced to the
[CC search][CC_search] tool. The workflows are set up for each provider to
gather metadata about CC licensed images. These workflows are handled with the
help of Apache Airflow. Airflow is an open source tool that helps us to
schedule and monitor workflows.

[CC_search]: https://ccsearch.creativecommons.org/about

## Airflow intro
Apache Airflow is an open source tool that helps us to schedule tasks and
monitor workflows. It provides an easy-to-use UI that makes managing tasks
easy. In Airflow, the tasks we want to schedule are organised in DAGs
(Directed Acyclic Graphs). DAGs consist of a collection of tasks and the
relationships defined among these tasks, so that they run in an organised
manner. DAG files are standard Python files that are loaded from the defined
`DAG_FOLDER` on a host. Airflow selects all the Python files in the
`DAG_FOLDER` that have a DAG instance defined globally, and executes them to
create the DAG objects.
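
As a minimal sketch, a DAG file placed in `DAG_FOLDER` might look like the
following. This assumes Airflow 1.10-era imports; the DAG id, task ids and
functions are made-up placeholders, not actual CC Catalog code.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def pull_metadata():
    # Placeholder for the work a provider task would actually do.
    print("pulling image metadata ...")


def write_tsv():
    # Placeholder for writing the pulled metadata to a TSV file.
    print("writing TSV ...")


# A DAG instance defined at module (global) level is what Airflow looks for
# when it scans the Python files in DAG_FOLDER.
dag = DAG(
    dag_id="example_provider_dag",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
)

pull_task = PythonOperator(
    task_id="pull_metadata",
    python_callable=pull_metadata,
    dag=dag,
)
write_task = PythonOperator(
    task_id="write_tsv",
    python_callable=write_tsv,
    dag=dag,
)

# The relationship between the tasks: pull the metadata first, then write it.
pull_task >> write_task
```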

## CC Catalog Workflow
In the CC catalog, Airflow is set up inside a Docker container along with other
services. The loader and provider workflows are inside the
[`dags` directory][dags] in the repo. Provider workflows are set up to pull
metadata about CC licensed images from the respective providers; the data
pulled is structured into a standardised format and written into a TSV
(Tab Separated Values) file locally. These TSV files are then loaded into S3
and finally into the PostgreSQL DB by the loader workflow.

[dags]: https://github.com/creativecommons/cccatalog/tree/master/src/cc_catalog_airflow/dags
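
As a rough illustration of the TSV step only, a provider script could write its
rows with tab-separated output like below. The column names here are
hypothetical; the real standardised format defines its own full set of fields.

```python
import csv

# Hypothetical subset of columns for illustration only.
row = {
    "foreign_identifier": "12345",
    "foreign_landing_url": "https://example.com/image/12345",
    "image_url": "https://example.com/image/12345.jpg",
    "license": "by",
    "license_version": "4.0",
}

# Write one header row and one data row, using tabs as the delimiter.
with open("provider_images.tsv", "w", newline="") as tsv_file:
    writer = csv.DictWriter(tsv_file, fieldnames=list(row), delimiter="\t")
    writer.writeheader()
    writer.writerow(row)
```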

## Provider API workflow
The provider workflows are usually scheduled in one of two time frequencies,
daily or monthly.

Providers such as Flickr or Wikimedia Commons that are filtered using the date
parameter are usually scheduled for daily jobs. These providers have a large
volume of continuously changing data, and so daily updates are required to keep
the data in sync.

Providers that are scheduled for monthly ingestion are ones with a relatively
low volume of data, or for which filtering by date is not possible. This means
we need to ingest the entire collection at once. Examples are museum providers
like the [Science Museum UK][science_museum] or the [Statens Museum for
Kunst][smk]. We don’t expect museum providers to change data on a daily basis.

[science_museum]: https://collection.sciencemuseumgroup.org.uk/
[smk]: https://www.smk.dk/

The scheduling of the DAGs by the scheduler daemons depends on a few
parameters.

- `start_date` - denotes the starting date from which the task should begin
  running.
- `schedule_interval` - denotes the interval between subsequent runs. It can be
  specified with Airflow keyword strings like `@daily`, `@weekly`, `@monthly`,
  or `@yearly`; apart from these, we can also schedule the interval using a
  cron expression.

Example: the Cleveland Museum is currently scheduled for a monthly crawl with a
starting date of `2020-01-15`: [cleveland_museum_workflow][clm_workflow]

[clm_workflow]: https://github.com/creativecommons/cccatalog/blob/master/src/cc_catalog_airflow/dags/cleveland_museum_workflow.py
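
Putting those two parameters together, a monthly schedule with that start date
would be declared roughly like this. The `dag_id` and task below are
placeholders, not the actual Cleveland Museum workflow code, and the imports
assume Airflow 1.10.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# A monthly crawl starting on 2020-01-15; dag_id and task are placeholders.
dag = DAG(
    dag_id="cleveland_museum_workflow_sketch",
    start_date=datetime(2020, 1, 15),
    schedule_interval="@monthly",  # a cron expression such as "0 0 15 * *" also works
)

crawl_task = DummyOperator(task_id="pull_image_data", dag=dag)
```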

## Loader workflow
The data from the provider scripts is not directly loaded into S3. Instead, it
is stored in a TSV file on the local disk, and the `tsv_postgres` workflow
handles loading of data to S3, and eventually PostgreSQL. The DAG starts by
calling the task to stage the oldest TSV file from the output directory of the
provider scripts to the staging directory. Next, two tasks run in parallel: one
loads the TSV file in the staging directory to S3, while the other creates the
loading table in the PostgreSQL database. Once the data is loaded to S3 and the
loading table has been created, the data from S3 is loaded to the intermediate
loading table and then finally inserted into the image table. If loading from
S3 fails, the data is loaded to PostgreSQL from the locally stored TSV file.
When the data has been successfully transferred to the image table, the
intermediate loading table is dropped and the TSV files in the staging
directory are deleted. If copying the TSV files to S3 fails, those files are
moved to the failure directory for future inspection.
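
The ordering described above can be sketched as DAG dependencies like so. The
task names are illustrative, not the actual operators in the `tsv_postgres`
workflow, and the failure-handling branch is omitted.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Illustrative task names only; the real workflow defines its own operators
# and also handles the failure paths described above.
dag = DAG(
    dag_id="tsv_postgres_sketch",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
)

stage_oldest_tsv = DummyOperator(task_id="stage_oldest_tsv_file", dag=dag)
load_tsv_to_s3 = DummyOperator(task_id="load_tsv_to_s3", dag=dag)
create_loading_table = DummyOperator(task_id="create_loading_table", dag=dag)
load_s3_to_image_table = DummyOperator(task_id="load_s3_data_to_image_table", dag=dag)
drop_loading_table = DummyOperator(task_id="drop_loading_table", dag=dag)
delete_staged_tsv = DummyOperator(task_id="delete_staged_tsv_file", dag=dag)

# Stage the oldest TSV first, then upload to S3 and create the loading table
# in parallel, then move the data into the image table, then clean up.
stage_oldest_tsv >> [load_tsv_to_s3, create_loading_table]
[load_tsv_to_s3, create_loading_table] >> load_s3_to_image_table
load_s3_to_image_table >> [drop_loading_table, delete_staged_tsv]
```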

<div style="text-align:center;">
<img src="loader_workflow.png" width="1000px"/>
<p> Loader workflow </p>
</div>
