
Commit d690fe7

Merge pull request creativecommons#336 from kss682/srinidhi-blogpost2
blog-post-2
2 parents 000f4c8 + 9edaa0e commit d690fe7

2 files changed: +105 -0 lines changed

@@ -0,0 +1,105 @@
title: Data flow: from API to DB
---
categories:

cc-catalog
airflow
gsoc
gsoc-2020
---
author: srinidhi
---
series: gsoc-2020-cccatalog
---
pub_date: 2020-07-22
---
body:

## Introduction

The CC Catalog project handles the flow of image metadata from the source or
provider and loads it to the database, which is then surfaced to the
[CC Search][CC_search] tool. Workflows are set up for each provider to gather
metadata about CC licensed images, and these workflows are handled with the
help of Apache Airflow. Airflow is an open source tool that helps us to
schedule and monitor workflows.

[CC_search]: https://ccsearch.creativecommons.org/about

## Airflow intro

Apache Airflow is an open source tool that helps us to schedule tasks and
monitor workflows. It provides an easy-to-use UI that makes managing tasks
simple. In Airflow, the tasks we want to schedule are organised into DAGs
(Directed Acyclic Graphs). A DAG consists of a collection of tasks and the
relationships defined among those tasks, so that they run in an organised
manner. DAG files are standard Python files that are loaded from the defined
`DAG_FOLDER` on a host. Airflow selects all the Python files in the
`DAG_FOLDER` that have a DAG instance defined globally, and executes them to
create the DAG objects.

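As a rough sketch (the DAG id, task names and callables below are hypothetical,
not taken from the CC Catalog repo), a file like this placed in the
`DAG_FOLDER` is all Airflow needs to discover a workflow; the important part is
the DAG object defined at module level:

```python
# dags/example_metadata_workflow.py -- a hypothetical, minimal DAG file.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def pull_metadata():
    """Placeholder for the real work a provider task would do."""
    print("pulling image metadata")


def write_tsv():
    """Placeholder for writing the standardised TSV output."""
    print("writing tsv")


# The DAG instance is defined at module (global) level so that Airflow
# can find it when it scans the DAG_FOLDER.
dag = DAG(
    dag_id="example_metadata_workflow",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
)

pull_task = PythonOperator(
    task_id="pull_metadata", python_callable=pull_metadata, dag=dag
)
write_task = PythonOperator(
    task_id="write_tsv", python_callable=write_tsv, dag=dag
)

# The relationship between the tasks: pull_task runs before write_task.
pull_task >> write_task
```
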
## CC Catalog Workflow

In the CC Catalog, Airflow is set up inside a Docker container along with other
services. The loader and provider workflows are inside the `dags` directory in
the repo ([dag folder][dags]). Provider workflows are set up to pull metadata
about CC licensed images from the respective providers; the data pulled is
structured into a standardised format and written into a TSV (Tab Separated
Values) file locally. These TSV files are then loaded into S3 and finally into
the PostgreSQL DB by the loader workflow.

[dags]: https://github.com/creativecommons/cccatalog/tree/dacb48d24c6ae9b532ff108589b9326bde0d37a3/src/cc_catalog_airflow/dags

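To make the TSV step concrete, here is a small sketch of writing standardised
provider records to a tab-separated file. The column names are illustrative
only; they are not the exact schema used by the catalog.

```python
# A sketch of writing standardised provider records to a TSV file.
# Column names here are illustrative, not the catalog's exact schema.
import csv

records = [
    {
        "foreign_identifier": "12345",
        "image_url": "https://example.com/image.jpg",
        "license": "by",
        "license_version": "4.0",
        "creator": "Example Creator",
        "title": "Example title",
    },
]

columns = [
    "foreign_identifier", "image_url", "license",
    "license_version", "creator", "title",
]

with open("example_provider_20200722.tsv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=columns, delimiter="\t")
    writer.writeheader()  # the real files may or may not carry a header row
    for record in records:
        writer.writerow(record)
```
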
## Provider API workflow

The provider workflows are usually scheduled at one of two frequencies, daily
or monthly.

Providers such as Flickr or Wikimedia Commons that can be filtered using a date
parameter are usually scheduled as daily jobs. These providers have a large
volume of continuously changing data, and so daily updates are required to keep
the data in sync.

Providers scheduled for monthly ingestion are ones with a relatively low volume
of data, or for which filtering by date is not possible, which means we need to
ingest the entire collection at once. Examples are museum providers like the
[Science Museum UK][science_museum] or [Statens Museum for Kunst][smk]. We
don't expect museum providers to change their data on a daily basis.

[science_museum]: https://collection.sciencemuseumgroup.org.uk/
[smk]: https://www.smk.dk/

The scheduling of the DAGs by the scheduler daemon depends on a few parameters;
a short sketch of how they are used follows the example below.

- ```start_date``` - denotes the date from which the task should begin running.
- ```schedule_interval``` - denotes the interval between subsequent runs; it
  can be specified with Airflow preset strings such as "@daily", "@weekly",
  "@monthly" and "@yearly", or with a cron expression.

Example: The Cleveland Museum is currently scheduled for a monthly crawl with a
start date of ```2020-01-15```. [cleveland_museum_workflow][clm_workflow]

[clm_workflow]: https://github.com/creativecommons/cccatalog/blob/dacb48d24c6ae9b532ff108589b9326bde0d37a3/src/cc_catalog_airflow/dags/cleveland_museum_workflow.py

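As a brief sketch, these two parameters appear in a DAG definition roughly like
this (the values mirror the Cleveland example, but the DAG id is hypothetical
and this is not the actual workflow file):

```python
from datetime import datetime

from airflow import DAG

# A monthly crawl starting on 2020-01-15. "@monthly" could equally be
# written as the cron expression "0 0 1 * *".
dag = DAG(
    dag_id="example_museum_workflow",
    start_date=datetime(2020, 1, 15),
    schedule_interval="@monthly",
)
```
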
## Loader workflow

The data from the provider scripts is not loaded directly into S3. Instead, it
is stored in a TSV file on the local disk, and the tsv_postgres workflow
handles loading the data to S3, and eventually to PostgreSQL. The DAG starts by
calling a task to stage the oldest tsv file from the output directory of the
provider scripts into the staging directory. Next, two tasks run in parallel:
one loads the tsv file in the staging directory to S3, while the other creates
the loading table in the PostgreSQL database. Once the data is loaded to S3 and
the loading table has been created, the data from S3 is loaded into the
intermediate loading table and then finally inserted into the image table. If
loading from S3 fails, the data is loaded to PostgreSQL from the locally stored
tsv file. When the data has been successfully transferred to the image table,
the intermediate loading table is dropped and the tsv files in the staging
directory are deleted. If copying the tsv files to S3 fails, those files are
moved to the failure directory for future inspection.

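The task dependencies described above could be wired roughly as follows; the
task ids are placeholders and this is not the actual tsv_postgres DAG, just an
outline of the ordering.

```python
# A sketch of the loader workflow's task ordering, using dummy tasks.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    dag_id="example_tsv_postgres_workflow",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,  # scheduling details omitted in this sketch
)

stage_oldest_tsv = DummyOperator(task_id="stage_oldest_tsv_file", dag=dag)
load_to_s3 = DummyOperator(task_id="load_tsv_to_s3", dag=dag)
create_loading_table = DummyOperator(task_id="create_loading_table", dag=dag)
load_s3_to_loading_table = DummyOperator(task_id="load_s3_to_loading_table", dag=dag)
insert_into_image_table = DummyOperator(task_id="insert_into_image_table", dag=dag)
drop_loading_table = DummyOperator(task_id="drop_loading_table", dag=dag)
delete_staged_tsv = DummyOperator(task_id="delete_staged_tsv", dag=dag)

# Staging runs first; the S3 upload and table creation then run in parallel.
stage_oldest_tsv >> [load_to_s3, create_loading_table]

# Both must finish before the data is copied into the loading table
# and finally into the image table.
load_to_s3 >> load_s3_to_loading_table
create_loading_table >> load_s3_to_loading_table
load_s3_to_loading_table >> insert_into_image_table

# Clean-up runs once the image table has been populated.
insert_into_image_table >> [drop_loading_table, delete_staged_tsv]
```
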
<div style="text-align:center;">
<img src="loader_workflow.png" width="1000px"/>
<p> Loader workflow </p>
</div>

## Acknowledgement

I would like to thank Brent Moran for helping me write this blog post.