title: Data flow: from API to DB
---
categories:

cc-catalog
airflow
gsoc
gsoc-2020
---
author: srinidhi
---
series: gsoc-2020-cccatalog
---
pub_date: 2020-07-13
---
body:
## Introduction
The CC Catalog project handles the flow of image metadata from the source, or provider, and loads it into a database,
which is then surfaced to the [CC Search](https://ccsearch.creativecommons.org/about) tool. A workflow is set up for
each provider to gather metadata about CC licensed images, and these workflows are handled with the help of Apache Airflow,
an open source tool that helps us schedule and monitor workflows.

## Airflow intro
Apache Airflow is an open source tool that helps us schedule tasks and monitor workflows. It provides an easy to use
UI that makes managing tasks simple. In Airflow, the tasks we want to schedule are organised in DAGs (Directed Acyclic Graphs).
A DAG consists of a collection of tasks and the relationships defined among these tasks, so that they run in an organised manner.
DAG files are standard Python files that are loaded from the defined ```DAG_FOLDER``` on a host.
Airflow selects all the Python files in the ```DAG_FOLDER``` that have a DAG instance defined globally, and executes them to
create the DAG objects.

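As a rough illustration, here is a minimal sketch of the kind of DAG file Airflow would pick up from the ```DAG_FOLDER```. The DAG id, task names and callables are made up for this example and are not taken from the CC Catalog repository.

```python
# Illustrative DAG file (not from the CC Catalog repo). Airflow loads this
# file from DAG_FOLDER and registers the DAG because a DAG instance is
# defined at the global (module) level.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def pull_metadata():
    print("pull metadata about CC licensed images from a provider API")


def write_tsv():
    print("write the standardised metadata to a local TSV file")


dag = DAG(
    dag_id="example_provider_workflow",
    start_date=datetime(2020, 1, 15),
    schedule_interval="@monthly",
)

pull_task = PythonOperator(task_id="pull_metadata", python_callable=pull_metadata, dag=dag)
write_task = PythonOperator(task_id="write_tsv", python_callable=write_tsv, dag=dag)

# The relationship between the two tasks: pull first, then write.
pull_task >> write_task
```

Because ```dag``` is defined as a global variable, the scheduler discovers it automatically when it scans the folder.
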
## CC Catalog Workflow
In the CC Catalog, Airflow is set up inside a Docker container along with other services. The loader and provider workflows are
inside the ```dags``` directory in the repo [insert link here]. Provider workflows are set up to pull metadata about CC licensed
images from the respective providers; the data pulled is structured into a standardised format and written into a TSV
(Tab Separated Values) file locally. These TSV files are then loaded into S3 and finally into the PostgreSQL DB by the loader
workflow.

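To make the "standardised format" idea concrete, here is a hedged sketch of how a provider script might append one record to a local TSV file. The column names and file name are a simplified, hypothetical subset chosen only for illustration, not the actual CC Catalog schema.

```python
# Illustrative only: writing one standardised record to a local TSV file.
# The columns are a simplified, hypothetical subset of the real schema.
import csv

record = {
    "foreign_landing_url": "https://example.org/photos/123",
    "image_url": "https://example.org/photos/123.jpg",
    "license": "by",
    "license_version": "4.0",
    "creator": "Example Creator",
    "title": "Example title",
    "provider": "exampleprovider",
}

with open("exampleprovider_20200713.tsv", "a", newline="") as tsv_file:
    writer = csv.writer(tsv_file, delimiter="\t")
    writer.writerow(record.values())
```
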
## Provider API workflow
The provider workflows are usually scheduled at one of two frequencies: daily or monthly.

Providers such as Flickr or Wikimedia Commons that can be filtered using a date parameter are usually scheduled for daily jobs.
These providers have a large volume of continuously changing data, so daily updates are required to keep the data in sync.

Providers that are scheduled for monthly ingestion are ones with a relatively low volume of data, or for which filtering by date
is not possible, which means we need to ingest the entire collection at once. Examples are museum providers like the
[Science Museum UK](https://collection.sciencemuseumgroup.org.uk/) or the [Statens Museum for Kunst](https://www.smk.dk/).
We don't expect museum providers to change their data on a daily basis.

The scheduling of the DAGs by the scheduler daemon depends on a few parameters:
- ```start_date``` - the date from which the task should begin running.
- ```schedule_interval``` - the interval between subsequent runs; it can be specified with Airflow preset strings such as
```@daily```, ```@weekly```, ```@monthly``` and ```@yearly```, or with a cron expression.

Example: the Cleveland Museum workflow, [cleveland_museum_workflow](https://github.com/creativecommons/cccatalog/blob/master/src/cc_catalog_airflow/dags/cleveland_museum_workflow.py), is currently scheduled for a monthly crawl with a start date of ```2020-01-15```.

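As a rough sketch of how those two parameters come together (the DAG id shown here is illustrative, not the actual Cleveland Museum workflow):

```python
# Illustrative scheduling sketch, not the real Cleveland Museum DAG.
from datetime import datetime

from airflow import DAG

dag = DAG(
    dag_id="example_museum_workflow",
    start_date=datetime(2020, 1, 15),   # first date the DAG is eligible to run
    schedule_interval="@monthly",       # or a cron expression such as "0 0 15 * *"
)
```
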
## Loader workflow
The data from the provider scripts is not directly loaded into S3. Instead, it is stored in a TSV file on the local disk, and
the tsv_postgres workflow handles loading the data to S3, and eventually to PostgreSQL. The DAG starts by calling a task to stage
the oldest TSV file from the output directory of the provider scripts to the staging directory. Next, two tasks run in parallel:
one loads the TSV file in the staging directory to S3, while the other creates the loading table in the PostgreSQL database.
Once the data is loaded to S3 and the loading table has been created, the data from S3 is loaded into the intermediate loading table
and then finally inserted into the image table. If loading from S3 fails, the data is loaded into PostgreSQL from the locally stored
TSV file. When the data has been successfully transferred to the image table, the intermediate loading table is dropped and the
TSV files in the staging directory are deleted. If copying the TSV files to S3 fails, those files are moved to the
failure directory for future inspection.

<div style="text-align:center;">
  <img src="loader_workflow.png" width="1000px"/>
  <p>Loader workflow</p>
</div>
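
In Airflow terms, the dependency shape described above might be wired up roughly as in the sketch below. The task names and callables are hypothetical stand-ins, not the actual tasks in the tsv_postgres workflow, and the failure-handling branch is omitted for brevity.

```python
# Hypothetical sketch of the loader DAG's task dependencies described above.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def placeholder(step):
    # Stand-in for the real loader logic; it only records which step ran.
    print("running step: {}".format(step))


dag = DAG(
    dag_id="example_tsv_postgres_workflow",
    start_date=datetime(2020, 1, 15),
    schedule_interval=None,  # the real loader's schedule may differ
)

stage_oldest_tsv = PythonOperator(
    task_id="stage_oldest_tsv",
    python_callable=placeholder,
    op_args=["stage the oldest TSV file"],
    dag=dag,
)
load_tsv_to_s3 = PythonOperator(
    task_id="load_tsv_to_s3",
    python_callable=placeholder,
    op_args=["copy the staged TSV file to S3"],
    dag=dag,
)
create_loading_table = PythonOperator(
    task_id="create_loading_table",
    python_callable=placeholder,
    op_args=["create the intermediate loading table"],
    dag=dag,
)
load_s3_to_postgres = PythonOperator(
    task_id="load_s3_to_postgres",
    python_callable=placeholder,
    op_args=["load data from S3 into the loading table and image table"],
    dag=dag,
)

# Staging first, then the S3 upload and table creation in parallel,
# then the final load into PostgreSQL.
stage_oldest_tsv >> [load_tsv_to_s3, create_loading_table]
[load_tsv_to_s3, create_loading_table] >> load_s3_to_postgres
```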