title: Date-Partitioned Data Reingestion
---
categories:
airflow
cc-catalog
cc-search
open-source
product
---
author: mathemancer
---
pub_date: 2020-05-14
---
body:

CC Catalog is a project that gathers information about images from around the
internet, and stores the information so that these images can eventually be
indexed in [CC Search][cc_search]. A portion of the process is directed by
[Apache Airflow][airflow], which is a tool commonly used to organize workflows
and data pipelines.

In this blog post, we will explore the way in which we keep information we
gather about images up-to-date, using metadata pulled from the Flickr API as an
example case study.

[cc_search]: https://ccsearch.creativecommons.org/
[airflow]: https://airflow.apache.org/

## Apache Airflow, and the `execution_date` concept

Apache Airflow is open source software that loads Directed Acyclic Graphs (DAGs)
defined via Python files. The DAG is what defines a given workflow. The nodes
are pieces of jobs that need to be accomplished, and the directed edges of the
graph define dependencies between the various pieces.

A [DAG Run][dag_run_docs] is an 'execution' of the overall workflow defined by
the DAG, and is associated with an `execution_date`. Contrary to what one might
expect, `execution_date` does *not* mean the date when the workflow is executed,
but rather the date 'perspective' from which the workflow is executed. This
means one can give a command that instructs Airflow to execute the workflow
defined by a DAG as if the date were 2019-01-01, regardless of the actual date.

[dag_run_docs]: https://airflow.apache.org/docs/1.10.9/concepts.html#dag-runs

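To make this concrete, here is a minimal sketch of a daily DAG in the style of
Airflow 1.10 (not taken from the CC Catalog codebase; the DAG id, task id, and
callable name are purely illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def report_execution_date(execution_date, **kwargs):
    # `execution_date` is the date 'perspective' of this DAG Run, which is
    # not necessarily the date on which the task actually runs.
    print(f"Running with the perspective of {execution_date}")


dag = DAG(
    dag_id="execution_date_example",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

report_task = PythonOperator(
    task_id="report_execution_date",
    python_callable=report_execution_date,
    provide_context=True,  # Airflow 1.x: pass execution_date etc. to the callable
    dag=dag,
)
```

A DAG like this can then be run 'as of' any past date, regardless of the actual
date, for example via the Airflow 1.10 CLI:
`airflow trigger_dag -e 2019-01-01 execution_date_example`.
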
## Our Use of `execution_date`

Much of the data contained in CC Catalog is pulled from various APIs on the
internet, and one strategy we use quite regularly is to make a request of the
form:

*"please give me all the metadata for photos uploaded to Flickr on 2019-01-01"*.

Since we're often requesting metadata about user-sourced content on 3rd-party
sites, some sort of `date_uploaded` parameter is often available for filtering
results in the API provided by the 3rd-party site. This allows us to partition
large data-sets into more manageable pieces. It also leads naturally to the
strategy of requesting metadata for yesterday's photos, each day:

*"please give me all the metadata for photos uploaded to Flickr **yesterday**"*.

Doing this each day lets us keep the metadata in our catalog synced with the
upstream source (i.e., the Flickr API). This is where the `execution_date`
concept comes in. By default, a workflow which is scheduled to run daily uses
the previous day's date as its `execution_date`, and so an execution that
happens on the actual date 2020-02-02 will have `execution_date` 2020-02-01 by
default. This matches up naturally with the strategy above, so we have a number
of workflows that ingest (meta)data into CC Catalog using this default
`execution_date` on a daily basis.

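As a rough sketch of what such a daily request might look like (this is not the
actual CC Catalog ingestion code; `get_flickr_metadata_for_date` is a
hypothetical helper and `api_key` is a placeholder), the task boils down to
turning the `execution_date` into an upload-date window and asking the Flickr
API for everything inside it:

```python
from datetime import datetime, timedelta

import requests

FLICKR_API_ENDPOINT = "https://api.flickr.com/services/rest/"


def get_flickr_metadata_for_date(date_str, api_key):
    """Request metadata for photos uploaded to Flickr on `date_str`.

    `date_str` is an `execution_date`-style string such as '2020-02-01'.
    """
    start = datetime.strptime(date_str, "%Y-%m-%d")
    end = start + timedelta(days=1)
    response = requests.get(
        FLICKR_API_ENDPOINT,
        params={
            "method": "flickr.photos.search",
            "api_key": api_key,
            # The upload-date filters let us partition the data by date.
            "min_upload_date": start.strftime("%Y-%m-%d %H:%M:%S"),
            "max_upload_date": end.strftime("%Y-%m-%d %H:%M:%S"),
            "format": "json",
            "nojsoncallback": 1,
        },
    )
    return response.json()
```

The real workflow naturally needs more than this (pagination, license
filtering, error handling), but the date-partitioning idea is the same.
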
## Challenge: Data can go stale over time

There are some problems with the strategy outlined above:

- What if a photo changes upstream?
- What if a photo is deleted upstream?
- What about metadata that changes over time (e.g., 'views')?

Given we're only ingesting metadata about photos the day after they're uploaded,
we won't be able to capture the relevant data for any of these situations. So,
we need to reingest the metadata for images on some schedule over time.

## Reingestion Schedule

We would prefer to reingest the metadata for newer images more frequently, and
the metadata for older images less frequently. This is because we assume the
metadata for an image will be updated at the source in more interesting ways
while the image is still relatively new. For example, assume a picture is
viewed 100 times per month.

month | total views | % increase
------:|------------:|-----------:
1 | 100 | infinite
2 | 200 | 100%
3 | 300 | 50%
4 | 400 | 33%
5 | 500 | 25%
6 | 600 | 20%
7 | 700 | 17%
8 | 800 | 14%
9 | 900 | 13%
10 | 1000 | 11%
11 | 1100 | 10%
12 | 1200 | 9%

As we see, given consistent monthly views, the 'percent increase' of the total
views metric drops off as the picture ages (in reality, it appears that most
pictures receive the bulk of their views when they are new).

Thus, it makes sense to focus more on keeping the information up-to-date for the
most recently uploaded images.

### Real numbers for Flickr

For Flickr, in the worst case, we can ingest about 100 dates' worth of uploaded
image metadata per day. This was calculated using the year 2016 as an example.
Because 2016 was around the peak for the number of images uploaded to Flickr per
day, the actual number of dates' worth of metadata we can ingest per day is
quite a bit higher, perhaps 150.

We'll need to choose around 150 dates for each daily run, and reingest the
metadata for all images uploaded on each of those dates. We want to choose
those dates preferring newer images (for the reasons outlined above), and choose
them so that if we follow the same date-choosing algorithm each daily run, we'll
eventually reingest the metadata for *all* images on some predictable schedule.

### Strategy to choose which dates to reingest

Assume we'll reingest metadata from some number `n` of dates on each daily run.
We set some maximum number of days `D` we're willing to wait between reingestion
of the data for a given image, subject to the constraint that we need to have
`n * D > T`, where `T` is the total number of dates for which data exists. For
Flickr, there are (at the time of this writing) about 6,000 dates on which
images were uploaded. If we set

- `n = 150`
- `D = 180`

then we have `n * D = 150 * 180 = 27,000 > 6,000`, as desired. In fact, there
is quite a bit of slack in this calculation. Keep in mind, however, that we add
one date's worth of metadata as each day passes in real time. Thus, we want to
keep some slack here. One option would be to reingest the metadata for each
image every 90 days, rather than every 180. This would still leave some slack,
and we'd have generally fresher data. This means that on each day, we'd ingest
metadata for photos uploaded on that date, as well as metadata for photos
uploaded

- 90, 180, 270, 360, ..., 13320, or 13410 days prior to the current date.

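That evenly-spaced schedule is easy to enumerate (a quick illustrative snippet,
not part of the workflow code):

```python
# Reingest each image's metadata every 90 days: 149 prior dates, which
# together with the current date fits the budget of roughly 150 dates per day.
ninety_day_schedule = [90 * k for k in range(1, 150)]

print(ninety_day_schedule[:4])   # [90, 180, 270, 360]
print(ninety_day_schedule[-2:])  # [13320, 13410]
print(len(ninety_day_schedule))  # 149
```
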
This is better, but 90 days is still quite a long time to wait to reingest
metadata for a recently-uploaded photo. So, it'd be better to use the slack
available to reingest metadata for recently-uploaded photos more often, and back
off smoothly to reingest metadata for the oldest photos only once every 180
days. We ended up using a schedule where we ingest metadata for photos uploaded
on the current `execution_date`, as well as metadata for photos uploaded

- 1, 2, ..., 6, or 7 days prior;
- 14, 21, ..., 84, or 91 days prior;
- 106, 121, ..., 376, or 391 days prior;
- 421, 451, ..., 1081, or 1111 days prior;
- 1201, 1291, ..., 3181, or 3271 days prior; and
- 3451, 3631, ..., 10291, or 10471 days prior.

These lists can be generated using the following snippet:

```python
def get_reingestion_day_list_list(*args):
    # Each input pair is (gap_in_days, number_of_dates): e.g., (7, 12) yields
    # 12 dates spaced 7 days apart, starting just after the previous group.
    return [
        [
            args[i][0] * (j + 1) + sum(arg[0] * arg[1] for arg in args[:i])
            for j in range(args[i][1])
        ]
        for i in range(len(args))
    ]

get_reingestion_day_list_list(
    (1, 7),
    (7, 12),
    (15, 20),
    (30, 24),
    (90, 24),
    (180, 40)
)
```

This function creates a list of lists of integers based on input pairs
describing which prior dates to ingest. An approximate interpretation of the
input pairs in this example would be

- Ingest data which is at most a week old daily.
- Ingest data which is between a week and three months old weekly.
- Ingest data which is between three months and a year old biweekly.
- Ingest data which is between one and three years old monthly.
- Ingest data which is between three and nine years old every three months.
- Ingest data which is between nine and twenty-eight years old every six months.

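To see how these offsets become concrete reingestion dates, here is a small
follow-up sketch (an illustration assuming the `get_reingestion_day_list_list`
function above is in scope, not code from the DAG itself) that subtracts each
offset from a given `execution_date`:

```python
from datetime import date, timedelta

day_list_list = get_reingestion_day_list_list(
    (1, 7), (7, 12), (15, 20), (30, 24), (90, 24), (180, 40)
)

execution_date = date(2020, 5, 14)
reingestion_dates = [
    execution_date - timedelta(days=offset)
    for day_list in day_list_list
    for offset in day_list
]

# 7 + 12 + 20 + 24 + 24 + 40 = 127 prior dates, plus the current date itself.
print(len(reingestion_dates))  # 127
print(max(reingestion_dates))  # 2020-05-13 (1 day prior)
print(min(reingestion_dates))  # 1991-09-13 (10471 days prior)
```
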
The astute reader will notice that these lists define only 127 prior dates, or
128 dates including the current date, for which metadata should be reingested.
We prefer to be a bit conservative about the total amount we plan to ingest per
day, since the ingestion workflow DAG can occasionally be out of service for
some time.

So, using this strategy, we ensure that all metadata is updated at least every 6
months, with a preference towards metadata about images uploaded recently.
Because this schedule covers about 28.7 years back in time, this strategy should
suffice to reingest all relevant Flickr data for the next 12 years or so (the
current date is 2020).

For more context around what we've shown here, please take a look at
[the CC Catalog repo][cccatalog].

[cccatalog]: https://github.com/creativecommons/cccatalog/
