
Commit 7fda046

Merge branch 'master' into add_search-roadmap
2 parents: e1a7f05 + 8455e7c

File tree: 26 files changed, +805 -435 lines

@@ -0,0 +1,9 @@
username: dhruvi16
---
name: Dhruvi Butti
---
md5_hashed_email: 4a38c3c8acacc9bc097b2a28580cca04
---
about:
[dhruvi16](https://dhruvi16.github.io/dhruvi16/) is a 3rd year undergrad at IIIT Surat. She is a FOSS enthusiast and very much into UX concepts.
@@ -0,0 +1,212 @@
title: Date-Partitioned Data Reingestion
---
categories:
airflow
cc-catalog
cc-search
open-source
product
---
author: mathemancer
---
pub_date: 2020-05-14
---
body:

CC Catalog is a project that gathers information about images from around the
internet, and stores the information so that these images can eventually be
indexed in [CC Search][cc_search]. A portion of the process is directed by
[Apache Airflow][airflow], which is a tool commonly used to organize workflows
and data pipelines.

In this blog post, we will explore the way in which we keep the information we
gather about images up-to-date, using metadata pulled from the Flickr API as an
example case study.

[cc_search]: https://ccsearch.creativecommons.org/
[airflow]: https://airflow.apache.org/

## Apache Airflow, and the `execution_date` concept

Apache Airflow is open-source software that loads Directed Acyclic Graphs (DAGs)
defined via Python files. The DAG is what defines a given workflow. The nodes
are the pieces of work that need to be accomplished, and the directed edges of
the graph define dependencies between the various pieces.

A [DAG Run][dag_run_docs] is an 'execution' of the overall workflow defined by
the DAG, and is associated with an `execution_date`. Contrary to what one might
expect, `execution_date` does *not* mean the date when the workflow is executed,
but rather the date 'perspective' from which the workflow is executed. This
means one can give a command that instructs Airflow to execute the workflow
defined by a DAG as if the date were 2019-01-01, regardless of the actual date.

[dag_run_docs]: https://airflow.apache.org/docs/1.10.9/concepts.html#dag-runs
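
To make the date 'perspective' concrete, here is a minimal, hypothetical DAG
(not the actual CC Catalog code; the DAG and task names are invented for
illustration) whose single task simply prints the `execution_date` it was given:

```python
# A sketch only: a daily DAG whose task prints the execution_date it was
# handed, regardless of the wall-clock date on which it actually runs.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def print_perspective(ds, **kwargs):
    # `ds` is the execution_date rendered as a YYYY-MM-DD string.
    print(f"Running with execution_date {ds}")


dag = DAG(
    dag_id="execution_date_demo",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

task = PythonOperator(
    task_id="print_perspective",
    python_callable=print_perspective,
    provide_context=True,  # pass `ds`, `execution_date`, etc. to the callable
    dag=dag,
)
```

One could then run something like
`airflow backfill -s 2019-01-01 -e 2019-01-01 execution_date_demo`
(Airflow 1.10 CLI) to execute the task from the 2019-01-01 perspective,
whatever today's date happens to be.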

## Our Use of `execution_date`

Much of the data contained in CC Catalog is pulled from various APIs on the
internet, and one strategy we use quite regularly is to make a request of the
form:

*"please give me all the metadata for photos uploaded to Flickr on 2019-01-01"*.

Since we're often requesting metadata about user-sourced content on 3rd-party
sites, some sort of `date_uploaded` parameter is often available for filtering
results in the API provided by the 3rd-party site. This allows us to partition
large data sets into more manageable pieces. It also leads naturally to the
strategy of requesting metadata for yesterday's photos, each day:

*"please give me all the metadata for photos uploaded to Flickr **yesterday**"*.

Doing this each day lets us keep the metadata in our catalog synced with the
upstream source (i.e., the Flickr API). This is where the `execution_date`
concept comes in. By default, a workflow that is scheduled to run daily uses
the previous day's date as its `execution_date`, so an execution that happens
on the actual date 2020-02-02 will have `execution_date` 2020-02-01 by
default. This matches up naturally with the strategy above, so we have a number
of workflows that ingest (meta)data into CC Catalog using this default
`execution_date` on a daily basis.
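
As a rough illustration of that daily request (this is not the actual CC
Catalog provider script, which also handles paging, retries, and storage), a
date-partitioned query against the public `flickr.photos.search` method might
look like the sketch below; the function name and the `extras` choices are
assumptions made for the example:

```python
from datetime import datetime, timedelta

import requests

FLICKR_ENDPOINT = "https://api.flickr.com/services/rest/"


def get_photos_uploaded_on(date_str, api_key):
    """Fetch one page of metadata for photos uploaded on `date_str` (YYYY-MM-DD)."""
    start = datetime.strptime(date_str, "%Y-%m-%d")
    end = start + timedelta(days=1)
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        # The date-partitioning part: only photos uploaded during `date_str`.
        "min_upload_date": int(start.timestamp()),
        "max_upload_date": int(end.timestamp()),
        "extras": "license,date_upload,views",
        "format": "json",
        "nojsoncallback": 1,
    }
    return requests.get(FLICKR_ENDPOINT, params=params).json()
```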

## Challenge: Data can go stale over time

There are some problems with the strategy outlined above:

- What if a photo changes upstream?
- What if a photo is deleted upstream?
- What about metadata that changes over time (e.g., 'views')?

Given we're only ingesting metadata about photos the day after they're uploaded,
we won't be able to capture the relevant data for any of these situations. So,
we need to reingest the metadata for images on some schedule over time.

## Reingestion Schedule

We would prefer to reingest the metadata for newer images more frequently, and
the metadata for older images less frequently. This is because we assume the
metadata for an image changes at the source in more interesting ways while the
image is still relatively new. For example, assume a picture is viewed 100
times per month.

month | total views | % increase
------:|------------:|-----------:
1 | 100 | infinite
2 | 200 | 100%
3 | 300 | 50%
4 | 400 | 33%
5 | 500 | 25%
6 | 600 | 20%
7 | 700 | 17%
8 | 800 | 14%
9 | 900 | 13%
10 | 1000 | 11%
11 | 1100 | 10%
12 | 1200 | 9%

As we see, given consistent monthly views, the 'percent increase' of the total
views metric drops off as the picture ages (in reality, it appears that in most
cases, pictures are mostly viewed when they are new).
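
The table above is just constant monthly views run through a percentage
calculation; a few lines of Python reproduce it and make the roughly
`1 / (month - 1)` decay explicit:

```python
from math import floor

for month in range(1, 13):
    total = 100 * month
    previous = total - 100
    if previous == 0:
        increase = "infinite"
    else:
        # Round half-up to match the table above (e.g. 12.5% -> 13%).
        increase = f"{floor(100 * (total - previous) / previous + 0.5)}%"
    print(f"{month:>5} | {total:>11} | {increase:>10}")
```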

Thus, it makes sense to focus more on keeping the information up-to-date for the
most recently uploaded images.

### Real numbers for Flickr

For Flickr, in the worst case, we can ingest about 100 dates' worth of uploaded
image metadata per day. This was calculated using the year 2016 as an example.
Because 2016 was around the peak for the number of images uploaded to Flickr per
day, the actual number of dates' worth of metadata we can ingest per day is
quite a bit higher, perhaps 150.

We'll need to choose around 150 dates for each daily run, and reingest the
metadata for all images uploaded on each of those dates. We want to choose
those dates preferring newer images (for the reasons outlined above), and choose
them so that if we follow the same date-choosing algorithm each daily run, we'll
eventually reingest the metadata for *all* images on some predictable schedule.

### Strategy to choose which dates to reingest

Assume we'll reingest metadata from some number `n` of dates on each daily run.
We set some maximum number of days `D` we're willing to wait between
reingestions of the data for a given image, subject to the constraint that we
need to have `n * D > T`, where `T` is the total number of dates for which data
exists. For Flickr, there are (at the time of this writing) about 6,000 dates
on which images were uploaded. If we set

- `n = 150`
- `D = 180`

then we have `n * D = 150 * 180 = 27,000 > 6,000`, as desired. In fact, there
is quite a bit of slack in this calculation. Keep in mind, however, that we add
one date's worth of metadata as each day passes in real time. Thus, we want to
keep some slack here. One option would be to reingest the metadata for each
image every 90 days, rather than every 180. This would still leave some slack,
and we'd have generally fresher data. This means that on each day, we'd ingest
metadata for photos uploaded on that date, as well as metadata for photos
uploaded

- 90, 180, 270, 360, ..., 13320, or 13410 days prior to the current date.
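
A quick sanity check of the `n * D > T` constraint and of this every-90-days
variant, using the approximate numbers above:

```python
n = 150     # dates' worth of metadata we can reingest per daily run
D = 180     # maximum days we're willing to wait between reingestions
T = 6_000   # approximate number of dates with Flickr uploads so far

assert n * D > T  # 27,000 > 6,000, so there is plenty of slack

# The every-90-days option: the current date plus 149 evenly spaced prior dates.
every_90_days = [90 * k for k in range(1, n)]
assert len(every_90_days) + 1 == n
assert every_90_days[-1] == 13410  # about 36.7 years back
```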

This is better, but 90 days is still quite a long time to wait to reingest
metadata for a recently-uploaded photo. So, it'd be better to use the slack
available to reingest metadata for recently-uploaded photos more often, and back
off smoothly to reingest metadata for the oldest photos only once every 180
days. We ended up using a schedule where we ingest metadata for photos uploaded
on the current `execution_date`, as well as metadata for photos uploaded

- 1, 2, ..., 6, or 7 days prior;
- 14, 21, ..., 84, or 91 days prior;
- 106, 121, ..., 376, or 391 days prior;
- 421, 451, ..., 1081, or 1111 days prior;
- 1201, 1291, ..., 3181, or 3271 days prior; and
- 3451, 3631, ..., 10291, or 10471 days prior.

These lists can be generated using the following snippet:

```python
def get_reingestion_day_list_list(*args):
    # Each input pair is (interval, count): generate `count` day-offsets spaced
    # `interval` days apart, continuing from where the previous pair leaves off.
    return [
        [
            args[i][0] * (j + 1) + sum(arg[0] * arg[1] for arg in args[:i])
            for j in range(args[i][1])
        ]
        for i in range(len(args))
    ]

get_reingestion_day_list_list(
    (1, 7),
    (7, 12),
    (15, 20),
    (30, 24),
    (90, 24),
    (180, 40)
)
```

This function creates a list of lists of integers based on input pairs
describing which prior dates to ingest. An approximate interpretation of the
input pairs in this example would be

- Ingest data which is at most a week old daily.
- Ingest data which is between a week and three months old weekly.
- Ingest data which is between three months and a year old biweekly.
- Ingest data which is between one and three years old monthly.
- Ingest data which is between three and nine years old every three months.
- Ingest data which is between nine and twenty-eight years old every six months.
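
A quick check of the generated schedule, assuming the function and arguments
from the snippet above are in scope, ties these interpretations back to
concrete numbers:

```python
day_lists = get_reingestion_day_list_list(
    (1, 7), (7, 12), (15, 20), (30, 24), (90, 24), (180, 40)
)

print(sum(len(day_list) for day_list in day_lists))  # 127 prior dates
print(day_lists[0])       # [1, 2, 3, 4, 5, 6, 7]
print(day_lists[-1][-1])  # 10471 days, i.e. roughly 28.7 years back
```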

The astute reader will notice that these lists only define 127 dates, or 128
including the current date, for which metadata should be reingested. We prefer
to be a bit conservative about the total amount we plan to ingest per day,
since things occasionally happen that put the ingestion workflow DAG out of
service for some time.

So, using this strategy, we ensure that all metadata is updated at least every
6 months, with a preference towards metadata about images uploaded recently.
Because this schedule covers about 28.7 years back in time, it should suffice
to reingest all relevant Flickr data for the next 12 years or so (the current
date is 2020).

For more context around what we've shown here, please take a look at
[the CC Catalog repo][cccatalog].

[cccatalog]: https://github.com/creativecommons/cccatalog/
@@ -0,0 +1,52 @@
title: Why Lektor?
---
categories:
outreachy
tech
open-source
---
author: dhruvi16
---
pub_date: 2020-05-15
---
body:

<div style="text-align: center;">
  <figure>
    <img src="cover.png"/>
    <figcaption>
      <em>
        why?why?why? <a href="https://creativecommons.org/publicdomain/zero/1.0/">(CC0 1.0)</a>
      </em>
    </figcaption>
  </figure>
</div>

I wrote this article while prepping for my Outreachy internship. I was trying to find out why Lektor is being used for the project, and I found some really convincing reasons.

Let's start with the basic types of websites:

Static website: Here the webpages are pre-built and uploaded to the server, and when a request is made they are served to the client as they are. These webpages can use HTML, CSS, and JavaScript. Static does not mean that these webpages are devoid of user interactivity; they can be highly reactive (because, of course, JS), but they are not generated on the server, and are static in that sense.

Dynamic website: Here the webpages are generated by the server. The server is connected to a database, and all the data is fetched from there. It uses server-side scripting and also client-side scripting (if necessary). It is called dynamic because the webpages are processed by the server.

So why go static? Because the vast, vast majority of websites will be read many more times than they will be updated. This is crucial because dynamic content does not come for free: it needs server resources, and because program code is running there, it needs to be kept up to date to ensure there are no security problems left unpatched. Also, when a website gets a sudden spike of traffic, a static website will stay up for longer on the same server than a dynamic one that needs to execute code.

There are certain drawbacks to static websites; here we will focus on those, and on how Lektor helps us combat them.

Drawbacks:

1. Multi-page websites are really hard to manage: updating the content can be a cumbersome task.
2. Lack of dynamic content.
3. Unavailability of an admin side of the website: this makes it hard for non-technical people to edit the website.

Lektor helps to combat all the shortcomings listed above. It is built using Node.js and Python and is very easy to understand and use.

1. The first drawback has a very common solution: templates. Lektor has a file structure that consists of content, models, and templates (a minimal layout is sketched after this list). This structure helps us manage multi-page websites without any cumbersome copy-pasting.
2. The second drawback can be overcome by using external services or in-house micro-services, which can easily be integrated into your Lektor project. This saves time and gives efficient results.
3. This is the most amazing thing that Lektor does, and combating this shortcoming makes Lektor stand out. Lektor takes inspiration from content management systems like WordPress and provides a flexible browser-based admin interface from which you can edit your website's contents. Unlike traditional CMS solutions, however, it runs entirely on your computer. This means you can give a Lektor website to people who have no understanding of programming, and they can still modify the content and update the website.
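
As referenced in the first point above, a minimal Lektor project layout looks roughly like this (a hypothetical sketch; the project and file names are invented for illustration):

```
example-site/
├── example-site.lektorproject   # project configuration
├── content/
│   └── contents.lr              # the actual page content, field by field
├── models/
│   └── page.ini                 # defines which fields a page has
└── templates/
    └── page.html                # Jinja2 template that renders the content
```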

Due to such capabilities, I find this framework amazing, and I am excited about using it. Though many points are left untouched in this blog, I will try to write a more in-depth post as I learn more during my internship.

Credits: This blog is inspired by [Static websites with Lektor](https://2016.ploneconf.org/talks/static-websites-with-lektor.html); for more details you can check out [Lektor](https://www.getlektor.com/).