title: Date-Partitioned Data Reingestion
---
categories:
airflow
cc-catalog
cc-search
open-source
product
---
author: mathemancer
---
pub_date: 2020-05-14
---
body:

CC Catalog is a project that gathers information about images from around the
internet, and stores the information so that these images can eventually be
indexed in [CC Search][cc_search]. A portion of the process is directed by
[Apache Airflow][airflow], which is a tool commonly used to organize workflows
and data pipelines.

In this blog post, we will explore the way in which we keep information we
gather about images up-to-date, using metadata pulled from the Flickr API as an
example case study.

[cc_search]: https://ccsearch.creativecommons.org/
[airflow]: https://airflow.apache.org/

## Apache Airflow, and the `execution_date` concept

Apache Airflow is open source software that loads Directed Acyclic Graphs (DAGs)
defined via Python files. The DAG is what defines a given workflow. The nodes
are tasks that need to be accomplished, and the directed edges of the graph
define dependencies between the various tasks.

A [DAG Run][dag_run_docs] is an 'execution' of the overall workflow defined by
the DAG, and is associated with an `execution_date`. Contrary to what one might
expect, `execution_date` does *not* mean the date when the workflow is executed,
but rather the date 'perspective' from which the workflow is executed. This
means one can give a command that instructs Airflow to execute the workflow
defined by a DAG as if the date were 2019-01-01, regardless of the actual date.

[dag_run_docs]: https://airflow.apache.org/docs/1.10.9/concepts.html#dag-runs
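
For illustration, here is a minimal sketch (a hypothetical DAG, not our actual
code) of a daily workflow in Airflow 1.10; Airflow passes the `execution_date`
into the task's context at run time:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def print_execution_date(execution_date, **kwargs):
    # Airflow injects `execution_date` into the task's context.
    print(f"Running from the perspective of {execution_date}")


dag = DAG(
    dag_id="example_daily_dag",  # hypothetical DAG id, for illustration only
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

task = PythonOperator(
    task_id="print_execution_date",
    python_callable=print_execution_date,
    provide_context=True,  # pass the context (incl. execution_date) as kwargs
    dag=dag,
)
```

One can then run something like `airflow trigger_dag -e 2019-01-01
example_daily_dag` to execute this workflow as if the date were 2019-01-01.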

## Our Use of `execution_date`

Much of the data contained in CC Catalog is pulled from various APIs on the
internet, and one strategy we use quite regularly is to make a request of the
form:

*"please give me all the metadata for photos uploaded to Flickr on 2019-01-01"*.

Since we're often requesting metadata about user-sourced content on 3rd-party
sites, some sort of `date_uploaded` parameter is often available for filtering
results in the API provided by the 3rd-party site. This allows us to partition
large datasets into more manageable pieces. It also leads naturally to the
strategy of requesting metadata for yesterday's photos, each day:

*"please give me all the metadata for photos uploaded to Flickr **yesterday**"*.

Doing this each day lets us keep the metadata in our catalog synced with the
upstream source (i.e., the Flickr API). This is where the `execution_date`
concept comes in. By default, a workflow that is scheduled to run daily uses
the previous day's date as its `execution_date`, so an execution that happens
on the actual date 2020-02-02 will have `execution_date` 2020-02-01 by
default. This matches up naturally with the strategy above, so we have a number
of workflows that ingest (meta)data into CC Catalog using this default
`execution_date` on a daily basis.
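
Concretely, a daily task might turn its `execution_date` into upload-date
bounds for the Flickr API. The sketch below is illustrative rather than our
production code, though `min_upload_date` and `max_upload_date` are real
filtering parameters of the Flickr `flickr.photos.search` method:

```python
from datetime import timedelta


def build_flickr_query_params(execution_date, api_key):
    # Request photos uploaded during the 24 hours of `execution_date`.
    start = execution_date.replace(hour=0, minute=0, second=0, microsecond=0)
    end = start + timedelta(days=1)
    return {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "min_upload_date": int(start.timestamp()),  # Unix timestamps
        "max_upload_date": int(end.timestamp()),
        "per_page": 500,
        "format": "json",
        "nojsoncallback": 1,
    }
```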

## Challenge: Data can go stale over time

There are some problems with the strategy outlined above:

- What if a photo changes upstream?
- What if a photo is deleted upstream?
- What about metadata that changes over time (e.g., 'views')?

Given we're only ingesting metadata about photos the day after they're uploaded,
we won't be able to capture the relevant data for any of these situations. So,
we need to reingest the metadata for images on some schedule over time.

## Reingestion Schedule

We would prefer to reingest the metadata for newer images more frequently, and
the metadata for older images less frequently. This is because we assume the
metadata for an image will be updated at the source in more interesting ways
while the image is still relatively new. For example, assume a picture is
viewed 100 times per month.

 month | total views | % increase
------:|------------:|-----------:
     1 |         100 |   infinite
     2 |         200 |       100%
     3 |         300 |        50%
     4 |         400 |        33%
     5 |         500 |        25%
     6 |         600 |        20%
     7 |         700 |        17%
     8 |         800 |        14%
     9 |         900 |        13%
    10 |        1000 |        11%
    11 |        1100 |        10%
    12 |        1200 |         9%

As we see, given consistent monthly views, the 'percent increase' of the total
views metric drops off as the picture ages (in reality, it appears that in most
cases, pictures are mostly viewed when they are new).
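
To spell out the arithmetic, the table above can be reproduced with a few lines
of throwaway Python (illustrative only, not part of the catalog code):

```python
import math

views_per_month = 100
total = 0
for month in range(1, 13):
    previous_total = total
    total += views_per_month
    if previous_total == 0:
        increase = "infinite"
    else:
        # Round half up, matching the table (e.g., 12.5% -> 13%).
        increase = f"{math.floor(100 * views_per_month / previous_total + 0.5)}%"
    print(f"{month:>6} | {total:>11} | {increase:>10}")
```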

Thus, it makes sense to focus more on keeping the information up-to-date for the
most recently uploaded images.

### Real numbers for Flickr

For Flickr, in the worst case, we can ingest about 100 dates' worth of uploaded
image metadata per day. This was calculated using the year 2016 as an example.
Because 2016 was around the peak for the number of images uploaded to Flickr per
day, the actual number of dates' worth of metadata we can ingest per day is
quite a bit higher, perhaps 150.

We'll need to choose around 150 dates for each daily run, and reingest the
metadata for all images uploaded on each of those dates. We want to choose
those dates preferring newer images (for the reasons outlined above), and choose
them so that if we follow the same date-choosing algorithm each daily run, we'll
eventually reingest the metadata for *all* images on some predictable schedule.

### Strategy to choose which dates to reingest

Assume we'll reingest metadata from some number `n` of dates on each daily run.
We set some maximum number of days `D` we're willing to wait between reingestion
of the data for a given image, subject to the constraint that we need to have
`n * D > T`, where `T` is the total number of dates for which data exists. For
Flickr, there are (at the time of this writing) about 6,000 dates on which
images were uploaded. If we set

- `n = 150`
- `D = 180`

then we have `n * D = 150 * 180 = 27,000 > 6,000`, as desired. In fact, there
is quite a bit of slack in this calculation. Keep in mind, however, that we add
one date's worth of metadata as each day passes in real time. Thus, we want to
keep some slack here. One option would be to reingest the metadata for each
image every 90 days, rather than every 180. This would still leave some slack,
and we'd have generally fresher data. This means that on each day, we'd ingest
metadata for photos uploaded on that date, as well as metadata for photos
uploaded

- 90, 180, 270, 360, ..., 13320, or 13410 days prior to the current date.
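
As a quick sanity check (throwaway arithmetic, not catalog code), we can verify
the constraint and generate that 90-day schedule:

```python
n = 150   # dates reingested per daily run
D = 180   # maximum days we're willing to wait between reingestions
T = 6000  # approximate total number of dates with Flickr uploads

assert n * D > T  # 150 * 180 = 27,000 > 6,000, so full coverage is possible

# The tighter 90-day option: 90, 180, ..., 13410 days prior (149 dates),
# which together with the current date uses all n = 150 dates per run.
ninety_day_schedule = [90 * k for k in range(1, n)]
print(ninety_day_schedule[0], ninety_day_schedule[-1])  # 90 13410
```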

This is better, but 90 days is still quite a long time to wait to reingest
metadata for a recently-uploaded photo. So, it'd be better to use the slack
available to reingest metadata for recently-uploaded photos more often, and back
off smoothly to reingest metadata for the oldest photos only once every 180
days. We ended up using a schedule where we ingest metadata for photos uploaded
on the current `execution_date`, as well as metadata for photos uploaded

- 1, 2, ..., 6, or 7 days prior;
- 14, 21, ..., 84, or 91 days prior;
- 106, 121, ..., 376, or 391 days prior;
- 421, 451, ..., 1081, or 1111 days prior;
- 1201, 1291, ..., 3181, or 3271 days prior; and
- 3451, 3631, ..., 10291, or 10471 days prior.

These lists can be generated using the following snippet:

```python
def get_reingestion_day_list_list(*args):
    """Generate lists of days-prior offsets from (interval, count) pairs.

    Each pair `(interval, count)` contributes `count` offsets spaced
    `interval` days apart, starting just past the span covered by all
    earlier pairs.
    """
    return [
        [
            # Offsets within this pair, shifted by the total number of
            # days covered by all preceding pairs.
            args[i][0] * (j + 1) + sum(arg[0] * arg[1] for arg in args[:i])
            for j in range(args[i][1])
        ]
        for i in range(len(args))
    ]

get_reingestion_day_list_list(
    (1, 7),
    (7, 12),
    (15, 20),
    (30, 24),
    (90, 24),
    (180, 40)
)
```

This function creates a list of lists of integers based on input pairs
describing which prior dates to ingest. An approximate interpretation of the
input pairs in this example would be

- Ingest data which is at most a week old daily.
- Ingest data which is between a week and three months old weekly.
- Ingest data which is between three months and a year old biweekly.
- Ingest data which is between one and three years old monthly.
- Ingest data which is between three and nine years old every three months.
- Ingest data which is between nine and twenty-eight years old every six months.
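
Calling the function as above returns the six lists shown earlier; a quick
check (illustrative only) confirms the totals:

```python
day_list_list = get_reingestion_day_list_list(
    (1, 7), (7, 12), (15, 20), (30, 24), (90, 24), (180, 40)
)

print([day_list[0] for day_list in day_list_list])
# [1, 14, 106, 421, 1201, 3451] -- the first entry of each list above

print(sum(len(day_list) for day_list in day_list_list))  # 127 prior dates
print(max(day_list_list[-1]))                            # 10471 days prior
```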

The astute reader will notice that these lists only define 128 dates (including
the current date) for which metadata should be reingested. We prefer to be a
bit conservative on the total amount we plan to ingest per day, since things can
happen that put the ingestion workflow DAG out of service for some time on
occasion.

So, using this strategy, we ensure that all metadata is updated at least once
every 6 months, with a preference towards metadata about images uploaded
recently. Because this schedule covers about 28.7 years back in time, this
strategy should suffice to reingest all relevant Flickr data for the next 12
years or so (the current date is 2020).
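
To spell out that arithmetic (Flickr launched in 2004, so its oldest photos are
roughly 16 years old as of this writing):

```python
years_covered = 10471 / 365.25           # ~28.7 years back in time
oldest_photo_age = 2020 - 2004           # Flickr launched in 2004
print(years_covered - oldest_photo_age)  # ~12.7 years of headroom
```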

For more context around what we've shown here, please take a look at
[the CC Catalog repo][cccatalog].

[cccatalog]: https://github.com/creativecommons/cccatalog/