content/blog/entries/crawling-500-million/contents.lr
+7 −7 lines changed (7 additions, 7 deletions)
@@ -12,7 +12,7 @@ pub_date: 2020-08-14
---
body:
-### Background
+#### Background
The goal of [CC Search](https://search.creativecommons.org) is to index all of the Creative Commons works on the internet, starting with images. We have indexed over 500 million images, which, by [our last count](https://creativecommons.org/2018/05/08/state-of-the-commons-2017/), we believe is roughly 36% of all CC-licensed content on the internet. To further enhance the usefulness of our search tool, we recently started crawling and analyzing images for improved search results. This article will discuss the process of taking a paper design for a large-scale crawler, implementing it, and putting it into production, with a few idealized code snippets and diagrams along the way. The full source code can be viewed on [GitHub](https://github.com/creativecommons/image-crawler).
@@ -40,7 +40,7 @@ Any decent software engineer will consider existing options before diving into a
As a reminder, we want to eventually crawl every single Creative Commons work on the internet. Effective crawling is central to the capabilities that our search engine is able to provide. In addition to being central to achieving high quality image search, crawling could also be useful for discovering new Creative Commons content of any type on any website. In my view, that's a strong argument for spending some time designing a custom crawling solution where we have complete end-to-end control of the process, as long as the feature set is limited in scope. In the next section, we'll assess the effort required to build a crawler from the ground up.
-### Designing the crawler
+#### Designing the crawler
We know we're not going to be able to crawl 500 million images with one virtual machine and a single IP address, so it is obvious from the start that we are going to need a way to distribute the crawling and analysis tasks over multiple machines. A basic queue-worker architecture will do the job here; when we want to crawl an image, we can dispatch the URL to an inbound images queue, and a worker eventually pops that task out and processes it. Kafka will handle all of the hard work of partitioning and distributing the tasks between workers.
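To make that concrete, here is a minimal sketch of the dispatch side, assuming the kafka-python client; the `inbound_images` topic name and the message fields are placeholders rather than the crawler's actual schema.

```python
# Hypothetical dispatcher: push image URLs onto the inbound queue.
# Topic name and message shape are illustrative assumptions.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda msg: json.dumps(msg).encode("utf-8"),
)


def dispatch_image(url: str, identifier: str) -> None:
    """Queue a single image URL for crawling and analysis."""
    producer.send("inbound_images", {"url": url, "uuid": identifier})


dispatch_image("https://example.com/cat.jpg", "0c3aa0e9-example")
producer.flush()  # block until everything has really been handed to Kafka
```

Kafka's consumer groups then take care of spreading those messages across however many workers we choose to run.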
The worker processes do the actual analysis of the images, which essentially entails downloading the image, extracting interesting properties, and sticking the resulting metadata back into a Kafka topic for later downstream processing. The worker also has to include some instrumentation for conforming to rate limits and for reporting errors.
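Stripped of rate limiting, retries, and error reporting, the analysis loop might look something like the synchronous sketch below (the production worker is asynchronous); it assumes kafka-python, requests, and Pillow, and the topic names and metadata fields are again placeholders.

```python
# Simplified, synchronous sketch of the worker's analysis loop.
import io
import json

import requests
from kafka import KafkaConsumer, KafkaProducer
from PIL import Image

consumer = KafkaConsumer(
    "inbound_images",
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda msg: json.dumps(msg).encode("utf-8"),
)

for message in consumer:
    task = message.value
    response = requests.get(task["url"], timeout=10)
    image = Image.open(io.BytesIO(response.content))
    # Extract a few "interesting properties" and pass them downstream.
    metadata = {
        "uuid": task["uuid"],
        "resolution": image.size,
        "format": image.format,
        "status": response.status_code,
    }
    producer.send("image_metadata", metadata)
```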
@@ -196,7 +196,7 @@ class RateLimitedClientSession:
Meanwhile, the crawl monitor process is filling up each bucket every second.
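As a rough illustration of that replenishment loop (not the production implementation), the sketch below resets each source's token bucket to its per-second budget once a second, assuming the buckets are shared through Redis; the key scheme and the rate limits are invented for the example.

```python
# Hypothetical replenishment loop for per-source token buckets in Redis.
import asyncio

import redis.asyncio as redis

# Invented per-second budgets; the real limits come from each source's policy.
RATE_LIMITS = {"flickr.com": 100, "example.com": 5}


async def replenish_tokens(client: redis.Redis) -> None:
    while True:
        # Workers decrement these counters before each request and wait
        # whenever a bucket is empty, so topping the buckets up once per
        # second enforces each source's requests-per-second limit.
        await client.mset(
            {f"tokens:{source}": limit for source, limit in RATE_LIMITS.items()}
        )
        await asyncio.sleep(1)


asyncio.run(replenish_tokens(redis.Redis()))
```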
-#### Scheduling tasks (somewhat) intelligently
+##### Scheduling tasks (somewhat) intelligently
The final gotcha in the design of our crawler is that we want to crawl every single website at the same time, each at its prescribed rate limit. That sounds almost tautological, like something we should be able to take for granted after implementing all of this logic to keep the crawler from working too quickly, but it turns out that the crawler's own processing capacity is a limited and contended resource. We can only schedule so many tasks simultaneously on each worker, and we need to ensure that tasks from a single website aren't starving other sources of crawl capacity.
For instance, imagine that each worker is able to handle 5000 simultaneous crawling tasks, and every one of those tasks is tied to a tiny website with a very low rate limit. That means that our entire worker, which is capable of handling hundreds of crawl and analysis jobs per second, is stuck making one request per second until some faster tasks appear in the queue.
@@ -235,7 +235,7 @@ The one implementation detail to deal with here is that our workers can't draw f
*<center>A more complete diagram showing the system with a queue for each source</center>*
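To illustrate one way a worker can divide its capacity fairly (a toy policy, not necessarily the production scheduler's), the sketch below fills a worker's task slots from the per-source queues while capping any single source's share; the capacity figures are assumptions.

```python
# Toy fair-share scheduler over per-source asyncio queues.
import asyncio

MAX_CONCURRENT_TASKS = 5000   # assumed total capacity of one worker
MAX_PER_SOURCE = 1000         # assumed cap on any single source's share


async def schedule_loop(source_queues: dict, crawl) -> None:
    in_flight = {source: set() for source in source_queues}
    while True:
        for source, queue in source_queues.items():
            total = sum(len(tasks) for tasks in in_flight.values())
            # Round-robin: give each source a turn, but never let one source
            # occupy more than its share of the worker's task slots.
            while (
                total < MAX_CONCURRENT_TASKS
                and len(in_flight[source]) < MAX_PER_SOURCE
                and not queue.empty()
            ):
                url = queue.get_nowait()
                task = asyncio.create_task(crawl(source, url))
                in_flight[source].add(task)
                task.add_done_callback(in_flight[source].discard)
                total += 1
        await asyncio.sleep(0.1)
```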
-#### Designing for testability
+##### Designing for testability
It's quite difficult to test IO-heavy systems because they need to interact with lots of external dependencies. Oftentimes it is necessary to write complex integration tests or run manual tests to be certain that key functionality works as expected. That's no good, because integration tests are much more expensive to maintain and take far longer to execute. We certainly wouldn't go to production without running a smoke test to verify correctness in real-world conditions, but it's still critical to have unit tests in place for catching bugs quickly during development.
The main drawback of dependency injection is that initializing your objects takes a bit more ceremony. See the [initialization of the crawl scheduler](https://github.com/creativecommons/image-crawler/blob/00b59aba9a15faccf203a53d73a98e8c06cb69e8/worker/scheduler.py#L162) for an example of wiring up an object with many dependencies. You might also find that constructors and other functions end up with long argument lists if care isn't taken to bundle related external dependencies together. In my opinion, the price of a few extra lines of initialization code is well worth the benefits gained in testability and modularity.
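As a small illustration of the pattern (the class and parameter names here are hypothetical, not taken from the image-crawler codebase), a processing class built this way receives every external dependency through its constructor, so a unit test can hand it in-memory fakes instead of a real aiohttp session or Kafka producer:

```python
# Hypothetical example of constructor-based dependency injection.
class ImageProcessor:
    def __init__(self, http_session, metadata_producer, rate_limiter):
        # All external dependencies arrive here; nothing is constructed
        # internally, so tests can substitute a fake for each collaborator.
        self.http_session = http_session
        self.metadata_producer = metadata_producer
        self.rate_limiter = rate_limiter

    async def process(self, url: str) -> None:
        await self.rate_limiter.acquire(url)
        response = await self.http_session.get(url)
        image_bytes = await response.read()
        self.metadata_producer.send(self._analyze(url, image_bytes))

    @staticmethod
    def _analyze(url: str, image_bytes: bytes) -> dict:
        return {"url": url, "size_bytes": len(image_bytes)}
```

A unit test then only needs a fake session that returns canned bytes and a fake producer that records what it was asked to send.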
-### Smoke testing
+#### Smoke testing
Even with our unit test coverage, we still need to do some basic small-scale manual tests to make sure our assumptions hold up in the real world. We'll need to write [Terraform](https://www.terraform.io/) modules that provision a working version of the real system. Sadly, our Terraform infrastructure repository is private for now, but here's a taste of what the infra code looks like.
@@ -358,7 +358,7 @@ One `terraform plan` and `terraform apply` cycle later, we're ready to feed a fe
After fixing all of those issues and performing a larger smoke test, we're ready to start crawling on a large scale.
-### Monitoring the crawl
+##### Monitoring the crawl
Unfortunately, we can't just kick back and relax while the crawler does its thing for a few weeks. We need some transparency about what the crawler is doing so we can be alerted when something breaks.
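One lightweight way to get that transparency (a sketch of the general idea, not our exact instrumentation) is to have the monitor periodically emit a machine-readable status line that humans and alerting rules can both watch; the counters and field names below are invented for the example.

```python
# Hypothetical periodic status reporter for the crawl monitor.
import asyncio
import json
import logging
import time

log = logging.getLogger("crawl_monitor")


async def report_status(stats: dict, started: float, interval: float = 60.0) -> None:
    """Log a summary of crawl progress every `interval` seconds."""
    while True:
        elapsed = max(time.monotonic() - started, 1.0)
        log.info(json.dumps({
            "processed": stats.get("processed", 0),
            "errors": stats.get("errors", 0),
            "rate_per_second": round(stats.get("processed", 0) / elapsed, 2),
        }))
        await asyncio.sleep(interval)
```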
@@ -415,7 +415,7 @@ Here's an example log line from one of our smoke tests, indicating that we've cr
Now that we can see what the crawler is up to, we can schedule the larger crawl and start collecting production quality data.
-### Takeaways
+##### Takeaways
The result here is that we have a lightweight, modular, highly concurrent, and polite distributed image crawler with only a handful of lines of code.