Commit 183a50d: Add link to python docs
Parent: 94fb8f5

File tree: 1 file changed (+2 -2 lines)

content/blog/entries/crawling-500-million/contents.lr

Lines changed: 2 additions & 2 deletions

```diff
@@ -60,9 +60,9 @@ In the next section, we'll examine some of the key components that make up the c
 #### Detailed breakdown
 
 ##### Concurrency with `asyncio`
-Crawling is a massively IO bound task. The workers need to maintain lots of simultaneous open connections with internal systems like Kafka and Redis as well as 3rd party websites holding the target images. Once we have the image in memory, performing our actual analysis task is easy and cheap. For these reasons, an asynchronous approach seems more attractive than using multiple threads of execution. Even if our image processing task grows in complexity and becomes CPU bound, we can get the best of both worlds by offloading heavyweight tasks to a process pool.
+Crawling is a massively IO bound task. The workers need to maintain lots of simultaneous open connections with internal systems like Kafka and Redis as well as 3rd party websites holding the target images. Once we have the image in memory, performing our actual analysis task is easy and cheap. For these reasons, an asynchronous approach seems more attractive than using multiple threads of execution. Even if our image processing task grows in complexity and becomes CPU bound, we can get the best of both worlds by offloading heavyweight tasks to a process pool. See "[Running Blocking Code](https://docs.python.org/3/library/asyncio-dev.html#running-blocking-code)" in the `asyncio` docs for more details.
 
-Another reason that an asynchronous approach may be desirable is that we have several interlocking components which need to react to events in real-time: our crawl monitoring process needs to simultaneously control the rate limiting process and also interrupt crawling if errors are detected; our worker processes need to consume crawl events, process images, upload thumbnails, and produce metadata events. Coordinating all of these components through inter-process communication could be difficult, but breaking up tasks into small pieces and yielding to the event loop is comparatively easy.
+Another reason that an asynchronous approach may be desirable is that we have several interlocking components which need to react to events in real-time: our crawl monitoring process needs to simultaneously control the rate limiting process and also interrupt crawling if errors are detected, while our worker processes need to consume crawl events, process images, upload thumbnails, and produce events documenting the metadata of each image. Coordinating all of these components through inter-process communication could be difficult, but breaking up tasks into small pieces and yielding to the event loop is comparatively easy.
 
 ##### The resize task
 This is the most vital part of our crawling system: the part that actually does the work of fetching and processing an image. As established previously, we need to execute this task concurrently, so everything needs to be defined with `async`/`await` syntax to allow the event loop to multitask. The actual task itself is otherwise straightforward.
```
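The "process pool" idea from the first changed paragraph can be sketched with `asyncio`'s `loop.run_in_executor` and a `concurrent.futures.ProcessPoolExecutor`. This is a minimal illustration, not the post's actual worker code; `cpu_heavy_analysis` is a hypothetical stand-in for a CPU-bound step such as perceptual hashing.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


def cpu_heavy_analysis(image_bytes: bytes) -> int:
    # Hypothetical stand-in for a CPU-bound image analysis step.
    return sum(image_bytes) % 256


async def analyze(pool: ProcessPoolExecutor, image_bytes: bytes) -> int:
    loop = asyncio.get_running_loop()
    # Offload the heavyweight work to another process; the event loop
    # keeps servicing other connections while we await the result.
    return await loop.run_in_executor(pool, cpu_heavy_analysis, image_bytes)


async def main() -> list[int]:
    with ProcessPoolExecutor() as pool:
        return await asyncio.gather(*(analyze(pool, b"img") for _ in range(4)))


if __name__ == "__main__":
    print(asyncio.run(main()))  # prints [61, 61, 61, 61]
```

The `if __name__ == "__main__":` guard matters here: process pools on spawn-based platforms re-import the main module, and the guard prevents the script from re-executing itself in each worker.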
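The second changed paragraph, about interlocking components reacting to events in real time, can be illustrated with two coroutines sharing one event loop and an `asyncio.Event`. The names `crawl_worker` and `crawl_monitor` are illustrative only, and the monitor's "error detection" is simulated with a sleep.

```python
import asyncio


async def crawl_worker(stop: asyncio.Event, processed: list[int]) -> None:
    # Stand-in for: consume a crawl event, process the image,
    # upload the thumbnail, produce a metadata event.
    count = 0
    while not stop.is_set():
        processed.append(count)
        count += 1
        await asyncio.sleep(0)  # yield so sibling tasks can react in real time


async def crawl_monitor(stop: asyncio.Event) -> None:
    # Stand-in for watching error rates: after a short delay we
    # "detect" a problem and interrupt crawling.
    await asyncio.sleep(0.01)
    stop.set()


async def main() -> int:
    stop = asyncio.Event()
    processed: list[int] = []
    await asyncio.gather(crawl_worker(stop, processed), crawl_monitor(stop))
    return len(processed)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Because both coroutines run on the same loop, the monitor can interrupt the worker without any inter-process communication: the worker simply observes the shared `Event` the next time it yields.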
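The resize task described in the last paragraph can be sketched as a single `async def` in which every slow step is awaited. This is an assumption-laden outline, not the project's real implementation: `fetch_image` fakes an HTTP download with a sleep (a real worker would use an asynchronous HTTP client), and the "resize" is a placeholder byte slice.

```python
import asyncio


async def fetch_image(url: str) -> bytes:
    # Hypothetical stand-in for an HTTP GET; awaiting here frees the
    # event loop to service other downloads in the meantime.
    await asyncio.sleep(0.01)
    return b"fake image bytes"


async def resize_task(url: str) -> tuple[str, int]:
    image = await fetch_image(url)        # yields while the download runs
    thumbnail = image[: len(image) // 2]  # placeholder for real resizing
    await asyncio.sleep(0)                # e.g. await an async thumbnail upload
    return url, len(thumbnail)


async def main() -> list[tuple[str, int]]:
    urls = [f"https://example.com/{i}.jpg" for i in range(3)]
    # All three downloads overlap on a single thread of execution.
    return await asyncio.gather(*(resize_task(u) for u in urls))


if __name__ == "__main__":
    print(asyncio.run(main()))
```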
