Commit 68ce11b

Missing paren
1 parent 8e38b0c commit 68ce11b

File tree

1 file changed: +1 -1 lines changed


content/blog/entries/crawling-500-million/contents.lr

Lines changed: 1 addition & 1 deletion
@@ -201,7 +201,7 @@ The final gotcha in the design of our crawler is that we want to crawl every sin
 
 For instance, imagine that each worker is able to handle 5000 simultaneous crawling tasks, and every one of those tasks is tied to a tiny website with a very low rate limit. That means that our entire worker, which is capable of handling hundreds of crawl and analysis jobs per second, is stuck making one request per second until some faster tasks appear in the queue.
 
-In other words, we need to make sure that each worker process isn't jamming itself up with a single source. We have a [scheduling problem](https://en.wikipedia.org/wiki/Scheduling_(computing). We've naively implemented first-come-first-serve and need to switch to a different scheduling strategy.
+In other words, we need to make sure that each worker process isn't jamming itself up with a single source. We have a [scheduling problem](https://en.wikipedia.org/wiki/Scheduling_(computing)). We've naively implemented first-come-first-serve and need to switch to a different scheduling strategy.
 
 There are innumerable ways to address scheduling problems. Since there are only a few dozen sources in our system, we can get away with using a stupid scheduling algorithm: give each source equal capacity in every worker. In other words, if there are 5000 tasks to distribute and 30 sources, we can allocate 166 simultaneous tasks to each source per worker. That's plenty for our purposes. There are obvious drawbacks of this approach in that eventually there will be so many sources that we start starving high rate limit sources of work. We'll cross that bridge when we come to it; it's better to use the simplest possible approach we can get away with instead of spending all of our time on solving hypothetical future problems.

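The equal-capacity allocation described in the changed paragraph can be sketched roughly as follows. This is a minimal illustration, not the crawler's actual implementation; the function and parameter names are hypothetical.

```python
from collections import deque

def allocate_tasks(pending_by_source, worker_capacity):
    """Give every source an equal share of a worker's task slots.

    pending_by_source: dict mapping source name -> deque of pending tasks.
    worker_capacity: total simultaneous tasks one worker can run (e.g. 5000).
    Returns the list of tasks to start now; a source with a short backlog
    simply leaves its unused slots empty.
    """
    if not pending_by_source:
        return []
    # Equal split across all known sources: 5000 // 30 == 166 per source.
    per_source = worker_capacity // len(pending_by_source)
    scheduled = []
    for source, queue in pending_by_source.items():
        for _ in range(min(per_source, len(queue))):
            scheduled.append(queue.popleft())
    return scheduled
```

Because the split ignores how full each source's queue is, one slow, rate-limited source can never claim more than its fixed share of a worker, which is exactly the jam the paragraph is trying to avoid.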