[Alden](https://creativecommons.org/author/aldencreativecommons-org/) is Backend Software Engineer at Creative Commons. He is `@aldenpage` on the CC Slack.
content/blog/entries/crawling-500-million/contents.lr (+1 -1)
@@ -45,7 +45,7 @@ We know we're not going to be able to crawl 500 million images with one virtual

 The worker processes do the actual analysis of the images, which essentially entails downloading the image, extracting interesting properties, and sticking the resulting metadata back into a Kafka topic for later downstream processing. The worker will also have to include some instrumentation for conforming to rate limits and error reporting.

-We also know that we will need to share some information about crawl progress between worker processes, such as whether we've exceeded our proscribed rate limit for a website, the number of times we've seen a status code in the last minute, how many images we've processed so far, and so on. Since we're only interested in sharing application state and aggregate statistics, a lightweight key/value store like Redis seems like a good fit.
+We also know that we will need to share some information about crawl progress between worker processes, such as whether we've exceeded our prescribed rate limit for a website, the number of times we've seen a status code in the last minute, how many images we've processed so far, and so on. Since we're only interested in sharing application state and aggregate statistics, a lightweight key/value store like Redis seems like a good fit.

 Finally, we need a supervising process that centrally controls the crawl. This key governing process will be responsible for making sure our crawler workers are behaving properly by moderating crawl rates for each source, taking action in the face of errors, and reporting statistics to the operators of the crawler. We'll call this process the crawl monitor.
* **[Project Ideas](/internships/project-ideas/)**: This is a list of project ideas for our internships. Please note that these project ideas are shared between all current internship programs and all ideas may not be available for all programs. Please check the project description carefully.
content/internships/history/contents.lr (+31 -3)
@@ -6,10 +6,25 @@ title: Open Source Internships: History
 ---
 body:

+## Google Season of Docs
+
+CC has participated in Google Season of Docs for the first time in 2020.
+
+* [Google Season of Docs 2020](https://developers.google.com/season-of-docs/docs/participants)
+* [Open Source Blog posts](/blog/categories/gsod-2020/)
+* [Introducing Our Google Season of Docs 2020 Participants](https://creativecommons.org/2020/08/20/google-season-of-docs-2020/)
+
+### Relevant links
+
+* [Open Source Blog posts](/blog/categories/gsod/)
+
 ## Google Summer of Code

-CC has participated in Google Summer of Code eight times before 2020. See below for more information about our projects and participation for each year.
+CC has participated in Google Summer of Code seven times up to 2013, and twice since 2019.

+* [Google Summer of Code 2020](https://summerofcode.withgoogle.com/organizations/5742311717208064/)
+* [Open Source Blog posts](/blog/categories/gsoc-2020/)
+* [Welcome Our Interns from Google Summer of Code and Outreachy!](https://creativecommons.org/2020/05/11/welcome-interns-google-summer-of-code-and-outreachy/)
 * [Google Summer of Code 2019](https://summerofcode.withgoogle.com/archive/2019/organizations/5500455663173632/)
 * [Open Source Blog posts](/blog/categories/gsoc-2019/)
 * [CC + Google Summer of Code 2019](https://creativecommons.org/2019/03/04/cc-google-summer-of-code-2019/)
@@ -27,8 +42,21 @@ CC has participated in Google Summer of Code eight times before 2020. See below
 * [Google Summer of Code 2007](https://developers.google.com/open-source/gsoc/2007/#cc)
 * [Google Summer of Code 2006](https://developers.google.com/open-source/gsoc/2006/#cc)

+### Relevant links
+
+* [Open Source Blog posts](/blog/categories/gsoc/)
+
 ## Outreachy

-CC participated in the Dec 2019 to Mar 2020 round of Outreachy.
-* [Meet Our 2020 Interns From Outreachy](https://creativecommons.org/2019/12/10/2020-outreachy-interns/)
+CC participated in Outreachy twice since 2019.
+
+* [Outreachy May 2020 - August 2020](https://www.outreachy.org/outreachy-may-2020-internship-round/)
+* [Open Source Blog posts](/blog/categories/outreachy-2020/)
+* [Welcome Our Interns from Google Summer of Code and Outreachy!](https://creativecommons.org/2020/05/11/welcome-interns-google-summer-of-code-and-outreachy/)
+* [Outreachy December 2019 - March 2020](https://www.outreachy.org/december-2019-to-march-2020-internship-round/)
+* [Open Source Blog posts](/blog/categories/outreachy-2019-20/)
+* [Meet Our 2020 Interns From Outreachy](https://creativecommons.org/2019/12/10/2020-outreachy-interns/)
+
+### Relevant links
+
 * [Open Source Blog posts](/blog/categories/outreachy/)