Set Common Search UI Demo as your Default Search Engine on Chrome (very early beta)
+
+
+
Click on the Chrome menu icon on the top right of your browser
+
Click Settings
+
Under the Search header click Manage search engines
+
Scroll to the bottom of the Other search engines section
+
You should see three fields to enter a new search engine.
+
+ Add a new search engine:
+ Enter, "Common Search"
+ Keyword:
+ Enter, "commonsearch.org"
+ URL with %s in place of query:
+ Enter, "https://uidemo.commonsearch.org/?g=en&q=%s"
+
Press enter on your keyboard
+
Scroll up and find Common Search under the same Other search engines section you are already in
+
Hover your mouse over the Common Search entry you just made, and click on the Make Default button.
+
Congrats! You are now using Common Search as your default search engine :)
+
+
+
-{% endblock %}
\ No newline at end of file
+{% endblock %}
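Note on the "URL with %s in place of query" field above: a minimal Python sketch of how such a template is expanded, assuming the browser simply URL-encodes the typed query and substitutes it for `%s`. The `expand_search_url` helper is hypothetical and only approximates the browser's behavior; the exact escaping Chrome applies may differ.

```python
from urllib.parse import quote_plus

# Template copied from the instructions above; "%s" marks where the query goes.
SEARCH_TEMPLATE = "https://uidemo.commonsearch.org/?g=en&q=%s"


def expand_search_url(template: str, query: str) -> str:
    """Return the template with "%s" replaced by the URL-encoded query."""
    # Hypothetical helper: approximates the browser's substitution step.
    return template.replace("%s", quote_plus(query))


print(expand_search_url(SEARCH_TEMPLATE, "open source search"))
```

Encoding the query before substitution matters: a raw `&` or space in the search terms would otherwise change the meaning of the URL.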
From 079a623450241c6681dd9ac15899e9af30bbdb2e Mon Sep 17 00:00:00 2001
From: Bakz
Date: Sun, 31 Jul 2016 18:43:01 -0700
Subject: [PATCH 02/18] modified readme
---
README.md | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/README.md b/README.md
index 16e2005..348e462 100644
--- a/README.md
+++ b/README.md
@@ -11,7 +11,7 @@ Running `cosr-about` on your local machine is very simple. You only need to have
Once Docker is launched, just run:
```
-make docker_devserver
+[sudo] make docker_devserver
```
Then open http://192.168.99.100:9701/ in your browser. (Replace "192.168.99.100" by the address of your Docker machine. On a Mac, you can get it with `docker-machine ip boot2docker`)
@@ -31,4 +31,4 @@ Then open http://localhost:9701/
This website uses [Pelican](http://blog.getpelican.com/) to statically generate a bunch of HTML files from [Markdown](http://commonmark.org/) source.
-We use a customized [Bootstrap 3](http://getbootstrap.com/) template. We welcome any contributions to make it prettier!
\ No newline at end of file
+We use a customized [Bootstrap 3](http://getbootstrap.com/) template. We welcome any contributions to make it prettier!
From cdde1ca408cc2bd78ac4250940eff183298887ab Mon Sep 17 00:00:00 2001
From: Sylvain Zimmer
Date: Tue, 2 Aug 2016 13:57:28 -0400
Subject: [PATCH 03/18] Replace IRC by Slack
---
README.md | 2 ++
content/pages/contact.md | 6 ++++--
2 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/README.md b/README.md
index 16e2005..b0454aa 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,7 @@
# cosr-about
+[](https://slack.commonsearch.org) [](LICENSE)
+
This repository contains our presentation website and blog hosted at https://about.commonsearch.org/.
You can use the [issues page](https://github.com/commonsearch/cosr-about) to suggest improvements to the content or layout!
diff --git a/content/pages/contact.md b/content/pages/contact.md
index 799f2ed..8c0505d 100644
--- a/content/pages/contact.md
+++ b/content/pages/contact.md
@@ -6,9 +6,11 @@ Template: page_contribute
You can send us an email to [contact@commonsearch.org](mailto:contact@commonsearch.org)
-## IRC
+## Slack
-To talk with us directly you can join [irc://irc.freenode.net/commonsearch](https://webchat.freenode.net/?channels=#commonsearch)
+To chat with us directly you can join our [Slack channel](https://slack.commonsearch.org) by clicking on the button below:
+
+
## Press
From 83d4b92a396324110ac7f8154a1f7fbb9dfc8c3c Mon Sep 17 00:00:00 2001
From: Sylvain Zimmer
Date: Tue, 2 Aug 2016 18:35:48 -0400
Subject: [PATCH 04/18] Fix GitHub link
---
...ur-first-public-datasets-host-level-webgraph-and-pagerank.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/content/blog/2016/07/our-first-public-datasets-host-level-webgraph-and-pagerank.md b/content/blog/2016/07/our-first-public-datasets-host-level-webgraph-and-pagerank.md
index ea4f49d..78ab519 100644
--- a/content/blog/2016/07/our-first-public-datasets-host-level-webgraph-and-pagerank.md
+++ b/content/blog/2016/07/our-first-public-datasets-host-level-webgraph-and-pagerank.md
@@ -63,7 +63,7 @@ We encourage everyone to analyze these datasets and **report their findings publ
Here is a non-exhaustive list of interesting areas to explore:
- - Spam! We have opened an [issue in GitHub]() to track the progress. Join the fight!
+ - Spam! We have opened an [issue in GitHub](https://github.com/commonsearch/cosr-back/issues/52) to track the progress. Join the fight!
- Analyzing correlations between the PageRank dataset and other public domain rankings (Top sites in Alexa, ...)
- Using the WebGraph dataset to create other metrics, for instance: [Centrality](https://en.wikipedia.org/wiki/Centrality), [CheiRank](https://en.wikipedia.org/wiki/CheiRank), [HITS](https://en.wikipedia.org/wiki/HITS_algorithm), [SALSA](https://en.wikipedia.org/wiki/SALSA_algorithm), ...
- Review our [Python code on GitHub](https://github.com/commonsearch/cosr-back) to look for bugs and speed improvements!
From 093f0a3d1a066a0db1b71d2c6a5b3e635c5b82a1 Mon Sep 17 00:00:00 2001
From: Sylvain Zimmer
Date: Sun, 7 Aug 2016 16:44:33 -0400
Subject: [PATCH 05/18] Add Tor hidden service and publish August 2016 update
---
.../08/state-of-common-search-august-2016.md | 54 +++++++++++++++++++
content/pages/contact.md | 4 ++
content/pages/privacy.md | 3 +-
3 files changed, 60 insertions(+), 1 deletion(-)
create mode 100644 content/blog/2016/08/state-of-common-search-august-2016.md
diff --git a/content/blog/2016/08/state-of-common-search-august-2016.md b/content/blog/2016/08/state-of-common-search-august-2016.md
new file mode 100644
index 0000000..a848969
--- /dev/null
+++ b/content/blog/2016/08/state-of-common-search-august-2016.md
@@ -0,0 +1,54 @@
+Title: State of Common Search - August 2016
+Slug: state-of-common-search-august-2016
+Date: 2016-08-07 15:42:35
+Author: commonsearch
+
+There has been a lot happening at [Common Search](https://about.commonsearch.org/) lately: we published our first datasets, opened a Tor hidden service, switched to Slack and started doing automated UI tests!
+
+
+
+## First datasets
+
+Last week, we published our [first two datasets](https://about.commonsearch.org/2016/07/our-first-public-datasets-host-level-webgraph-and-pagerank/): a host-level web graph and a host-level list of PageRanks.
+
+We plan to release even more datasets in the future to keep our service as transparent as we can, so feel free to test them, analyze the data and suggest improvements!
+
+
+## Tor hidden service
+
+We use CloudFlare as a CDN, but we explicitly whitelisted Tor users so they should not have any issue connecting to our [UI Demo](https://uidemo.commonsearch.org/).
+
+However, some users may want to access Common Search directly through Tor for better [privacy](/privacy), which is why we just opened a [Tor hidden service](https://www.torproject.org/docs/hidden-services.html.en) with a [.onion](https://en.wikipedia.org/wiki/.onion) address:
+
+**[http://comsearchl2zlnre.onion](http://comsearchl2zlnre.onion)**
+
+Please report any issues connecting to the service!
+
+
+## Slack channel
+
+Due to low traffic in our IRC channel, we have switched to [Slack](https://slack.commonsearch.org). Though it is closed source, we feel it is a better, pragmatic choice for allowing new contributors to join the community easily.
+
+To join, just click on the button below:
+
+
+
+If you still prefer IRC, Slack has a gateway that you can use with a regular IRC client. Register first with the button above and then follow their [instructions](https://get.slack.help/hc/en-us/articles/201727913-Connecting-to-Slack-over-IRC-and-XMPP).
+
+## Automated UI tests
+
+It is important that our [Frontend](/developer/frontend) behaves the same way regardless of the browser people use. To make sure it stays that way, we started doing [automated tests](https://github.com/commonsearch/cosr-front/tree/master/tests) with the [webdriver.io](http://webdriver.io/) project.
+
+We are using [Sauce Labs](https://saucelabs.com/) to run the tests in many different browsers and operating systems. You can even see our [latest builds](https://saucelabs.com/open_sauce/user/commonsearch) with full in-browser replays. As an example, check out how Common Search behaves on [Windows 7 with IE 10](https://saucelabs.com/beta/tests/1a42c35a7f4d41d59c613b0c60d7ed54/commands)!
+
+
+## Community
+
+So far we've had **16** contributors, **14** users who submitted issues, **28** who commented on an issue, and **141** who starred [one of our repositories](https://github.com/commonsearch). A **very big thanks** to them all! We hope you too will [join our growing community](/contributing)!
+
+
+## What's next?
+
+Our #1 goal remains to grow the community with better documentation, an easier setup for [new contributors](/contributing) and a top-notch first-time experience.
+
+In the next few weeks, we will also update the index of the [UI Demo](https://uidemo.commonsearch.org/) with many more domains and get one step closer to a useful service for everyone!
diff --git a/content/pages/contact.md b/content/pages/contact.md
index 8c0505d..55d0fbc 100644
--- a/content/pages/contact.md
+++ b/content/pages/contact.md
@@ -12,6 +12,10 @@ To chat with us directly you can join our [Slack channel](https://slack.commonse
+## IRC
+
+Slack has an IRC gateway that you can use with a regular IRC client. Register first with the button above and then follow their [instructions](https://get.slack.help/hc/en-us/articles/201727913-Connecting-to-Slack-over-IRC-and-XMPP).
+
## Press
Though we are not looking for mainstream press until we have something to show, please reach out to [contact@commonsearch.org](mailto:contact@commonsearch.org) as well!
diff --git a/content/pages/privacy.md b/content/pages/privacy.md
index 6b97e89..9d9535c 100644
--- a/content/pages/privacy.md
+++ b/content/pages/privacy.md
@@ -22,7 +22,8 @@ The only information we (and our CDN) receive for each search are:
We currently do not store any of this, though we will probably start logging aggregate query volume (without IP addresses) to understand usage and provide autocompletion.
-We will also open a [Tor hidden service](https://en.wikipedia.org/wiki/Tor_(anonymity_network)#Hidden_services) so that our users can safely mask their IP address. See [cosr-ops#8](https://github.com/commonsearch/cosr-ops/issues/8).
+We also have a [Tor hidden service](https://en.wikipedia.org/wiki/Tor_(anonymity_network)#Hidden_services) so that our users can safely mask their IP address: [http://comsearchl2zlnre.onion](http://comsearchl2zlnre.onion)
+
## Why you can trust us
From f716c852a86e8a874236b7aa236fdb52815621ba Mon Sep 17 00:00:00 2001
From: Sylvain Zimmer
Date: Thu, 11 Aug 2016 17:34:02 +0200
Subject: [PATCH 06/18] Add new tutorial, beta for now
---
content/images/developer/README | 1 +
.../tutorials/spark-backlinks-pipeline.svg | 2 +
.../analyzing-the-web-with-spark-on-ec2.md | 143 ++++++++++++++++++
3 files changed, 146 insertions(+)
create mode 100644 content/images/developer/README
create mode 100644 content/images/developer/tutorials/spark-backlinks-pipeline.svg
create mode 100644 content/pages/developer/tutorials/analyzing-the-web-with-spark-on-ec2.md
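Note on the `--source` and `--plugin` arguments used in the tutorial: they all read as `name:key=value,key=value`. A minimal Python sketch of that format, assuming options are comma-separated `key=value` pairs after the first colon. The `parse_spec` helper is hypothetical; the actual parser in cosr-back may differ, for instance in how it treats values that contain commas.

```python
from typing import Dict, Tuple


def parse_spec(spec: str) -> Tuple[str, Dict[str, str]]:
    """Split "name:key=value,key=value" into a name and an options dict."""
    # Split on the first colon only: values such as "s3a://bucket/" contain colons too.
    name, _, raw_options = spec.partition(":")
    options: Dict[str, str] = {}
    for pair in filter(None, raw_options.split(",")):
        key, _, value = pair.partition("=")
        options[key] = value
    return name, options


name, options = parse_spec("plugins.filter.Domains:skip=1,domains=tumblr.com wordpress.com")
print(name, options)
# The domains value is a space-separated list, which is why the shell argument needs quotes.
print(options["domains"].split())
```

Splitting on the first colon only keeps S3 paths such as `path=s3a://my-spark-results/top_wikipedia.txt` intact.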
diff --git a/content/images/developer/README b/content/images/developer/README
new file mode 100644
index 0000000..5b632e0
--- /dev/null
+++ b/content/images/developer/README
@@ -0,0 +1 @@
+The SVG diagrams are generated with https://www.draw.io/
\ No newline at end of file
diff --git a/content/images/developer/tutorials/spark-backlinks-pipeline.svg b/content/images/developer/tutorials/spark-backlinks-pipeline.svg
new file mode 100644
index 0000000..4f68673
--- /dev/null
+++ b/content/images/developer/tutorials/spark-backlinks-pipeline.svg
@@ -0,0 +1,2 @@
+
+
\ No newline at end of file
diff --git a/content/pages/developer/tutorials/analyzing-the-web-with-spark-on-ec2.md b/content/pages/developer/tutorials/analyzing-the-web-with-spark-on-ec2.md
new file mode 100644
index 0000000..0043e98
--- /dev/null
+++ b/content/pages/developer/tutorials/analyzing-the-web-with-spark-on-ec2.md
@@ -0,0 +1,143 @@
+Title: Tutorial: Analyzing the web with Spark on EC2
+Slug: developer/tutorials/analyzing-the-web-with-spark-on-ec2
+Template: page_developer
+
+This tutorial get you through all the steps required to analyze a large number of web pages with Spark on EC2 using our [Backend](/developer/backend).
+
+Common Search has a plugin system that allows developers to build their own processing pipeline. For this example, we will run a plugin that dumps a list of backlinks to a specific domain.
+
+Common Search uses the same pipeline to index the web in Elasticsearch, so you can follow the same steps to do any operation on the document sources we support.
+
+An [Amazon Web Services](https://aws.amazon.com/) account is needed for this tutorial, though depending on the volume you could run the pipeline entirely locally or on another cloud provider.
+
+
+## 1. Install cosr-back and cosr-ops on your local machine
+
+We have Docker containers ready to use so this step should take you only a few minutes!
+
+You can follow the instructions in [cosr-back/INSTALL.md](https://github.com/commonsearch/cosr-back/blob/master/INSTALL.md) (which contains the Python code that analyzes the documents) and then [cosr-ops/INSTALL.md](https://github.com/commonsearch/cosr-ops/blob/master/INSTALL.md) (which contains the tools to manage operations and infrastructure).
+
+
+## 2. Understand how document sources and plugins work
+
+You should view this process as a data pipeline with some document sources as input, and any number of plugins that can perform operations on the documents.
+
+In this tutorial we will use one document source (Common Crawl) and two plugins (one to filter documents, and one to dump our list of backlinks).
+
+For this example, let's collect all links to Wikipedia pages, except those coming from Blogpost or Tumblr. This is how the pipeline looks:
+
+[](/images/developer/tutorials/spark-backlinks-pipeline.svg)
+
+
+## 3. Do a test run on your local machine
+
+It is very useful to test your pipeline on a few documents on your local machine before scaling up to billions of documents!
+
+Open a console in the `cosr-back` repository you just installed, and run:
+
+```
+make docker_shell
+```
+
+This will take you inside a Docker container that already has all the dependencies you will need.
+
+Now you need to build the command line for your job. Let's understand how it is structured first:
+
+```
+spark-submit [spark_options] jobs/spark/pipeline.py --source [source_options] --plugin [plugin_options] [other_pipeline_options]
+```
+
+We don't need many Spark options in local, let's just use `--verbose`.
+
+We are using Common Crawl as a source, let's limit ourselves to 8 segments of 1000 documents each for this test run. This is done with `--source commoncrawl:limit=8,maxdocs=1000`. This will use the latest available version of Common Crawl.
+
+We are using 2 different plugins. The first is `plugins.filter.Domains`: it blacklists some domains we want to skip. You can configure it with `--plugin "plugins.filter.Domains:skip=1,domains=tumblr.com wordpress.com"` (note the quotes! we need them because the plugin argument includes a space).
+
+The second plugin is `plugin.hyperlinks.MostExternallyLinkedPages`. It will output a list of pages on a specific domain that have the most backlinks in the document sources we are processing. Let's configure it like this: `--plugin plugin.hyperlinks.MostExternallyLinkedPages:domain=wikipedia.org,path=out/top_wikipedia.txt`.
+
+Finally, let's add another useful option to our job: `--stop_delay 600`. This will prevent Spark from exiting for 10 minutes when your job is done, so that we have time to open the Spark Web UI and see what happened!
+
+Putting it all together, the command you need to run is:
+
+```
+spark-submit --verbose jobs/spark/pipeline.py --source commoncrawl:limit=8,maxdocs=1000 --plugin "plugins.filter.Domains:skip=1,domains=tumblr.com wordpress.com" --plugin plugin.hyperlinks.MostExternallyLinkedPages:domain=wikipedia.org,path=out/top_wikipedia.txt --stop_delay 600
+```
+
+This is what you should get as output:
+
+XXXX
+
+Spark has a very convenient Web UI that lets you debug the jobs that are running. Go ahead and open it in [http://localhost:4040](http://localhost:4040).
+
+Once all the jobs are done and you have finished exploring the Spark UI, go back to the console and send a `Ctrl-C` to the command to interrupt the `stop_delay`.
+
+Now you can open the file `out/top_wikipedia.txt` that should have been created and make sure it is what you want!
+
+
+## 4. Create a Spark cluster on EC2
+
+Now, Common Crawl typically has 20,000+ segments of 50,000+ documents each so processing each document on your local machine will take a while. Let's move to the cloud!
+
+We are going to deploy a Spark cluster on a fleet of [Spot Instances on EC2](https://aws.amazon.com/ec2/spot/). Spot Instances are ideal for this kind of data processing: they cost much less than regular instances and if they are killed during our job, we can just run it again!
+
+To deploy our Spark cluster on EC2 we are using a tool called [Flintrock](https://github.com/nchammas/flintrock). All it requires is a YAML configuration file.
+
+There is a file in the `cosr-ops` repository called `configs/flintrock.yaml.template`. Rename it to `configs/flintrock.yaml` and change it with the values that match your AWS account.
+
+We should now choose an instance type and a number of machines in our cluster. Our pipeline is usually CPU-bound, so the most important metric will be the number of cores. The [C4 instance family](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/c4-instances.html) has the best CPU performance/cost ratio, and it usually takes around 15 minutes for a C4 core to index a full Common Crawl segment.
+
+For this example, this means that if we want to process the whole Common Crawl, we will need (15/60) * 20000 = 5000 CPU hours. If we want to run it in 24 hours, we will need 6 `c4.8xlarge` instances with 36 CPUs each. You could also do it in just a couple hours with more instances, but you might need to ask for a raise in EC2 limits to be able to launch tens of them at once.
+
+Once you have the configuration of your cluster filled in your `flintrock.yaml` file, you are ready to launch the cluster! Open a console in the `cosr-ops` repository and just like the previous step, do `make docker_shell` to enter the container.
+
+Once in the container, you should configure your AWS credentials, doing something like this:
+
+```
+export AWS_ACCESS_KEY_ID=AKIAxxxxxxxxxxx
+export AWS_SECRET_ACCESS_KEY=yyyyyyyyyyyyyyyyyyyyyyyyyy
+```
+
+Then, let's create the cluster:
+
+```
+make aws_spark_flintrock_create
+```
+
+If this command is successful, it will ultimately log you in to the Spark master server.
+
+The Web UI should also have been launched. Open it at http://[spark_master_hostname]:8080
+
+
+## 5. Launch your index job
+
+You now have a shell on the Spark master. This is very similar to step 3, except that you now have many more CPUs at your fingertips and you are paying for each hour that the cluster spends online. So let's not waste any time!
+
+First, let's open a `screen` on the server, so that we don't lose anything if we are temporarily disconnected from the Internet.
+
+```
+screen -S sparkjob
+```
+
+You can exit this screen with `Ctrl-A D` and go back into it with `screen -x sparkjob`.
+
+Now let's assemble a new `spark-submit` command like we did in step 3!
+
+This time, there are a few flags that will change:
+
+ - We want to use all the machines in the cluster so we have to explicitly reference the master in the Spark options. The internal address of the Spark master appears in the Web UI and should be something like `spark://ip-172-31-40-187:7077`.
+ - We want to index the whole Common Crawl, so there are no more `limit` or `maxdocs` on our document source.
+ - Gzipping the output file could be nice, so we'll add a `gzip=1` option to the hyperlinks plugin.
+ - Storing the `top_wikipedia.txt` file on the local filesystem won't work because we have many different machines! So we need to save it to S3. If you don't have an existing S3 bucket you can use, go ahead and create one named `my-spark-results`.
+
+So this is the type of command we will run inside the `screen`:
+
+```
+spark-submit --verbose --master spark://[spark_master_ip]:7077 jobs/spark/index.py --source commoncrawl --plugin "plugins.filter.Domains:skip=1,domains=tumblr.com wordpress.com" --plugin plugin.hyperlinks.MostExternallyLinkedPages:path=s3a://my-spark-results/top_wikipedia.txt,domain=wikipedia.org,gzip=1 --stop_delay 600
+```
+
+Once the pipeline is finished... congratulations! You just ran some code on billions of web pages :-)
+
+There are a few things you may want to try after this:
+
+ - Try our other plugins and document sources... or create your own!
+ - Index the documents into Elasticsearch, like Common Search does.
From 2922f970ac65b79496dd924dcc3ba03ee21b3f7a Mon Sep 17 00:00:00 2001
From: Sylvain Zimmer
Date: Thu, 11 Aug 2016 22:01:55 +0200
Subject: [PATCH 07/18] Fixes to the tutorial
---
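Note on the quoting of the `plugins.filter.Domains` argument: a minimal Python sketch that assembles the corrected test-run command as an argument list and lets `shlex` decide what needs quoting. The arguments are copied from this patch; building the command from Python is only an illustration, not how the tutorial expects it to be run.

```python
import shlex

# Arguments copied from the corrected test-run command in this patch.
argv = [
    "spark-submit", "--verbose", "spark/jobs/pipeline.py",
    "--source", "commoncrawl:limit=8,maxdocs=1000",
    "--plugin", "plugins.filter.Domains:skip=1,domains=tumblr.com wordpress.com",
    "--plugin", "plugins.hyperlinks.MostExternallyLinkedPages:domain=wikipedia.org,path=out/top_wikipedia/",
    "--stop_delay", "600",
]

# shlex.join quotes only the arguments that need it (here, the one with a space).
command = shlex.join(argv)
print(command)
```

Only the argument containing a space ends up quoted, which matches the "note the quotes!" remark in the tutorial text.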
.../analyzing-the-web-with-spark-on-ec2.md | 36 +++++++++----------
1 file changed, 18 insertions(+), 18 deletions(-)
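Note on the updated sizing numbers (20,000 to 30,000 segments, 6 to 9 instances): a minimal Python sketch that reproduces the arithmetic in the tutorial, assuming perfect scaling across cores and no startup or scheduling overhead. The `instances_needed` helper is hypothetical; the figures (15 minutes per segment per core, 36 CPUs per `c4.8xlarge`, 24-hour deadline) are the ones quoted in the text.

```python
import math


def instances_needed(segments: int, minutes_per_segment: float,
                     deadline_hours: float, cores_per_instance: int) -> int:
    """Return how many instances are needed to finish before the deadline."""
    cpu_hours = segments * minutes_per_segment / 60
    core_hours_per_instance = deadline_hours * cores_per_instance
    return math.ceil(cpu_hours / core_hours_per_instance)


# Previous estimate: 20,000 segments -> 5,000 CPU hours.
print(instances_needed(20000, 15, 24, 36))
# Updated estimate: 30,000 segments -> 7,500 CPU hours.
print(instances_needed(30000, 15, 24, 36))
```

Both the old figure (6 instances) and the new one (9 instances) come out of the same formula, so the change in this patch is consistent.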
diff --git a/content/pages/developer/tutorials/analyzing-the-web-with-spark-on-ec2.md b/content/pages/developer/tutorials/analyzing-the-web-with-spark-on-ec2.md
index 0043e98..67a30a8 100644
--- a/content/pages/developer/tutorials/analyzing-the-web-with-spark-on-ec2.md
+++ b/content/pages/developer/tutorials/analyzing-the-web-with-spark-on-ec2.md
@@ -2,9 +2,9 @@ Title: Tutorial: Analyzing the web with Spark on EC2
Slug: developer/tutorials/analyzing-the-web-with-spark-on-ec2
Template: page_developer
-This tutorial get you through all the steps required to analyze a large number of web pages with Spark on EC2 using our [Backend](/developer/backend).
+This tutorial walks you through all the steps required to analyze a large number of web pages with [Apache Spark](http://spark.apache.org) on EC2 using our [Backend](/developer/backend).
-Common Search has a plugin system that allows developers to build their own processing pipeline. For this example, we will run a plugin that dumps a list of backlinks to a specific domain.
+Common Search has a plugin system that allows developers to build their own processing pipeline. For this tutorial, we will run a plugin that dumps a list of backlinks to a specific domain.
Common Search uses the same pipeline to index the web in Elasticsearch, so you can follow the same steps to do any operation on the document sources we support.
@@ -22,7 +22,7 @@ You can follow the instructions in [cosr-back/INSTALL.md](https://github.com/com
You should view this process as a data pipeline with some document sources as input, and any number of plugins that can perform operations on the documents.
-In this tutorial we will use one document source (Common Crawl) and two plugins (one to filter documents, and one to dump our list of backlinks).
+In this tutorial we will use one document source ([Common Crawl](https://www.commoncrawl.org)) and two plugins (one to filter documents, and one to dump our list of backlinks).
For this example, let's collect all links to Wikipedia pages, except those coming from Blogpost or Tumblr. This is how the pipeline looks:
@@ -44,49 +44,49 @@ This will take you inside a Docker container that already has all the dependenci
Now you need to build the command line for your job. Let's understand how it is structured first:
```
-spark-submit [spark_options] jobs/spark/pipeline.py --source [source_options] --plugin [plugin_options] [other_pipeline_options]
+spark-submit [spark_options] spark/jobs/pipeline.py --source [source_options] --plugin [plugin_options] [other_pipeline_options]
```
-We don't need many Spark options in local, let's just use `--verbose`.
+We don't need many [Spark options](http://spark.apache.org/docs/latest/configuration.html) when running locally, so let's just use `--verbose`.
-We are using Common Crawl as a source, let's limit ourselves to 8 segments of 1000 documents each for this test run. This is done with `--source commoncrawl:limit=8,maxdocs=1000`. This will use the latest available version of Common Crawl.
+We are using [Common Crawl](https://github.com/commonsearch/cosr-back/blob/master/cosrlib/sources/commoncrawl.py) as a source, but let's limit ourselves to 8 segments of 1000 documents each for this test run with `--source commoncrawl:limit=8,maxdocs=1000`. This will use the latest available version of Common Crawl.
-We are using 2 different plugins. The first is `plugins.filter.Domains`: it blacklists some domains we want to skip. You can configure it with `--plugin "plugins.filter.Domains:skip=1,domains=tumblr.com wordpress.com"` (note the quotes! we need them because the plugin argument includes a space).
+We are using 2 different plugins:
-The second plugin is `plugin.hyperlinks.MostExternallyLinkedPages`. It will output a list of pages on a specific domain that have the most backlinks in the document sources we are processing. Let's configure it like this: `--plugin plugin.hyperlinks.MostExternallyLinkedPages:domain=wikipedia.org,path=out/top_wikipedia.txt`.
+ - [plugins.filter.Domains](https://github.com/commonsearch/cosr-back/blob/master/plugins/filter.py): it blacklists some domains we want to skip. You can configure it with `--plugin "plugins.filter.Domains:skip=1,domains=tumblr.com wordpress.com"` (note the quotes! we need them because the plugin argument includes a space).
+ - [plugins.hyperlinks.MostExternallyLinkedPages](https://github.com/commonsearch/cosr-back/blob/master/plugins/hyperlinks.py): it will output a list of pages on a specific domain that have the most backlinks in the document sources we are processing. Let's configure it like this: `--plugin plugins.hyperlinks.MostExternallyLinkedPages:domain=wikipedia.org,path=out/top_wikipedia/`.
Finally, let's add another useful option to our job: `--stop_delay 600`. This will prevent Spark from exiting for 10 minutes when your job is done, so that we have time to open the Spark Web UI and see what happened!
Putting it all together, the command you need to run is:
```
-spark-submit --verbose jobs/spark/pipeline.py --source commoncrawl:limit=8,maxdocs=1000 --plugin "plugins.filter.Domains:skip=1,domains=tumblr.com wordpress.com" --plugin plugin.hyperlinks.MostExternallyLinkedPages:domain=wikipedia.org,path=out/top_wikipedia.txt --stop_delay 600
+spark-submit --verbose spark/jobs/pipeline.py --source commoncrawl:limit=8,maxdocs=1000 --plugin "plugins.filter.Domains:skip=1,domains=tumblr.com wordpress.com" --plugin plugins.hyperlinks.MostExternallyLinkedPages:domain=wikipedia.org,path=out/top_wikipedia/ --stop_delay 600
```
-This is what you should get as output:
+Spark has a very convenient Web UI that lets you debug the jobs that are running. Go ahead and open it in [http://localhost:4040](http://localhost:4040).
-XXXX
+
The Spark Web UI
-Spark has a very convenient Web UI that lets you debug the jobs that are running. Go ahead and open it in [http://localhost:4040](http://localhost:4040).
Once all the jobs are done and you have finished exploring the Spark UI, go back to the console and send a `Ctrl-C` to the command to interrupt the `stop_delay`.
-Now you can open the file `out/top_wikipedia.txt` that should have been created and make sure it is what you want!
+Now you can open the `out/top_wikipedia/part-*.txt` files that should have been created and make sure they contain what you want!
## 4. Create a Spark cluster on EC2
-Now, Common Crawl typically has 20,000+ segments of 50,000+ documents each so processing each document on your local machine will take a while. Let's move to the cloud!
+Now, Common Crawl typically has 30,000+ segments of 50,000+ documents each so processing each document on your local machine will take a while. Let's move to the cloud!
We are going to deploy a Spark cluster on a fleet of [Spot Instances on EC2](https://aws.amazon.com/ec2/spot/). Spot Instances are ideal for this kind of data processing: they cost much less than regular instances and if they are killed during our job, we can just run it again!
To deploy our Spark cluster on EC2 we are using a tool called [Flintrock](https://github.com/nchammas/flintrock). All it requires is a YAML configuration file.
-There is a file in the `cosr-ops` repository called `configs/flintrock.yaml.template`. Rename it to `configs/flintrock.yaml` and change it with the values that match your AWS account.
+There is a file in the [cosr-ops repository](https://github.com/commonsearch/cosr-ops) called [configs/flintrock.yaml.template](https://github.com/commonsearch/cosr-ops/blob/master/configs/flintrock.yaml.template). Rename it to `configs/flintrock.yaml` and change it with the values that match your AWS account.
We should now choose an instance type and a number of machines in our cluster. Our pipeline is usually CPU-bound, so the most important metric will be the number of cores. The [C4 instance family](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/c4-instances.html) has the best CPU performance/cost ratio, and it usually takes around 15 minutes for a C4 core to index a full Common Crawl segment.
-For this example, this means that if we want to process the whole Common Crawl, we will need (15/60) * 20000 = 5000 CPU hours. If we want to run it in 24 hours, we will need 6 `c4.8xlarge` instances with 36 CPUs each. You could also do it in just a couple hours with more instances, but you might need to ask for a raise in EC2 limits to be able to launch tens of them at once.
+For this example, this means that if we want to process the whole Common Crawl, we will need (15/60) * 30000 = 7500 CPU hours. If we want to run it in 24 hours, we will need 9 `c4.8xlarge` instances with 36 CPUs each. You could also do it in just a couple hours with more instances, but you might need to ask for a raise in EC2 limits to be able to launch tens of them at once.
Once you have the configuration of your cluster filled in your `flintrock.yaml` file, you are ready to launch the cluster! Open a console in the `cosr-ops` repository and just like the previous step, do `make docker_shell` to enter the container.
@@ -127,12 +127,12 @@ This time, there are a few flags that will change:
- We want to use all the machines in the cluster so we have to explicitly reference the master in the Spark options. The internal address of the Spark master appears in the Web UI and should be something like `spark://ip-172-31-40-187:7077`.
- We want to index the whole Common Crawl, so there are no more `limit` or `maxdocs` on our document source.
- Gzipping the output file could be nice, so we'll add a `gzip=1` option to the hyperlinks plugin.
- - Storing the `top_wikipedia.txt` file on the local filesystem won't work because we have many different machines! So we need to save it to S3. If you don't have an existing S3 bucket you can use, go ahead and create one named `my-spark-results`.
+ - Storing the output file on the local filesystem won't work because we have many different machines! So we need to save it to S3. If you don't have an existing S3 bucket you can use, go ahead and create one named `my-spark-results`.
So this is the type of command we will run inside the `screen`:
```
-spark-submit --verbose --master spark://[spark_master_ip]:7077 jobs/spark/index.py --source commoncrawl --plugin "plugins.filter.Domains:skip=1,domains=tumblr.com wordpress.com" --plugin plugin.hyperlinks.MostExternallyLinkedPages:path=s3a://my-spark-results/top_wikipedia.txt,domain=wikipedia.org,gzip=1 --stop_delay 600
+spark-submit --verbose --master spark://[spark_master_ip]:7077 spark/jobs/pipeline.py --source commoncrawl --plugin "plugins.filter.Domains:skip=1,domains=tumblr.com wordpress.com" --plugin plugins.hyperlinks.MostExternallyLinkedPages:path=s3a://my-spark-results/top_wikipedia/,domain=wikipedia.org,gzip=1 --stop_delay 600
```
Once the pipeline is finished... congratulations! You just ran some code on billions of web pages :-)
From 7194b713e36b61b8eb64aeb9f50d41d6165b6066 Mon Sep 17 00:00:00 2001
From: Sylvain Zimmer
Date: Thu, 11 Aug 2016 22:04:07 +0200
Subject: [PATCH 08/18] Fix missing image
---
.../tutorials/spark-backlinks-web-ui.png | Bin 0 -> 245403 bytes
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 content/images/developer/tutorials/spark-backlinks-web-ui.png
diff --git a/content/images/developer/tutorials/spark-backlinks-web-ui.png b/content/images/developer/tutorials/spark-backlinks-web-ui.png
new file mode 100644
index 0000000000000000000000000000000000000000..a7f14063815067a7d3709acc3b0585f50869b646
Binary files /dev/null and b/content/images/developer/tutorials/spark-backlinks-web-ui.png differ