diff --git a/README.md b/README.md index 16e2005..cf5a6a2 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,7 @@ # cosr-about +[![Chat with us on Slack](https://slack.commonsearch.org/badge.svg)](https://slack.commonsearch.org) [![Apache License 2.0](https://img.shields.io/github/license/commonsearch/cosr-back.svg)](LICENSE) + This repository contains our presentation website and blog hosted at https://about.commonsearch.org/. You can use the [issues page](https://github.com/commonsearch/cosr-about) to suggest improvements to the content or layout! @@ -11,7 +13,7 @@ Running `cosr-about` on your local machine is very simple. You only need to have Once Docker is launched, just run: ``` -make docker_devserver +[sudo] make docker_devserver ``` Then open http://192.168.99.100:9701/ in your browser. (Replace "192.168.99.100" by the address of your Docker machine. On a Mac, you can get it with `docker-machine ip boot2docker`) @@ -31,4 +33,4 @@ Then open http://localhost:9701/ This website uses [Pelican](http://blog.getpelican.com/) to statically generate a bunch of HTML files from [Markdown](http://commonmark.org/) source. -We use a customized [Bootstrap 3](http://getbootstrap.com/) template. We welcome any contributions to make it prettier! \ No newline at end of file +We use a customized [Bootstrap 3](http://getbootstrap.com/) template. We welcome any contributions to make it prettier! diff --git a/content/blog/2016/07/our-first-public-datasets-host-level-webgraph-and-pagerank.md b/content/blog/2016/07/our-first-public-datasets-host-level-webgraph-and-pagerank.md index ea4f49d..78ab519 100644 --- a/content/blog/2016/07/our-first-public-datasets-host-level-webgraph-and-pagerank.md +++ b/content/blog/2016/07/our-first-public-datasets-host-level-webgraph-and-pagerank.md @@ -63,7 +63,7 @@ We encourage everyone to analyze these datasets and **report their findings publ Here is a non-exhaustive list of interesting areas to explore: - - Spam! 
We have opened an [issue in GitHub]() to track the progress. Join the fight! + - Spam! We have opened an [issue in GitHub](https://github.com/commonsearch/cosr-back/issues/52) to track the progress. Join the fight! - Analyzing correlations between the PageRank dataset and other public domain rankings (Top sites in Alexa, ...) - Using the WebGraph dataset to create other metrics, for instance: [Centrality](https://en.wikipedia.org/wiki/Centrality), [CheiRank](https://en.wikipedia.org/wiki/CheiRank), [HITS](https://en.wikipedia.org/wiki/HITS_algorithm), [SALSA](https://en.wikipedia.org/wiki/SALSA_algorithm), ... - Review our [Python code on GitHub](https://github.com/commonsearch/cosr-back) to look for bugs and speed improvements! diff --git a/content/blog/2016/08/state-of-common-search-august-2016.md b/content/blog/2016/08/state-of-common-search-august-2016.md new file mode 100644 index 0000000..a848969 --- /dev/null +++ b/content/blog/2016/08/state-of-common-search-august-2016.md @@ -0,0 +1,54 @@ +Title: State of Common Search - August 2016 +Slug: state-of-common-search-august-2016 +Date: 2016-08-07 15:42:35 +Author: commonsearch + +There has been a lot happening at [Common Search](https://about.commonsearch.org/) lately: we published our first datasets, opened a Tor hidden service, switched to Slack and started doing automated UI tests! + + + +## First datasets + +Last week, we published our [first two datasets](https://about.commonsearch.org/2016/07/our-first-public-datasets-host-level-webgraph-and-pagerank/): a host-level web graph and a host-level list of PageRanks. + +We plan to release even more datasets in the future to keep our service as transparent as we can, so feel free to test them, analyze the data and suggest improvements! + + +## Tor hidden service + +We use CloudFlare as a CDN, but we explicitly whitelisted Tor users so they should not have any issue connecting to our [UI Demo](https://uidemo.commonsearch.org/). 
+ +However, some users may want to access Common Search directly through Tor for better [privacy](/privacy), which is why we just opened a [Tor hidden service](https://www.torproject.org/docs/hidden-services.html.en) with a [.onion](https://en.wikipedia.org/wiki/.onion) address: + +**[http://comsearchl2zlnre.onion](http://comsearchl2zlnre.onion)** + +Please report any issues connecting to the service! + + +## Slack channel + +Due to low traffic in our IRC channel, we have switched to [Slack](https://slack.commonsearch.org). Though it is closed source, we feel it is a better, pragmatic choice for allowing new contributors to join the community easily. + +To join, just click on the button below: + + + +If you still prefer IRC, Slack has a gateway that you can use with a regular IRC client. Register first with the button above and then follow their [instructions](https://get.slack.help/hc/en-us/articles/201727913-Connecting-to-Slack-over-IRC-and-XMPP). + +## Automated UI tests + +It is important that our [Frontend](/developer/frontend) behaves the same way regardless of the browser people use. To make sure it stays that way, we started doing [automated tests](https://github.com/commonsearch/cosr-front/tree/master/tests) with the [webdriver.io](http://webdriver.io/) project. + +We are using [Sauce Labs](https://saucelabs.com/) to run the tests in many different browsers and operating systems. You can even see our [latest builds](https://saucelabs.com/open_sauce/user/commonsearch) with full in-browser replays. As an example, check out how Common Search behaves on [Windows 7 with IE 10](https://saucelabs.com/beta/tests/1a42c35a7f4d41d59c613b0c60d7ed54/commands)! + + +## Community + +So far we've had **16** contributors, **14** users who submitted issues, **28** who commented on an issue, and **141** who starred [one of our repositories](https://github.com/commonsearch). A **very big thanks** to them all! We hope you too will [join our growing community](/contributing)! 
+ + +## What's next? + +Our #1 goal remains to grow the community with better documentation, easier setup for [new contributors](/contributing) and making sure their first-time experience is top-notch. + +In the next few weeks, we will also update the index of the [UI Demo](https://uidemo.commonsearch.org/) with many more domains and get one step closer to a useful service for everyone! diff --git a/content/images/developer/README b/content/images/developer/README new file mode 100644 index 0000000..5b632e0 --- /dev/null +++ b/content/images/developer/README @@ -0,0 +1 @@ +The SVG diagrams are generated with https://www.draw.io/ \ No newline at end of file diff --git a/content/images/developer/tutorials/spark-backlinks-pipeline.svg b/content/images/developer/tutorials/spark-backlinks-pipeline.svg new file mode 100644 index 0000000..00c159f --- /dev/null +++ b/content/images/developer/tutorials/spark-backlinks-pipeline.svg @@ -0,0 +1,3 @@ + + +



[Diagram: Document Source (Common Crawl: raw HTML + HTTP headers) -> Parsing (transforms raw HTML into Document objects) -> Filter plugin (skips some URLs) -> Backlinks plugin (collects links to a domain) -> Output file (out/top_wikipedia.txt)]
\ No newline at end of file diff --git a/content/images/developer/tutorials/spark-backlinks-web-ui.png b/content/images/developer/tutorials/spark-backlinks-web-ui.png new file mode 100644 index 0000000..a7f1406 Binary files /dev/null and b/content/images/developer/tutorials/spark-backlinks-web-ui.png differ diff --git a/content/pages/contact.md b/content/pages/contact.md index 799f2ed..55d0fbc 100644 --- a/content/pages/contact.md +++ b/content/pages/contact.md @@ -6,9 +6,15 @@ Template: page_contribute You can send us an email to [contact@commonsearch.org](mailto:contact@commonsearch.org) +## Slack + +To chat with us directly you can join our [Slack channel](https://slack.commonsearch.org) by clicking on the button below: + + + ## IRC -To talk with us directly you can join [irc://irc.freenode.net/commonsearch](https://webchat.freenode.net/?channels=#commonsearch) +Slack has an IRC gateway that you can use with a regular IRC client. Register first with the button above and then follow their [instructions](https://get.slack.help/hc/en-us/articles/201727913-Connecting-to-Slack-over-IRC-and-XMPP). ## Press diff --git a/content/pages/developer/tutorials/analyzing-the-web-with-spark-on-ec2.md b/content/pages/developer/tutorials/analyzing-the-web-with-spark-on-ec2.md new file mode 100644 index 0000000..34dba60 --- /dev/null +++ b/content/pages/developer/tutorials/analyzing-the-web-with-spark-on-ec2.md @@ -0,0 +1,243 @@ +Title: Tutorial: Analyzing the web with Spark on EC2 +Slug: developer/tutorials/analyzing-the-web-with-spark-on-ec2 +Template: page_developer + +This tutorial takes you through all the steps required to analyze a large number of web pages with [Apache Spark](https://spark.apache.org) on EC2 using our [Backend](/developer/backend). + +Common Search has a plugin system that allows developers to build their own processing pipeline. For this tutorial, we will run a plugin that takes a domain and dumps a list of its pages with the most backlinks on the web. 
+ +Common Search uses the same pipeline to index the web in Elasticsearch, so you can follow the same steps to do any operation on the document sources we support. + +An [Amazon Web Services](https://aws.amazon.com/) account is needed for this tutorial, though depending on the volume you could run the pipeline entirely locally or on another cloud provider. + + + + +## 1. Install cosr-back and cosr-ops on your local machine + +We have Docker containers ready to use so this step should take you only a few minutes! + +There are 2 sets of instructions to follow: + + - [cosr-back/INSTALL.md](https://github.com/commonsearch/cosr-back/blob/master/INSTALL.md): Python code that analyzes the documents + - [cosr-ops/INSTALL.md](https://github.com/commonsearch/cosr-ops/blob/master/INSTALL.md): Tools to manage operations and infrastructure + + + + +## 2. Understand how document sources and plugins work + +You should view this process as a data pipeline with some document sources as input, and any number of plugins that can perform operations on the documents. + +In this tutorial we will use one document source ([Common Crawl](https://www.commoncrawl.org)) and two plugins (one to filter documents, and one to dump our list of backlinks). + +For this example, let's collect all links to Wikipedia pages, except those coming from WordPress.com or Tumblr. The pipeline looks like this: + +[![Pipeline](/images/developer/tutorials/spark-backlinks-pipeline.svg)](/images/developer/tutorials/spark-backlinks-pipeline.svg) + + + + +## 3. Do a test run on your local machine + +It is very useful to test your pipeline on a few documents on your local machine before scaling up to billions of documents! + +Open a console in the [cosr-back repository](https://github.com/commonsearch/cosr-back) you just installed, and run: + +``` +make docker_shell +``` + +This will take you inside a Docker container that already has all the dependencies you will need. 
+ +Now you need to build the command line for your job. Let's understand how it is structured first: + +``` +spark-submit [spark_options] \ + /cosr/back/spark/jobs/pipeline.py \ + --source [source_options] \ + --plugin [plugin_options] \ + [other_pipeline_options] +``` + +We don't need many [Spark options](http://spark.apache.org/docs/latest/configuration.html) in local, let's just use `--verbose`. + +We are using [Common Crawl](https://github.com/commonsearch/cosr-back/blob/master/cosrlib/sources/commoncrawl.py) as a source, but let's limit ourselves to 8 segments of 1000 documents each for this test run with `--source commoncrawl:limit=8,maxdocs=1000`. This will use the latest available version of Common Crawl. + +We are using 2 different plugins: + + - [plugins.filter.Domains](https://github.com/commonsearch/cosr-back/blob/master/plugins/filter.py): blacklists some domains we want to skip. You can configure it with `--plugin "plugins.filter.Domains:skip=1,domains=tumblr.com wordpress.com"` (note the quotes! we need them because the plugin argument includes a space). + - [plugins.backlinks.MostExternallyLinkedPages](https://github.com/commonsearch/cosr-back/blob/master/plugins/backlinks.py): outputs a list of pages on a specific domain that have the most backlinks in the document sources we are processing. Let's configure it like this: `--plugin plugins.backlinks.MostExternallyLinkedPages:domain=wikipedia.org,output=out/top_wikipedia/`. + +Finally, let's add another useful option to our job: `--stop_delay 600`. This will prevent Spark from exiting for 10 minutes when your job is done, so that we have time to open the Spark Web UI and see what happened! 
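To see why the quotes matter, here is a small Python sketch that assembles the same arguments and lets `shlex.join` (Python 3.8+) do the shell quoting. The `spec` helper is hypothetical: it only mirrors the `name:key=value,key=value` form used in this tutorial and is not part of cosr-back.

```python
import shlex


def spec(name, **options):
    """Format a source or plugin spec as `name:key=value,key=value`.

    Hypothetical helper for illustration only; cosr-back parses these
    strings itself and may accept forms this sketch does not cover.
    """
    if not options:
        return name
    return name + ":" + ",".join(f"{key}={value}" for key, value in options.items())


argv = [
    "spark-submit", "--verbose",
    "/cosr/back/spark/jobs/pipeline.py",
    "--source", spec("commoncrawl", limit=8, maxdocs=1000),
    "--plugin", spec("plugins.filter.Domains", skip=1,
                     domains="tumblr.com wordpress.com"),
    "--plugin", spec("plugins.backlinks.MostExternallyLinkedPages",
                     domain="wikipedia.org", output="out/top_wikipedia/"),
    "--stop_delay", "600",
]

# shlex.join quotes only the arguments that need it, here the one with a space.
command = shlex.join(argv)
print(command)
```

Only the `plugins.filter.Domains` argument comes out wrapped in quotes, because it is the only one containing a space; without them the shell would pass `wordpress.com` as a separate argument.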
+ +Putting it all together, the command you need to run is: + +``` +spark-submit --verbose \ + /cosr/back/spark/jobs/pipeline.py \ + --source commoncrawl:limit=8,maxdocs=1000 \ + --plugin "plugins.filter.Domains:skip=1,domains=tumblr.com wordpress.com" \ + --plugin plugins.backlinks.MostExternallyLinkedPages:domain=wikipedia.org,output=out/top_wikipedia/ \ + --stop_delay 600 +``` + +Spark has a very convenient Web UI that lets you debug the jobs that are running. Go ahead and open it in [http://localhost:4040](http://localhost:4040). + +
Figure: The Spark Web UI
+ + +Once all the jobs are done and you have finished exploring the Spark UI, go back to the console and send a `Ctrl-c` to the command to interrupt the `stop_delay`. + +Now you can open the file `out/top_wikipedia/part-*.txt` that should have been created and make sure it is what you want! + + + + +## 4. Prepare a Spark cluster on EC2 + +Common Crawl typically has 30,000+ segments of 50,000+ documents each so processing each document on your local machine will take a while. Let's move to the cloud! + +We are going to deploy a Spark cluster on a fleet of [Spot Instances on EC2](https://aws.amazon.com/ec2/spot/). Spot Instances are ideal for this kind of data processing: they cost much less than regular instances and if they are killed during our job, we can just run it again! + +To deploy our Spark cluster on EC2 we are using a tool called [Flintrock](https://github.com/nchammas/flintrock). All it requires is a YAML configuration file. + +There is a file in the [cosr-ops repository](https://github.com/commonsearch/cosr-ops) called [configs/flintrock.yaml.template](https://github.com/commonsearch/cosr-ops/blob/master/configs/flintrock.yaml.template). Rename it to `configs/flintrock.yaml` and change it with the values that match your AWS account. + +If you don't have any yet, you will need to create a [Security Group](https://docs.aws.amazon.com/AmazonVPC/latest/GettingStartedGuide/getting-started-create-security-group.html) that allows at least ports 22, 4040 and 8080 from the outside, a [Placement Group](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html) and a [Key Pair](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#having-ec2-create-your-key-pair). Additionally, you should also use your default [VPC](https://docs.aws.amazon.com/AmazonVPC/latest/GettingStartedGuide/getting-started-create-vpc.html) and [Subnets](https://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Subnets.html#AddaSubnet), or create new ones. 
+ +Make sure to create a new Internet Gateway as well and attach it to your recently created VPC. Finally, make sure to configure your VPC route table to [allow traffic to/from the internet](https://aws.amazon.com/premiumsupport/knowledge-center/ec2-linux-ssh-troubleshooting/): you'll need to create a route that points to your Internet Gateway ID. + +We recommend saving your `.pem` key file in the `cosr-ops/configs/` directory so that it is visible from the Docker container (don't worry, it will be ignored by git). Remember to run `chmod 400 /cosr/ops/configs/*.pem` to avoid SSH errors. + +We should now choose an instance type and a number of machines in our cluster. Our pipeline is usually CPU-bound, so the most important metric will be the number of cores. The [C4 instance family](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/c4-instances.html) has the best CPU performance/cost ratio, and it usually takes around 15 minutes for a C4 core to index a full Common Crawl segment. + +So for this example, if we want to process the whole Common Crawl, we will need (15/60) * 30000 = 7500 CPU hours. If we want to run it in 24 hours, we will need 7500 / 24 ≈ 313 cores, which means 9 `c4.8xlarge` instances with 36 CPUs each. You could also do it in just a couple of hours with more instances, but you might need to ask for an increase in your [EC2 limits](https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html) to be able to launch tens of them at once. + +Note: it might be safer to start testing the job with only 2 instances and make sure everything goes well over ~1% of the data (`--source commoncrawl:limit=300`). + +The last step is to create a configuration file for `cosr-back`. There is a file in the [cosr-ops repository](https://github.com/commonsearch/cosr-ops) called [configs/cosr-back.json.template](https://github.com/commonsearch/cosr-ops/blob/master/configs/cosr-back.json.template). 
You must rename it to `configs/cosr-back.json` but there should be no changes to perform, mainly because we are not using Elasticsearch in this tutorial. + + + + +## 5. Launch your Spark cluster + +Once the configuration files are ready, you can launch the cluster! Open a console in the [cosr-ops repository](https://github.com/commonsearch/cosr-ops) and, just like in step 3, run this to enter the container: + +``` +make docker_shell +``` + +Once in the container, you should configure your AWS credentials, doing something like this: + +``` +export AWS_ACCESS_KEY_ID=AKIAxxxxxxxxxxx +export AWS_SECRET_ACCESS_KEY=yyyyyyyyyyyyyyyyyyyyyyyyyy +``` + +Then, let's create the cluster: + +``` +make aws_spark_flintrock_create +``` + +If this command is successful, it will ultimately log you into the Spark master server. + +The Web UI should also have been launched. Open it at http://[spark_master_hostname]:8080 + +If you want to log back into the Spark master, you can do `make aws_spark_flintrock_shell`. Other useful commands can be found in the [cosr-ops Makefile](https://github.com/commonsearch/cosr-ops/blob/master/Makefile). + + + + +## 6. Launch your index job + +You now have a shell on the Spark master. This is very similar to step 3 except now you have many more CPUs at your fingertips and you are paying for each hour that the cluster spends online. So let's not waste any time! + +First, let's open a [screen](https://kb.iu.edu/d/acuy) on the server, so that you don't lose anything if you are temporarily disconnected from the Internet. + +``` +screen -S sparkjob +``` + +You can exit this screen with the `Ctrl-a d` keys and go back into it with `screen -x sparkjob`. + +If you are planning to save files to S3, you should also export your AWS credentials like you did on your local machine: + +``` +export AWS_ACCESS_KEY_ID=AKIAxxxxxxxxxxx +export AWS_SECRET_ACCESS_KEY=yyyyyyyyyyyyyyyyyyyyyyyyyy +``` + +Now let's assemble a new `spark-submit` command like we did in step 3! 
+ +This time, there are a few flags that will change: + + - We want to use all the machines in the cluster so we have to explicitly reference the master in the Spark options. The internal address of the Spark master appears in the Web UI and should be something like `spark://ip-172-31-40-187:7077`. + - We want to index the whole Common Crawl, so there are no more `limit` or `maxdocs` on our document source. + - Gzipping the output file could be nice, so we'll add a `gzip=1` option to the backlinks plugin. + - Storing the output file on the local filesystem won't work because we have many different machines! So we need to save it to S3. If you don't have an existing S3 bucket you can use, go ahead and [create one](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html) named `my-spark-results`. + - With large EC2 instances, you should tell Spark to use most of the available RAM. For `c4.8xlarge` instances, we recommend adding the Spark arguments `--driver-memory 50G --executor-memory 50G`. + +So this is the type of command we will run inside the `screen`: + +``` +spark-submit --verbose --master spark://[spark_master_ip]:7077 --driver-memory 50G --executor-memory 50G --properties-file /cosr/back/spark/conf/spark-defaults.conf \ + /cosr/back/spark/jobs/pipeline.py \ + --source commoncrawl \ + --plugin "plugins.filter.Domains:skip=1,domains=tumblr.com wordpress.com" \ + --plugin plugins.backlinks.MostExternallyLinkedPages:output=s3a://my-spark-results/top_wikipedia/,domain=wikipedia.org,gzip=1 +``` + + + + +## 7. Optional: Save intermediate results to S3 + +In the previous step, we are parsing data straight from Common Crawl and feeding it directly to the backlinks plugin. This can be suboptimal for 2 reasons: + + - If you want to run more than one plugin, Spark might go over all of Common Crawl again, wasting time and resources. + - If the plugin fails or if you need to tweak the options, you will need to start from scratch again. 
+ +There is a convenient way of separating the parsing from the plugin execution: saving intermediate parsing results to S3! + +This can be done by adding the `dump.DocumentMetadata` plugin to our pipeline: + +``` +spark-submit --verbose --master spark://[spark_master_ip]:7077 --driver-memory 50G --executor-memory 50G --properties-file /cosr/back/spark/conf/spark-defaults.conf \ + /cosr/back/spark/jobs/pipeline.py \ + --source commoncrawl \ + --plugin "plugins.filter.Domains:skip=1,domains=tumblr.com wordpress.com" \ + --plugin plugins.dump.DocumentMetadata:output=s3a://my-spark-results/intermediate-metadata/,coalesce=6000,abort=1 \ + --plugin plugins.backlinks.MostExternallyLinkedPages +``` + +Note the `abort=1` option for the dump plugin: this will interrupt the pipeline before starting to aggregate the backlinks. However, we still need to include the backlinks plugin in the pipeline so that its additional fields are included in the intermediate metadata. + +There is another important option: `coalesce=6000`. Common Crawl usually has ~30,000 files, and you want to go through them in parallel over all the EC2 cores you started. So you need to specify how many intermediate files you want with this `coalesce` parameter. Using 6000 means that each core will read ~5 Common Crawl segments, write an intermediate file, and continue to the next task. Without this option, only one core would read all Common Crawl segments and output a single, huge and impractical intermediate dump file. + +After the pipeline has finished, you can check the folder `intermediate-metadata` in your S3 bucket. It will contain the list of outgoing links of each page, in [Apache Parquet](https://parquet.apache.org/) format. 
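The `coalesce` arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the ~30,000 segments and `coalesce=6000` come from this step, the 9 x 36 cores come from the sizing example in step 4, and the even spread of tasks over cores is an assumption rather than a description of how Spark actually schedules work.

```python
import math

TOTAL_SEGMENTS = 30_000   # approximate number of Common Crawl segments
COALESCE = 6_000          # value passed as coalesce=6000 to the dump plugin
CORES = 9 * 36            # assumed cluster: 9 c4.8xlarge instances, 36 CPUs each

# Each intermediate file is produced by one task that reads several segments.
segments_per_file = math.ceil(TOTAL_SEGMENTS / COALESCE)

# Assuming tasks are spread evenly, each core works through this many of them.
tasks_per_core = math.ceil(COALESCE / CORES)

print(f"{segments_per_file} segments per intermediate file")
print(f"about {tasks_per_core} tasks per core")
```

A much smaller `coalesce` value would leave cores idle, while a much larger one would produce many tiny files, so a value of the same order as the one above looks like a reasonable starting point for a cluster of this size.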
+ +Now, we can run the pipeline again with the `metadata` source pointed at the intermediate data: + +``` +spark-submit --verbose --master spark://[spark_master_ip]:7077 --driver-memory 50G --executor-memory 50G --properties-file /cosr/back/spark/conf/spark-defaults.conf \ + /cosr/back/spark/jobs/pipeline.py \ + --source metadata:path=s3a://my-spark-results/intermediate-metadata/ \ + --plugin plugins.backlinks.MostExternallyLinkedPages:output=s3a://my-spark-results/top_wikipedia/,domain=wikipedia.org,gzip=1 +``` + +This should be pretty fast because the intermediate data is much smaller than the original source. You can run more pipelines and plugins based on the same data before discarding the intermediate metadata! + + +## 8. What's next + +Once the pipeline is finished... congratulations! You just ran some code on billions of web pages :-) + +Don't forget to terminate the EC2 instances with `make aws_spark_flintrock_destroy`. + +There are a few things you may want to try after this: + + - Try our other [plugins](https://github.com/commonsearch/cosr-back/tree/master/plugins) and [document sources](https://github.com/commonsearch/cosr-back/tree/master/cosrlib/sources)... or create your own! + - Index the documents into Elasticsearch! Stay tuned for a tutorial on this topic... diff --git a/content/pages/developer/tutorials/running-pagerank-on-the-web.md b/content/pages/developer/tutorials/running-pagerank-on-the-web.md new file mode 100644 index 0000000..b7126ea --- /dev/null +++ b/content/pages/developer/tutorials/running-pagerank-on-the-web.md @@ -0,0 +1,62 @@ +Title: Tutorial: Running PageRank on the Web +Slug: developer/tutorials/running-pagerank-on-the-web +Template: page_developer + +This tutorial takes you through all the steps required to run PageRank on billions of pages using Common Search's codebase and tools such as Apache Spark and AWS. + + + +## 1. 
Prerequisites + +You should go through our [Analyzing the web with Spark on EC2](/developer/tutorials/analyzing-the-web-with-spark-on-ec2) tutorial first, to install the required software, understand the basic concepts of our pipeline, and run a simpler job, at least on your local machine. + +You should also be familiar with basic [Graph theory](https://en.wikipedia.org/wiki/Graph_theory). + + + +## 2. Dumping the Web Graph + +Before computing PageRank, we need to parse all the links in our corpus and save them as a directed graph. + +(In some cases, you can actually skip this step by using one of the [dumps we publish](https://about.commonsearch.org/2016/07/our-first-public-datasets-host-level-webgraph-and-pagerank) directly.) + +To dump the web graph, we are going to use the `webgraph` plugin. Here is how you would dump it for the first 400 URLs from Common Crawl, at the host level: + +``` +spark-submit --verbose \ + /cosr/back/spark/jobs/pipeline.py \ + --source commoncrawl:limit=4,maxdocs=100 \ + --plugin plugins.webgraph.DomainToDomainParquet:output=out/webgraph/ \ + --stop_delay 600 +``` + +This will actually create 2 subdirectories in `out/webgraph/`: one for the vertices and one for the edges. Both dumps will be stored in Apache Parquet format, so that we can easily reuse them in the next step. + +You might notice this command will go over the source documents multiple times. This shouldn't be a big issue with so few documents, but at scale you will definitely want to use an intermediate dump as explained in the [Analyzing the web with Spark on EC2](/developer/tutorials/analyzing-the-web-with-spark-on-ec2) tutorial. + + +## 3. Computing PageRank + +We are now ready to run the iterative PageRank algorithm over our web graph dump. 
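As a rough illustration of what "iterative" means here, the following is a toy, single-machine PageRank in plain Python. It is a sketch under stated assumptions, not the code in `pagerank.py`: the damping factor of 0.85 and the non-normalized update rule are assumptions, picked because they give vertices without inbound links the PR=0.15 mentioned in the option list below. The `maxiter` and `tol` names are borrowed from the job's flags, but their exact meaning in the real job is not documented here, and the host names are made up.

```python
def pagerank(vertices, edges, damping=0.85, maxiter=20, tol=0.001):
    """Toy PageRank: PR(v) = (1 - d) + d * sum(PR(u) / outdegree(u)).

    Assumed formulation for illustration; stops after `maxiter` iterations
    or when no rank changes by more than `tol`.
    """
    out_degree = {v: 0 for v in vertices}
    for src, _dst in edges:
        out_degree[src] += 1

    ranks = {v: 1.0 for v in vertices}
    iteration = 0
    for iteration in range(1, maxiter + 1):
        # Sum the rank each vertex receives from the vertices linking to it.
        incoming = {v: 0.0 for v in vertices}
        for src, dst in edges:
            incoming[dst] += ranks[src] / out_degree[src]

        new_ranks = {v: (1 - damping) + damping * incoming[v] for v in vertices}
        delta = max(abs(new_ranks[v] - ranks[v]) for v in vertices)
        ranks = new_ranks
        if delta < tol:
            break
    return ranks, iteration


# Hypothetical host-level graph: three hosts in a cycle plus one "orphan"
# host that links out but receives no inbound link.
vertices = ["a.example", "b.example", "c.example", "orphan.example"]
edges = [
    ("a.example", "b.example"),
    ("b.example", "c.example"),
    ("c.example", "a.example"),
    ("orphan.example", "a.example"),
]

ranks, iterations = pagerank(vertices, edges)
for host in vertices:
    print(host, round(ranks[host], 4))
```

The orphan host keeps a rank of 1 - 0.85 = 0.15 because nothing ever flows into it. At scale, the same kind of iteration has to run over the Parquet web graph dump with Spark.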
+ +This is not done through our usual `pipeline.py` Spark job but with a dedicated one, like this: + +``` +spark-submit --verbose \ + /cosr/back/spark/jobs/pagerank.py \ + --webgraph out/webgraph/ \ + --output out/pagerank/ \ + --tmpdir /tmp/spark-pr/ \ + --maxiter 20 --tol 0.001 --precision 0.000001 \ + --stats 5 \ + --include_orphans \ + --stop_delay 600 +``` + +Let's review these new options: + + - `--output` specifies the final output directory for the list of PageRanks + - `--tmpdir` specifies a directory (which may also be on S3) that will be used to store intermediate results every 5 iterations of the PageRank algorithm, for performance and lower memory requirements. + - `--maxiter 20 --tol 0.001 --precision 0.000001` are parameters for the PageRank convergence. (TODO explain them better) + - `--stats 5` will print statistics on the algorithm every 5 iterations + - `--include_orphans` will keep vertices without any inbound link in the graph (they should all have PR=0.15) diff --git a/content/pages/privacy.md b/content/pages/privacy.md index 6b97e89..9d9535c 100644 --- a/content/pages/privacy.md +++ b/content/pages/privacy.md @@ -22,7 +22,8 @@ The only information we (and our CDN) receive for each search are: We currently do not store any of this, though we will probably start logging aggregate query volume (without IP addresses) to understand usage and provide autocompletion. -We will also open a [Tor hidden service](https://en.wikipedia.org/wiki/Tor_(anonymity_network)#Hidden_services) so that our users can safely mask their IP address. See [cosr-ops#8](https://github.com/commonsearch/cosr-ops/issues/8). 
+We also have a [Tor hidden service](https://en.wikipedia.org/wiki/Tor_(anonymity_network)#Hidden_services) so that our users can safely mask their IP address: [http://comsearchl2zlnre.onion](http://comsearchl2zlnre.onion) + ## Why you can trust us diff --git a/theme/static/css/style.css b/theme/static/css/style.css index e8c7fb6..17db61b 100644 --- a/theme/static/css/style.css +++ b/theme/static/css/style.css @@ -103,6 +103,18 @@ a, a:hover { margin:10px; } +.banner-height-75 h1 { + line-height:75px; + margin:0; + padding:0; +} + +.banner-height-75 a { + display:inline-block; + line-height:35px; +} + + body.with-banner section.body { padding-top:20px; } @@ -245,3 +257,10 @@ body.with-banner section.body { padding-right:5px; } } + +@media (max-width: 1199px) { + + .banner-height-75 h1 { + font-size:32px; + } +} diff --git a/theme/templates/base.html b/theme/templates/base.html index 0582a55..d59c17d 100644 --- a/theme/templates/base.html +++ b/theme/templates/base.html @@ -86,7 +86,7 @@ {% endif %} } -