From a8d122c8e2823b7491ffbd038fb906fc4702fbff Mon Sep 17 00:00:00 2001
From: Bakz
Date: Fri, 9 Dec 2016 16:18:47 -0500
Subject: [PATCH] updated tutorial to include additional steps like creating internet gateway and creating a route in vpc route table

---
 .../developer/tutorials/analyzing-the-web-with-spark-on-ec2.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/content/pages/developer/tutorials/analyzing-the-web-with-spark-on-ec2.md b/content/pages/developer/tutorials/analyzing-the-web-with-spark-on-ec2.md
index 9f8824c..ce84643 100644
--- a/content/pages/developer/tutorials/analyzing-the-web-with-spark-on-ec2.md
+++ b/content/pages/developer/tutorials/analyzing-the-web-with-spark-on-ec2.md
@@ -106,6 +106,8 @@ There is a file in the [cosr-ops repository](https://github.com/commonsearch/cos
 
 If you don't have any yet, you will need to create a [Security Group](https://docs.aws.amazon.com/AmazonVPC/latest/GettingStartedGuide/getting-started-create-security-group.html) that allows at least ports 22, 4040 and 8080 from the outside, a [Placement Group](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html) and a [Key Pair](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#having-ec2-create-your-key-pair). Additionally, you should also use your default [VPC](https://docs.aws.amazon.com/AmazonVPC/latest/GettingStartedGuide/getting-started-create-vpc.html) and [Subnets](https://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Subnets.html#AddaSubnet), or create new ones.
 
+If you don't have one yet, create a new Internet Gateway as well and attach it to the VPC you just created. Afterwards, configure your VPC route table to [allow traffic to/from the internet](https://aws.amazon.com/premiumsupport/knowledge-center/ec2-linux-ssh-troubleshooting/): you'll need to create a route that points to your Internet Gateway ID.
+
 We recommend saving your `.pem` key file in the `cosr-ops/configs/` directory so that it is visible from the Docker container (don't worry, it will be ignored by git). Remember to run `chmod 400 /cosr/ops/configs/*.pem` to avoid SSH errors.
 
 We should now choose an instance type and a number of machines in our cluster. Our pipeline is usually CPU-bound, so the most important metric will be the number of cores. The [C4 instance family](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/c4-instances.html) has the best CPU performance/cost ratio, and it usually takes around 15 minutes for a C4 core to index a full Common Crawl segment.
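
Note on the added paragraph: the Internet Gateway and route steps it describes could also be scripted rather than clicked through in the AWS console. The following is only a minimal sketch, not part of the patch or of cosr-ops. It assumes a boto3-style EC2 client is passed in; the helper name `open_vpc_to_internet` and the IDs in the usage comment are hypothetical, and `0.0.0.0/0` is assumed to be the destination you want for the default route.

```python
def open_vpc_to_internet(ec2, vpc_id, route_table_id):
    """Create an Internet Gateway, attach it to a VPC, and add a default route.

    `ec2` is assumed to behave like a boto3 EC2 client, for example the
    object returned by `boto3.client("ec2")`.
    """
    # Create the gateway and read its ID from the response.
    response = ec2.create_internet_gateway()
    igw_id = response["InternetGateway"]["InternetGatewayId"]

    # Attach the new gateway to the VPC.
    ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)

    # Send all non-local IPv4 traffic through the gateway.
    ec2.create_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock="0.0.0.0/0",
        GatewayId=igw_id,
    )
    return igw_id


# Example usage (requires the third-party boto3 package and configured AWS
# credentials; the IDs below are placeholders, not real resources):
#
#   import boto3
#   client = boto3.client("ec2")
#   open_vpc_to_internet(client, "vpc-0123456789abcdef0", "rtb-0123456789abcdef0")
```

Passing the client in as an argument keeps the function free of credentials and region settings, so it can be tried against a stub before touching a real account.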