@@ -106,6 +106,8 @@ There is a file in the [cosr-ops repository](https://github.com/commonsearch/cos

If you don't have them yet, you will need to create a [Security Group](https://docs.aws.amazon.com/AmazonVPC/latest/GettingStartedGuide/getting-started-create-security-group.html) that allows inbound traffic from the outside on at least ports 22, 4040 and 8080, a [Placement Group](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html) and a [Key Pair](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#having-ec2-create-your-key-pair). You should also use your default [VPC](https://docs.aws.amazon.com/AmazonVPC/latest/GettingStartedGuide/getting-started-create-vpc.html) and [Subnets](https://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Subnets.html#AddaSubnet), or create new ones.

If you don't have one yet, create a new Internet Gateway as well and attach it to your newly created VPC. Then configure your VPC route table to [allow traffic to/from the internet](https://aws.amazon.com/premiumsupport/knowledge-center/ec2-linux-ssh-troubleshooting/): you'll need to create a route that points to your Internet Gateway ID.
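The gateway steps above can also be scripted. The following is a minimal sketch, assuming you use the `boto3` EC2 client rather than the AWS console; the helper name `open_vpc_to_internet` and the IDs in the usage comment are hypothetical placeholders, not part of cosr-ops.

```python
def open_vpc_to_internet(ec2, vpc_id, route_table_id):
    """Create an Internet Gateway, attach it to a VPC and add a default route.

    `ec2` is assumed to be a boto3 EC2 client (or any object exposing the
    same three methods). This helper is illustrative, not part of cosr-ops.
    """
    # Create the gateway and read back the ID that AWS assigned to it.
    gateway = ec2.create_internet_gateway()
    gateway_id = gateway["InternetGateway"]["InternetGatewayId"]

    # Attach the new gateway to the VPC.
    ec2.attach_internet_gateway(InternetGatewayId=gateway_id, VpcId=vpc_id)

    # Send all non-local IPv4 traffic through the gateway.
    ec2.create_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock="0.0.0.0/0",
        GatewayId=gateway_id,
    )
    return gateway_id


# Usage (requires boto3 and AWS credentials; the IDs are placeholders):
#   import boto3
#   ec2 = boto3.client("ec2")
#   open_vpc_to_internet(ec2, "vpc-0123456789abcdef0", "rtb-0123456789abcdef0")
```

Passing the client in as an argument keeps the helper easy to try against a stub before pointing it at a real account.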

We recommend saving your `.pem` key file in the `cosr-ops/configs/` directory so that it is visible from the Docker container (don't worry, it will be ignored by git). Remember to run `chmod 400 /cosr/ops/configs/*.pem` to avoid SSH errors.

We should now choose an instance type and the number of machines for our cluster. Our pipeline is usually CPU-bound, so the most important metric will be the number of cores. The [C4 instance family](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/c4-instances.html) has the best CPU performance/cost ratio, and it usually takes around 15 minutes for a C4 core to index a full Common Crawl segment.
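To get a feel for cluster sizing, here is a rough back-of-the-envelope sketch based on the 15-minutes-per-segment figure above. It assumes perfect parallelism, one segment per core at a time, and no startup or scheduling overhead, so treat the result as a lower bound; the function name and the example numbers are hypothetical.

```python
import math

# Figure quoted in the text: one C4 core indexes one segment in about 15 minutes.
MINUTES_PER_SEGMENT_PER_CORE = 15


def estimate_indexing_minutes(num_segments, num_machines, cores_per_machine):
    """Estimate wall-clock minutes to index `num_segments` segments.

    Assumes each core processes one segment at a time and ignores all
    overhead, so real runs will take longer.
    """
    total_cores = num_machines * cores_per_machine
    if total_cores <= 0:
        raise ValueError("the cluster needs at least one core")

    # Segments are processed in "waves" of one segment per core.
    waves = math.ceil(num_segments / total_cores)
    return waves * MINUTES_PER_SEGMENT_PER_CORE


# Example: 100 segments on 4 machines with 8 cores each (32 cores total)
# needs ceil(100 / 32) = 4 waves, i.e. about 60 minutes.
print(estimate_indexing_minutes(100, 4, 8))
```

Because the pipeline is CPU-bound, adding cores reduces the number of waves until there are at least as many cores as segments; beyond that point, extra cores sit idle under this model.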